When distilling a teacher model into a student model, the student learns faster when trained to reproduce the teacher's full next-token distribution (soft targets) than when trained only to reproduce a single token sampled from that distribution (hard targets).
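To make the contrast concrete, here is a minimal sketch (PyTorch assumed; function names are illustrative) of the two loss formulations — KL divergence against the teacher's full distribution versus cross-entropy against one sampled token:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Train on the teacher's full next-token distribution (soft targets)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t^2 factor follows Hinton et al. (2015).
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

def hard_target_loss(student_logits, teacher_logits):
    """Train only on a single token sampled from the teacher's distribution."""
    sampled = torch.multinomial(F.softmax(teacher_logits, dim=-1), num_samples=1)
    return F.cross_entropy(student_logits, sampled.squeeze(-1))
```

The soft-target loss carries a full vocabulary's worth of information per position, while the hard-target loss carries at most log2(vocab) bits, which is one common explanation for the speedup.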
Is there a similar effect where transferring second-order information as well — i.e. matching the teacher's Hessian — speeds up knowledge distillation even further?
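One way to make the Hessian-transfer idea concrete: for a softmax output, the Hessian of the loss with respect to the logits has the closed form diag(p) - p pᵀ, so a candidate extra loss term matches these curvature matrices between teacher and student. A hedged sketch (PyTorch assumed; function names are mine; note the O(vocab²) cost makes this practical only for small vocabularies, or via random Hessian-vector probes):

```python
import torch
import torch.nn.functional as F

def logit_hessian(logits):
    """Closed-form Hessian of softmax cross-entropy w.r.t. the logits:
    diag(p) - p p^T, shape (batch, vocab, vocab)."""
    p = F.softmax(logits, dim=-1)
    return torch.diag_embed(p) - p.unsqueeze(-1) * p.unsqueeze(-2)

def hessian_match_loss(student_logits, teacher_logits):
    """Frobenius-norm penalty between teacher and student logit Hessians."""
    h_s = logit_hessian(student_logits)
    h_t = logit_hessian(teacher_logits).detach()
    return (h_s - h_t).pow(2).sum(dim=(-2, -1)).mean()
```

Whether this actually accelerates distillation in practice is exactly the open question being asked; the sketch only shows what such a loss term could look like.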
Suppose one has a candidate alternative model architecture. How can one estimate the amount of compute needed to distill the teacher's knowledge into a student built on that architecture?
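There is no exact formula, but a common first-order estimate uses the standard ~2·N·D FLOPs-per-token approximation for inference and ~6·N·D for training on dense transformers. A back-of-envelope sketch (numbers and names illustrative; the hard unknown is the number of distillation tokens a new architecture needs, which is usually found empirically from scaling-law fits rather than computed):

```python
def distillation_flops(student_params, teacher_params, distill_tokens):
    """Back-of-envelope FLOP estimate for a distillation run, using the
    ~2*N*D inference and ~6*N*D training approximations for dense
    transformers. For a non-transformer candidate architecture, swap in
    that architecture's own per-token cost model."""
    teacher_forward = 2 * teacher_params * distill_tokens  # soft-target generation
    student_train = 6 * student_params * distill_tokens    # fwd + bwd on the student
    return teacher_forward + student_train

# Example: distilling a 70B-parameter teacher into a 7B student over 1T tokens
total = distillation_flops(7e9, 70e9, 1e12)
print(f"{total:.2e} FLOPs")  # -> 1.82e+23 FLOPs
```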
Consider for example the following model: each token (or character, or bit) corresponds to a matrix (or a multivector), and a sequence of tokens corresponds to the matrix product (geometric product) of those matrices in the same order. The relative (unnormalized) likelihood of a sequence is taken as exp(-||Product(M_i)||), where ||A|| is the positive-definite squared norm of the matrix or multivector A (essentially the sum of the squares of its components); the partition function is the sum of these weights.
To get P(nextToken | productOfPreviousTokens), you take the ratio P(productOfPreviousTokens * nextToken) / P(productOfPreviousTokens); for this to be a proper conditional distribution, the continuation weights must additionally be normalized over all candidate next tokens.
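A minimal NumPy sketch of the model just described, with random (untrained) token matrices standing in for learned ones; the vocabulary, dimension, and function names are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = list("abcd")
DIM = 4
# One matrix per token; in a trained model these would be learned parameters.
token_mats = {t: rng.normal(scale=0.5, size=(DIM, DIM)) for t in VOCAB}

def sequence_weight(tokens):
    """Unnormalized weight exp(-||M_1 M_2 ... M_n||), where ||A|| is the
    sum of squared entries (squared Frobenius norm), as defined above."""
    prod = np.eye(DIM)
    for t in tokens:
        prod = prod @ token_mats[t]
    return np.exp(-np.sum(prod ** 2))

def next_token_distribution(prefix):
    """P(next | prefix): weight each candidate continuation, then normalize
    over the vocabulary so the conditional sums to 1."""
    weights = np.array([sequence_weight(prefix + [t]) for t in VOCAB])
    return dict(zip(VOCAB, weights / weights.sum()))

print(next_token_distribution(list("ab")))
```

Note that only the running product of the prefix needs to be stored, so sampling is O(vocab) matrix multiplications per position.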
How does one calculate the expected number of forward inferences of the teacher network, and the corresponding number of gradient-descent steps on the student network, before the student's performance plateaus, given their parameter counts? And how does this expected number of required forward inferences scale with or without a Hessian loss term in the distillation objective?
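As far as I know there is no closed-form answer from parameter counts alone; the practical approach is to fit an empirical scaling curve to early training and extrapolate. A hedged sketch (SciPy assumed; the power-law form L(n) = a·n^(-b) + c is a common empirical choice, not a theorem, and the effect of a Hessian term would show up as changed fitted constants a and b):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """L(n) = a * n^(-b) + c: an empirical form for distillation loss
    curves; c is the plateau level."""
    return a * np.power(n, -b) + c

def steps_to_plateau(steps, losses, epsilon=0.01):
    """Fit the curve on measurements so far, then extrapolate the number of
    student gradient steps until the loss is within epsilon of the fitted
    plateau c. Each step consumes one teacher forward pass per training
    example, so the same count bounds teacher inference."""
    (a, b, c), _ = curve_fit(power_law, steps, losses,
                             p0=(1.0, 0.5, 0.1), maxfev=10_000)
    return (a / epsilon) ** (1.0 / b), c

# Example with synthetic measurements (stand-ins for a real training log):
steps = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
losses = power_law(steps, 5.0, 0.4, 0.8) \
    + np.random.default_rng(1).normal(0, 0.005, 5)
n_star, plateau = steps_to_plateau(steps, losses)
print(f"~{n_star:.2e} steps to get within 0.01 of plateau {plateau:.3f}")
```

Comparing the fitted exponent b with and without the Hessian loss term on small pilot runs would be one concrete way to answer the scaling question empirically.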
Ah yes, the classic downvote for asking questions?