Back in 2018 I published pytorch-hessian-eigenthings, a niche open source package for GPU-accelerated curvature analysis of PyTorch models. Loss landscape curvature metrics like the eigenvalues of the Hessian have been implicated in many hypotheses about neural network generalization (flat minima, low-rank Hessian structure, etc.). But the full Hessian costs memory quadratic in the parameter count, which is usually infeasible. This library uses Hessian-vector products plus iterative methods (Lanczos, power iteration) to estimate the top eigenvalues and eigenvectors in linear memory instead. I stepped away from the project for years, but it ended up being used by other researchers doing curvature analysis. I noticed the original implementation had aged, so I thought I'd revisit it; I also have more professional engineering experience under my belt now to inform the design.
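(For the curious, the core trick behind the linear-memory claim is that you never need the Hessian itself, only its action on vectors, and double backprop gives you that in one extra backward pass. Here's a minimal sketch using plain torch.autograd, not the library's own API:)

```python
import torch

def hvp(loss_fn, params, v):
    """Hessian-vector product via double backprop (Pearlmutter's trick):
    differentiate (grad(loss) . v) w.r.t. the parameters. The Hessian is
    never formed, so memory stays linear in the parameter count."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ v, params)
    return torch.cat([h.reshape(-1) for h in hv])

def top_hessian_eigenpair(loss_fn, params, iters=100, tol=1e-6):
    """Power iteration on the Hessian using only HVPs."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v /= v.norm()
    eigval = 0.0
    for _ in range(iters):
        hv_ = hvp(loss_fn, params, v)
        new_eigval = torch.dot(v, hv_).item()  # Rayleigh quotient estimate
        v = hv_ / (hv_.norm() + 1e-12)
        if abs(new_eigval - eigval) < tol * max(abs(eigval), 1e-6):
            break
        eigval = new_eigval
    return eigval, v

# Tiny usage example on a linear model:
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
params = list(model.parameters())
loss_fn = lambda: torch.nn.functional.mse_loss(model(x), y)
print(top_hessian_eigenpair(loss_fn, params)[0])
```

Lanczos replaces the power-iteration loop to recover several eigenpairs at once, but the memory story is the same: a handful of parameter-sized vectors.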
I just shipped a v1.0 rewrite. The new version adds new curvature operators (Generalized Gauss-Newton, empirical Fisher) and new algorithms (Hutchinson and Hutch++ trace estimation, spectral density via Stochastic Lanczos Quadrature). It also has a fused Triton/torch.compile cross-entropy Hessian-vector kernel for foundation-model-scale vocabularies (where standard implementations blow up). More importantly, it adds a lot of numerical analysis validating the operators: closed-form correctness checks on linear/logistic regression where the Hessian is known analytically, and cross-library tests against curvlinops to catch regressions.
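To give a flavor of what the trace estimators do (again a sketch, not the package's interface): plain Hutchinson only needs a matrix-vector-product callable, so it composes directly with any curvature operator that exposes one.

```python
import torch

def hutchinson_trace(matvec, dim, num_samples=64, device="cpu"):
    """Plain Hutchinson estimator: for Rademacher probes v (entries +/-1),
    E[v^T A v] = tr(A), so averaging quadratic forms estimates the trace
    without ever materializing A. Hutch++ reduces the variance by first
    deflating a low-rank sketch of A, but the interface is the same."""
    estimates = []
    for _ in range(num_samples):
        v = (torch.randint(0, 2, (dim,), device=device) * 2 - 1).float()
        estimates.append(torch.dot(v, matvec(v)))
    return torch.stack(estimates).mean()

# Sanity check against an explicit symmetric matrix.
A = torch.randn(50, 50)
A = (A + A.T) / 2
print(hutchinson_trace(lambda v: A @ v, dim=50, num_samples=2000).item())
print(torch.trace(A).item())  # should be close
```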
https://github.com/noahgolmant/pytorch-hessian-eigenthings
I'm hoping to use it for some follow-up analysis. For example, right now I'm looking at the agreement between the updates produced by various optimizers (Muon, K-FAC, Natural Gradient Descent) on Pythia checkpoints.
Very open to suggestions or requests from anyone who's been working in this space. I've been out of the field for a while, so pointers to recent work I should be aware of are very welcome.
When distilling a teacher model into a student model, the student learns faster when trained to reproduce the teacher's full distribution over next tokens than when it's only trained to reproduce a single token randomly sampled from that distribution.
Is there a similar effect where also transferring the Hessian speeds up knowledge distillation even further?
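A sketch of one way to pose it (purely illustrative: it assumes continuous inputs, for token inputs you'd probe w.r.t. the embeddings, and the curvature-matching term and its `curv_weight` are made up): keep the usual soft-target KL term and add a penalty matching teacher and student curvature along a random probe direction, measured with Hessian-vector products so nothing quadratic is ever formed.

```python
import torch
import torch.nn.functional as F

def input_curvature(model, x, target, v, differentiable):
    """v^T H v, where H is the Hessian of the model's cross-entropy loss
    w.r.t. the input x, computed with double backprop (no explicit Hessian)."""
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), target)
    (g,) = torch.autograd.grad(loss, x, create_graph=True)
    (hv,) = torch.autograd.grad((g * v).sum(), x, create_graph=differentiable)
    return (v * hv).sum()

def distill_loss(student, teacher, x, temperature=2.0, curv_weight=0.1):
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    # (1) Soft targets: match the teacher's full distribution.
    #     (The "sampled token" baseline above would instead be a plain
    #     cross-entropy on one token drawn from the teacher's distribution.)
    soft = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # (2) Hypothetical "Hessian transfer": match input-space curvature of the
    #     teacher and student losses along a shared random probe direction.
    target = t_logits.argmax(dim=-1)
    v = torch.randn_like(x)
    v = v / v.norm()
    curv_t = input_curvature(teacher, x, target, v, differentiable=False)
    curv_s = input_curvature(student, x, target, v, differentiable=True)
    return soft + curv_weight * (curv_s - curv_t) ** 2
```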
Suppose one has a candidate alternative model architecture: how can one estimate the amount of compute needed to distill knowledge from a teacher into a student with that architecture?
Consider, for example, the following model: each token (or character, or bit) corresponds to a matrix (or a multivector), and a sequence of tokens corresponds to the matrix product (geometric product) of the corresponding matrices, in the same order. The partition function / relative likelihood is taken as exp(-||Product(M_i)||), where ||matrix/multivector|| is the positive-definite squared norm of the matrix or multivector (basically the sum of the squares of its components).
To get P(nextToken | productOfPreviousTokens) you calculate: P(productOfPreviousTokens * nextToken)/P(productOfPreviousTokens)
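To make that concrete, here's a tiny numerical sketch of the proposed model (the vocab size, matrix dimension, and random token matrices are all made up for illustration; the 1/P(productOfPreviousTokens) factor actually cancels once you normalize over the vocabulary, but it's kept to mirror the formula above):

```python
import torch

torch.manual_seed(0)
vocab_size, d = 16, 4                               # made-up sizes for illustration
token_mats = torch.randn(vocab_size, d, d) * 0.5    # one d x d matrix per token

def sq_norm(M):
    """Positive-definite squared norm: sum of squared entries (Frobenius norm squared)."""
    return (M * M).sum()

def next_token_probs(prefix_ids):
    """P(next | prefix) from the exp(-||M_1 @ ... @ M_n||) relative likelihoods."""
    prod = torch.eye(d)
    for t in prefix_ids:
        prod = prod @ token_mats[t]
    l_prefix = torch.exp(-sq_norm(prod))
    # Relative likelihood of each candidate continuation: P(prefix + next) / P(prefix).
    rel = torch.stack([
        torch.exp(-sq_norm(prod @ token_mats[t])) / l_prefix
        for t in range(vocab_size)
    ])
    # Normalize over the vocabulary to turn relative likelihoods into a distribution.
    return rel / rel.sum()

print(next_token_probs([3, 1, 4]))
```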
How does one calculate the expected number of forward inferences of the teacher network, and the corresponding number of gradient steps on the student network, before the student's performance plateaus, given their parameter sizes? And how does this expected number of required forward inferences scale with or without adding a Hessian loss term?