Hacker News

jimmypk, today at 9:31 AM

@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?
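For concreteness, here is a minimal sketch of the kind of setup described above: every pair for modular addition, with only a fraction used for training. The modulus `p=97`, the 30% training fraction, and the function name `modular_addition_split` are illustrative assumptions, not details taken from the paper.

```python
import random

def modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Build all (a, b, (a + b) % p) triples and split them into train/test.

    p, train_fraction and seed are hypothetical defaults for illustration.
    """
    triples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(triples)  # random split, so test pairs are never seen in training
    n_train = int(train_fraction * len(triples))
    return triples[:n_train], triples[n_train:]

train, test = modular_addition_split()
print(len(train), len(test))
```

A model that memorizes `train` fits those triples perfectly while staying near chance on `test`; grokking is the late jump in accuracy on the held-out triples.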


hackinthebochs, today at 10:10 AM

>Do you think grokking is consistent with implicit regularization as compression

Pretty sure it's been shown that grokking depends strongly on weight decay (L2 regularization), which pushes model parameters towards zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.
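A toy sketch of the "pushes parameters towards zero" part, assuming a decoupled weight-decay update and a loss gradient of exactly zero (roughly the situation once the training set is memorized). The function names, learning rate, and decay strength are made up for illustration, and this shows only the shrinkage, not grokking itself.

```python
import math

def decay_only_steps(weights, lr=0.01, weight_decay=1.0, steps=100):
    """Apply `steps` updates of w <- w - lr * weight_decay * w.

    The loss gradient is assumed to be zero, so only the decay term acts.
    """
    w = list(weights)
    for _ in range(steps):
        w = [wi - lr * weight_decay * wi for wi in w]
    return w

def l2_norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

w0 = [3.0, -4.0]
w1 = decay_only_steps(w0)
print(l2_norm(w0), l2_norm(w1))
```

Each step multiplies every weight by `1 - lr * weight_decay`, so the norm shrinks geometrically even though the training loss no longer changes. That gives a slow pressure that can keep acting long after memorization.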
