logoalt Hacker News

hashtayesterday at 9:04 PM1 replyview on HN

Interesting read. I remember the grokking paper when it came out but I don't think I've ever seen that classic grokking loss curve in my own hands on real data. Curious if others have seen it more often in practice


Replies

yorwbatoday at 6:50 AM

To get pure grokking, you need a model large enough to easily memorize the entire training data and keep training for a long time after memorization. In practice, you'll probably use a more realistically-sized model that might grok on some subset of the data, but not so strongly that it's extremely obvious.