
irthomasthomas · yesterday at 9:16 AM

Naturally. That's how LLMs work. During training you measure the loss (the difference between the model output and the ground truth) and try to minimize it. We prize models for their ability to learn. Here we can see that the large model does a great job of learning to draw SpongeBob, while the small model performs poorly.
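
A minimal sketch of that loop (toy PyTorch; the model, data, and sizes are placeholders, not any particular LLM): compute the cross-entropy loss against the ground-truth next tokens, then take a gradient step that reduces it.

    import torch
    import torch.nn as nn

    vocab_size, seq_len, batch = 100, 16, 4

    # Toy stand-in for an LLM: an embedding followed by a linear head.
    model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))  # toy training data
    inputs, targets = tokens[:, :-1], tokens[:, 1:]              # next-token targets

    optimizer.zero_grad()
    logits = model(inputs)                                       # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()                                              # gradient of the loss
    optimizer.step()                                             # update that reduces it
    print(f"loss: {loss.item():.3f}")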


Replies

ACCount37 · yesterday at 11:57 AM

We don't value LLMs for rote memorization, though. Perfect memorization is a long-solved task. We value LLMs for their generalization capabilities.

A scuffed but fully original ASCII SpongeBob is usually more valuable than a perfect recall of an existing one.

One major issue with highly sparse MoE is that it appears to advance memorization more than it advances generalization, which might be what we're seeing here.
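
For reference, a minimal sketch of what "highly sparse MoE" means here (toy PyTorch; sizes and names are illustrative only): a router scores many experts per token, but only the top-k of them actually run, so most parameters stay inactive on any given input.

    import torch
    import torch.nn as nn

    d_model, n_experts, top_k = 32, 16, 2

    router = nn.Linear(d_model, n_experts)            # scores every expert per token
    experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

    def sparse_moe(x):                                # x: (tokens, d_model)
        weights, idx = router(x).topk(top_k, dim=-1)  # keep only the top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(top_k):                     # run each selected expert
            for e in range(n_experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * experts[e](x[mask])
        return out

    print(sparse_moe(torch.randn(8, d_model)).shape)  # torch.Size([8, 32])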

endymion-light · yesterday at 9:21 AM

I'd argue that the smaller model is actually doing a better job at "learning", in that its ASCII image, while poor, still captures the key characteristics.

The larger model already has it in its training corpus, so this isn't a particularly good measure. I'd much rather see how a model does at representing something in ASCII that it's unlikely to have seen in training.

Maybe a pelican riding a bike, in ASCII, for both?

mdp2021 · yesterday at 12:49 PM

> That's how LLMs work

And that is also exactly how we want them not to work: we want them to be able to solve new problems. (Because Pandora's box is open, and they are not sold as flexible query machines.)

"Where was Napoleon born": easy. "How to resolve the conflict effectively": hard. Solved problems are interesting to students. Professionals have to deal with non trivial ones.

WhitneyLand · yesterday at 1:30 PM

Not really.

Typically less than 1% of training data is memorized.