logoalt Hacker News

volume_techtoday at 1:16 PM0 repliesview on HN

the '13 parameters' framing is misleading in both directions. the base model's billions of parameters do the heavy lifting; the 13 just steer existing circuitry. but that's actually the interesting finding: the behavior gap between a capable model and a reasoning model is geometrically tiny. you can find it with gradient descent in 13 dimensions of a multi-billion-dimensional space. that's a strong claim about the structure of task-relevant representations in transformers.