TinyLoRA – Learning to Reason in 13 Parameters

228 points • by sorenjan • last Friday at 12:11 PM • 44 comments

Comments

Not sure I buy it. First, computing the SVD to obtain U, Σ, V is computationally expensive, so this would only work if we are not finetuning very large models.

But my real concern is with the results. The "13 parameters" headline looks like bait, because it is a single result from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already saturated on every model. Besides, it seems to happen only for the Qwen model family... It looks like GSM8K was part of the Qwen training set, and this TinyLoRA finetuning made the last adjustments to perfectly reflect that overtraining.
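
For context on the U, Σ, V mentioned above, here is a minimal sketch of an SVD-based tiny adapter. It assumes the update has the form W + U_r Σ_r R V_rᵀ, where R is a small r×r matrix mixed from fixed random matrices by a short trainable vector. That form, the name make_tiny_adapter, and all sizes are illustrative assumptions, not the paper's or PEFT's code. In this sketch the SVD runs once per frozen weight matrix, not once per training step.

```python
import numpy as np

def make_tiny_adapter(W, rank=2, n_params=13, seed=0):
    """Freeze a truncated SVD of W and return a function v -> adapted W.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # One-time cost: thin SVD of the frozen weight matrix.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vt_r = U[:, :rank], S[:rank], Vt[:rank, :]
    # Fixed (non-trainable) random r x r matrices, one per trainable scalar.
    P = rng.standard_normal((n_params, rank, rank))

    def adapted(v):
        # Mix the fixed matrices with the n_params trainable scalars.
        R = np.tensordot(v, P, axes=1)          # shape (rank, rank)
        # U_r * S_r scales each column of U_r by its singular value.
        return W + (U_r * S_r) @ R @ Vt_r

    return adapted

W = np.random.default_rng(1).standard_normal((64, 32))  # toy weight matrix
adapted = make_tiny_adapter(W)
v = np.zeros(13)  # the only trainable parameters in this sketch
# A zero vector leaves the frozen weights unchanged.
assert np.allclose(adapted(v), W)
```

Only v would receive gradients here; U_r, S_r, Vt_r and P stay fixed, so the expensive decomposition is paid up front.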

kashifr • today at 1:34 PM

You can try out TinyLoRA in PEFT main now: https://huggingface.co/docs/peft/main/en/package_reference/t...

cestith • today at 1:49 PM

This is interesting and all, but “LoRA” is painfully close to “LoRa” (which is related to radio networking, not AI) when just scanning a list of topics. We’re never going to beat the Shannon limit on acronyms and initialisms.

I’m glad the rest of the anchor text gave some context.

MASNeo • today at 6:28 AM

Is this an April Fools' publication?

kgeist • today at 5:34 AM

>One theory is that the knowledge required to solve the task is already stored in the parameters of the model, and only the style has to change for task success

>In particular, learning to generate longer outputs may be possible in few parameters

Reminded me of: https://arxiv.org/abs/2501.19393

>we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps

Maybe the model indeed simply learns to emit the EOS token (or similar) later, and the capability is already in the base model.
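
The budget forcing idea quoted above can be sketched roughly as follows. This is a toy loop, not the linked paper's implementation: the step callable, the EOS marker string, and the min_tokens / max_tokens / max_waits knobs are all hypothetical names chosen for illustration.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-thinking marker

def budget_forced_generate(
    step: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
    max_waits: int = 2,
) -> List[str]:
    """Generate tokens, lengthening or cutting off the output by budget."""
    out: List[str] = []
    waits = 0
    while len(out) < max_tokens:          # forced termination at the cap
        tok = step(prompt + out)
        if tok == EOS:
            if len(out) < min_tokens and waits < max_waits:
                out.append("Wait")        # suppress the stop, nudge a re-check
                waits += 1
                continue
            break
        out.append(tok)
    return out

def toy_step(context: List[str]) -> str:
    # Stand-in for a model: stops after two consecutive "think" tokens.
    return EOS if context[-2:] == ["think", "think"] else "think"

result = budget_forced_generate(toy_step, ["Q"], min_tokens=5, max_tokens=20)
print(result)  # ['think', 'think', 'Wait', 'think', 'think']
```

The toy model wants to stop after two tokens; the loop replaces that first stop with "Wait", which is the lengthening behavior the quote describes.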

ashater • today at 4:53 PM

The reasoning is likely already part of the original model. It is well known that it is not possible to get a 1B-parameter model to reason, even with RL.

volume_tech • today at 1:16 PM

the '13 parameters' framing is misleading in both directions. the base model's billions of parameters do the heavy lifting; the 13 just steer existing circuitry. but that's actually the interesting finding: the behavior gap between a capable model and a reasoning model is geometrically tiny. you can find it with gradient descent in 13 dimensions of a multi-billion-dimensional space. that's a strong claim about the structure of task-relevant representations in transformers.
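
A toy illustration of "gradient descent in 13 dimensions" of a much larger space: freeze a big parameter vector, fix 13 random directions, and train only the 13 coefficients. Everything here (the sizes D and K, the quadratic loss, the planted target) is an assumption made for the sketch and has nothing to do with the paper's actual training setup.

```python
import random

D, K = 200, 13  # D stands in for the full parameter count; K is what we train
rng = random.Random(0)

theta0 = [rng.gauss(0, 1) for _ in range(D)]  # frozen "base model"
# Fixed random directions, scaled so the columns have roughly unit norm.
P = [[rng.gauss(0, 1) / D ** 0.5 for _ in range(K)] for _ in range(D)]
z_true = [rng.gauss(0, 1) for _ in range(K)]  # planted offset to recover

def lift(z):
    """Map K trainable numbers to a full D-dimensional parameter vector."""
    return [theta0[i] + sum(P[i][k] * z[k] for k in range(K)) for i in range(D)]

target = lift(z_true)  # by construction, reachable inside the subspace

def loss_and_grad(z):
    resid = [a - b for a, b in zip(lift(z), target)]
    loss = sum(r * r for r in resid)
    # Chain rule: d(loss)/dz_k = 2 * sum_i P[i][k] * resid[i]
    grad = [2 * sum(P[i][k] * resid[i] for i in range(D)) for k in range(K)]
    return loss, grad

z = [0.0] * K
for _ in range(200):
    _, g = loss_and_grad(z)
    z = [zk - 0.3 * gk for zk, gk in zip(z, g)]

final_loss, _ = loss_and_grad(z)
print(f"trainable numbers: {len(z)}, final loss: {final_loss:.2e}")
```

The sketch only works because the target was planted inside the 13-dimensional subspace; whether real task solutions sit that close to a base model is exactly the empirical claim under discussion.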

nekusar • today at 6:59 PM

Can a model that small dynamically grow? In other words, can it train itself AS it progresses through the network?

Xx_crazy420_xX • today at 6:30 AM

If I understand it correctly, the analogy could be:

Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:

  - (SFT): give him an algebra book with new nomenclature, definitions, and syntax
  - (RL): let him learn algebra using C syntax

a-t-c-g • today at 2:01 AM

The quality of custom models trained with proper reasoning datasets[0], even at small parameter counts (3-7B is the sweet spot), is incredible now.

[0]: cartesien.io or Salesforce's WebscaleRL

measurablefunc • today at 1:08 AM

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk so there is still room for improvement.

vasco • today at 11:24 AM

Most data in the training set of most reasoning models is crap I guess.

matt123456789 • today at 3:54 AM

Such low dimensionality of the LoRA vector must surely result in a close-to-linear modification to the KV calculation. This seems to me to imply that what we call "reasoning" is latent within the model. Pretty clear I didn't read the paper; I'm sure the authors address this.

sachaa • today at 5:53 AM

If 13 parameters can unlock better reasoning, then we will not be "training" models, we'll be steering them. Most of the capability is already there.

The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.
