
perfmode, today at 5:23 PM

The quality cliff question is the right one to be asking. There's a pattern in systems work where something that scales cleanly in theory hits emergent failure modes at production scale that weren't visible in smaller tests. The loss landscape concern is exactly that kind of thing, and nobody has actually run the experiment.

That said, I think the comparison to improving GGUF quantization isn't quite apples to apples. Post-training quantization is compressing a model that already learned its representations in high precision. Native ternary training is making an architectural bet that the model can learn equally expressive representations under a much tighter constraint from the start. Those are different propositions with different scaling characteristics. The BitNet papers suggest the native approach wins at small scale, but that could easily be because the quantization baselines they compared against (Llama 3 at 1.58 bits) were just bad. A full-precision model wasn't designed to survive that level of compression.
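To make the distinction concrete, here is a minimal sketch of ternary rounding, assuming the absmean scheme (scale by the mean absolute weight, round, clip to {-1, 0, 1}) as I understand it from the BitNet b1.58 paper. The function names and the toy weights are made up for illustration. Post-training quantization would apply something like this once to finished weights; native ternary training would apply it in the forward pass at every step, typically with a straight-through estimator for the gradients, which is not shown here.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Hypothetical helper: absmean ternary rounding of a flat list of weights."""
    # Scale is the mean absolute value; eps guards against an all-zero input.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to {-1, 0, 1}.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map ternary codes back to approximate real-valued weights."""
    return [c * scale for c in codes]

# Three states per weight carry log2(3) bits, which is where "1.58 bits" comes from.
bits_per_weight = math.log2(3)

# Toy weights, invented for this example.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
codes, scale = ternary_quantize(weights)
approx = dequantize(codes, scale)

# Reconstruction error of a one-shot (post-training style) rounding.
mse = sum((w - a) ** 2 for w, a in zip(weights, approx)) / len(weights)
print(codes, round(bits_per_weight, 3), mse)
```

The reconstruction error is the cost a full-precision model pays when it is rounded after the fact. A natively trained model never has a high-precision target to deviate from, which is part of why the two approaches are hard to compare directly.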

The real tell will be whether anyone with serious compute (not Microsoft, apparently) decides the potential inference cost savings justify a full training run. The framework existing lowers one barrier, but the more important barrier is that a failed 100B training run is extremely expensive, and right now there's not enough evidence to de-risk it. Two years of framework polish with no flagship model to show for it is telling.