Hacker News

luulinh90s · yesterday at 5:43 AM

In the "Performance" section of the post (https://www.guidelabs.ai/post/steerling-8b-base-model-releas...), the authors show the model lags behind Llama 8B, but it's worth noting that Llama 8B was trained with >2x more compute (see the FLOPs axis).


Replies

adebayoj · yesterday at 9:29 AM

Thanks for pointing this out. Llama 3 8B was trained on ~15T tokens, and the Qwen models on 15-18T tokens as well. We trained on 1.35T tokens and are within striking distance of these models on benchmarks. We expect to, at the very minimum, match these models' performance once we scale our token budget.
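For reference, a back-of-the-envelope way to compare training compute from these token counts is the common C ≈ 6·N·D approximation (FLOPs ≈ 6 × parameters × training tokens). This is a rough sketch, not the exact numbers on the post's FLOPs axis, and it assumes both models are ~8B dense transformers:

    # Rough training-compute comparison via the standard 6*N*D approximation.
    # Token counts are the ones cited above; the ~8B parameter count for
    # both models is an assumption of this sketch.

    PARAMS = 8e9  # ~8B parameters

    tokens = {
        "Llama 3 8B":   15e12,    # ~15T tokens
        "Steerling 8B": 1.35e12,  # 1.35T tokens
    }

    flops = {name: 6 * PARAMS * d for name, d in tokens.items()}
    for name, c in flops.items():
        print(f"{name}: ~{c:.2e} training FLOPs")

    ratio = flops["Llama 3 8B"] / flops["Steerling 8B"]
    print(f"compute ratio: ~{ratio:.1f}x")  # ~11x under this approximation

Under this approximation the gap is roughly 11x, consistent with (and well above) the ">2x" noted in the parent comment.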

One side effect that we are excited about is that interpretable model training might make for a data-efficient training process.