Thanks for pointing this out. Llama 3 8B was trained on ~15T tokens, and the Qwen models on 15-18T tokens as well. We trained on 1.35T tokens (roughly an order of magnitude fewer) and are within striking distance of these models on benchmarks. We expect to, at the very minimum, match their performance once we scale up our token budget.
One side effect that we are excited about is that interpretable model training might make for a more data-efficient training process.