Thanks for pointing this out. Llama 3 8B was trained on ~15T tokens, and the Qwen models on 15-18T tokens as well. We trained on 1.35T tokens (roughly an order of magnitude fewer) and are within striking distance of these models on benchmarks. We expect to, at the very minimum, match their performance once we scale up our token budget.
One side effect that we are excited about is that interpretable model training might make for a more data-efficient training process.