Wish they gave some numbers for total GPU hours to train this model. It seems comparatively tiny compared to LLMs, so I'm interested to know how close this is to something trainable by your average hobbyist, university, or small lab.
Edit: it looks like the paper does.
TPUv5e with 16 tensor cores for 2 days for the 200M param model.
Claude reckons this is about 60 hours on an 8xA100 rig, so very accessible compared to LLMs for smaller labs.
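For anyone wanting to sanity-check that conversion, here's a rough back-of-the-envelope in Python. The peak bf16 throughput figures (~197 TFLOPS per TPU v5e chip, ~312 TFLOPS per A100) are my assumptions from public spec sheets, and it naively assumes equal utilization on both platforms:

    # Rough conversion of TPU v5e training time to A100-hours.
    # Throughput numbers are assumed from public spec sheets, and
    # identical utilization on both platforms is a big simplification.
    TPU_V5E_TFLOPS = 197   # per chip, bf16 peak (assumed)
    A100_TFLOPS = 312      # per GPU, bf16 dense peak (assumed)

    tpu_chips = 16
    tpu_hours = 2 * 24     # 2 days of training

    # Total compute budget in TFLOPS-hours, then divide by the rig's throughput.
    total_compute = tpu_chips * TPU_V5E_TFLOPS * tpu_hours
    a100_rig_tflops = 8 * A100_TFLOPS

    hours_on_8xa100 = total_compute / a100_rig_tflops
    print(f"~{hours_on_8xa100:.0f} hours on an 8xA100 rig")  # prints ~61 hours

Which lands right around the 60-hour figure, so the estimate seems plausible.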