Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point in using these toy models.
You need it to run in about 8 GB so you have extra space for the context window.
Hello, it's the internet calling, today is that day.
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
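For a rough feel of that bandwidth limit, here's a back-of-envelope sketch. It assumes decoding is purely memory-bound, so every generated token has to stream all the active weights from RAM once. The function name and every number plugged in are illustrative assumptions, not measurements: about 50 GB/s is roughly the theoretical peak of dual-channel DDR4-3200, and the 3B active parameters for the MoE case is a guess, not a spec of any particular model.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if memory bandwidth is the only limit.

    Each generated token reads every active weight once, so
    tokens/s <= bandwidth / bytes_of_active_weights.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical inputs: 4-bit quantization, ~50 GB/s system RAM.
dense_35b = decode_tokens_per_second(35, 4, 50)  # dense: all 35B weights read per token
moe_3b_active = decode_tokens_per_second(3, 4, 50)  # MoE: assumed 3B active per token

print(f"dense 35B: ~{dense_35b:.1f} tok/s, MoE with 3B active: ~{moe_3b_active:.1f} tok/s")
```

Real numbers land below this bound (prompt processing, cache misses, and compute all cost something), but it shows why an MoE is usable on CPU where a dense model of the same size is not.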
They can be run on 32GB with 8GB VRAM. I don't think these will be on 16GB for a while. (35B MoE)
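The arithmetic behind that, as a minimal sketch: weight size is just parameter count times bits per weight. The 4.5 bits per weight (a typical ballpark for 4-bit quants with their scales) and the 6 GB set aside for OS, context, and buffers are assumptions picked for illustration, and the helper names are made up.

```python
def weights_gb(total_params_billion, bits_per_weight):
    """Size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9


def fits(total_params_billion, bits_per_weight, ram_gb, vram_gb=0.0, overhead_gb=6.0):
    """True if weights plus an assumed overhead fit in RAM + VRAM combined."""
    needed = weights_gb(total_params_billion, bits_per_weight) + overhead_gb
    return needed <= ram_gb + vram_gb


size = weights_gb(35, 4.5)  # 35B total parameters at an assumed 4.5 bits/weight
print(f"weights: ~{size:.1f} GB")
print("32GB RAM + 8GB VRAM:", fits(35, 4.5, ram_gb=32, vram_gb=8))
print("16GB RAM, no VRAM:", fits(35, 4.5, ram_gb=16))
```

Under these assumptions the weights alone are already larger than 16 GB, which is why only heavier quantization or a smaller model gets you there.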
It's so funny, these "toy models" would have been the wet dreams of researchers not 5 years ago.
Progress marches without mercy.