conradkay • today at 7:03 AM • 1 reply

It'd be way slower since you'd be doing that work every token

Replies

True. With 64GB of RAM it would already have to fetch about 20% of its active experts from disk, roughly 650MB/tok at 2-bit quant, and that percentage rises quickly as you lower RAM further. My question is just a more practical one: does it run at all, how bad is the slowdown, and how much of that decode throughput could you get back by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server?
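
For a rough sense of scale, here is a back-of-envelope sketch of what those two figures imply. The 650MB/tok and 20% numbers are from the comment above. The SSD read rate is a made-up placeholder, not a measurement, and the helper names are hypothetical. It treats disk reads as the only cost, so the result is a ceiling at best.

```python
def implied_active_expert_bytes(disk_bytes_per_token: float, miss_fraction: float) -> float:
    """Active-expert bytes touched per token, if miss_fraction of them come from disk."""
    return disk_bytes_per_token / miss_fraction


def disk_bound_tokens_per_sec(disk_bytes_per_token: float, ssd_bytes_per_sec: float) -> float:
    """Upper bound on decode speed if disk reads were the only cost per token."""
    return ssd_bytes_per_sec / disk_bytes_per_token


MB = 1e6
GB = 1e9

disk_per_tok = 650 * MB   # from the comment: ~650MB/tok read from disk
miss_fraction = 0.20      # from the comment: ~20% of active experts not in RAM
ssd_bandwidth = 3 * GB    # ASSUMPTION: placeholder sustained SSD read rate

active_bytes = implied_active_expert_bytes(disk_per_tok, miss_fraction)
ceiling = disk_bound_tokens_per_sec(disk_per_tok, ssd_bandwidth)

print(f"implied active expert weights per token: {active_bytes / GB:.2f} GB")
print(f"disk-bound decode ceiling: {ceiling:.1f} tok/s")
```

Swap in your own drive's measured read rate for `ssd_bandwidth`. This does not model parallel sessions: whether they win back throughput depends on how often concurrent tokens route to the same experts, which the numbers above don't tell you.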
