DiabloD3 • yesterday at 4:30 PM • 2 replies

They weren't the first to do MTP like this, and arguably did it wrong: the MTP heads are kept in a separate file and have to be welded in by the inference engine.

Qwen 3.6 shipped with working MTP first, and had working MTP in llama.cpp first.

Replies

spijdar • yesterday at 6:14 PM

Given the MTP drafter is basically a separate model, keeping it separate makes more sense IMO. It's out of my wheelhouse but it seems like you could adjust the MTP drafter model separately from the main model, too.
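To illustrate why a drafter can be treated as its own model, here is a minimal sketch of greedy speculative decoding in which the drafter and the main (target) model are just two independent callables. It assumes a toy setup: each "model" is a hypothetical function from a token prefix to its greedy next token. Real MTP heads also reuse the main model's hidden states, which this sketch ignores. The function name `speculative_decode` and its parameters are made up for illustration.

```python
from typing import Callable, List

# Hypothetical toy "model": maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding sketch.

    The drafter proposes up to k tokens; the target keeps the longest prefix
    it agrees with, plus one token of its own. The result is identical to
    plain greedy decoding with the target alone.
    """
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. Drafter proposes k tokens autoregressively (the cheap part).
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. Target verifies each proposed position. A real engine scores all
        #    k positions in one batched forward pass; this loop is for clarity.
        accepted: List[int] = []
        for tok in draft:
            expected = target(out + accepted)
            if expected != tok:
                accepted.append(expected)  # target's correction replaces the miss
                break
            accepted.append(tok)
        else:
            # Every draft token was accepted, so the target's next token is a bonus.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:limit]
```

Because the target only ever sees the drafter through its proposed tokens, you could swap or retune the drafter without touching the main model; a worse drafter just lowers the acceptance rate and the speedup, not the output.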

Ultimately, though, the real explanation, I think, is that Google doesn't care, since for their own purposes (in LiteRT-LM) they do bundle them. As far as I know, anyway.

kcb • yesterday at 7:08 PM

Nvidia's Nemotron 3 Super also shipped with MTP.
