Hacker News

dnautics · today at 3:44 PM · 1 reply

the paper is burying the lede here (i think?)

> The key technical unlock is to restrict lookup heads to head dimension 2, which enables a decoding path where the dominant retrieval/update operations can be computed in log time in the sequence length (for this structured executor regime), rather than by a full prefix-sized attention sweep.
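For intuition about the complexity claim only (this toy is my illustration of the claimed asymptotics, not the paper's actual mechanism): a vanilla decode step scores every cached position (linear in the prefix), whereas a retrieval structure keyed on a tiny head dimension can answer a nearest-key lookup by binary search over sorted keys, which is logarithmic.

```python
# Toy contrast: O(n) attention-style sweep vs. O(log n) sorted-key lookup.
# Names and structure here are illustrative assumptions, not from the paper.
import bisect

def attention_sweep(cache, query_key):
    """O(n): scan the whole prefix, as a full attention sweep does."""
    best = None
    for k, v in cache:  # one comparison per cached token
        if best is None or abs(k - query_key) < abs(best[0] - query_key):
            best = (k, v)
    return best[1]

def log_time_lookup(sorted_keys, values, query_key):
    """O(log n): nearest-neighbour retrieval via binary search."""
    i = bisect.bisect_left(sorted_keys, query_key)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_keys)]
    j = min(candidates, key=lambda j: abs(sorted_keys[j] - query_key))
    return values[j]

cache = [(0.1, "a"), (0.4, "b"), (0.9, "c")]  # keys kept sorted
keys = [k for k, _ in cache]
vals = [v for _, v in cache]
assert attention_sweep(cache, 0.5) == log_time_lookup(keys, vals, 0.5)
```

Both paths retrieve the same entry; the point is only that the second one never touches the whole prefix.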

edit: i understand how hullkv works now. very clever.

I don't understand why this strategy is applicable only to "code tokens"

lastly, i'm not sure why wasm is a good target. iirc wasm is really inefficient (not so much in code as in expressivity). i wonder if that curtails the LLM's ability to plan higher-order stuff (since it's always forced to think in the small)


Replies

D-Machine · today at 3:51 PM

> i have a pretty good understanding of how transformers work but this did not make sense to me. also i dont understand why this strategy is applicable only to "code tokens"

Yes, there is a monstrous lack of detail here, and you should be skeptical of most of the article's claims. The language is also IMO non-standard (serious people don't talk about self-attention as lookup tables anymore; that was never a good analogy in the first place). No good work would express this in prose alone: there would also be a simple equation showing the typical scaled dot-product attention formula, plus some dimension notation or details indicating which matrix (or inserted projection matrix) got a dimension of two somewhere. Otherwise, the claims are inscrutable (EDIT: see edit below).
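For reference, the standard formulation being alluded to is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch, with the shapes annotated so it's clear where a "head dimension 2" would live (d_k == 2):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d_k)) V.
    Shapes: Q (n, d_k), K (m, d_k), V (m, d_v) -> output (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, m)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (n, d_v)

# A "lookup head with head dimension 2" would just mean d_k == 2 here:
Q = np.random.randn(4, 2)   # 4 query positions, head dim 2
K = np.random.randn(6, 2)   # 6 cached positions
V = np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 8)
```

This is the textbook operation, not the article's variant; the article never pins down which projection gets the dimension of two.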

There are also no training or loss-function details, both of which would be necessary (and almost certainly highly novel) to make this kind of thing end-to-end trainable. That's another red flag.

EDIT: The key line seems to be around:

    gate, val = ff_in(x).chunk(2, dim=-1)
and related code, plus the line "Notice: d_model = 36 with n_heads = 18 gives exactly 2D per head", but, again, this is very unclear and non-standard.
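To make those hints concrete (this is my reading, not something the article confirms): d_model = 36 split across 18 heads leaves each head a 2-dimensional subspace, and `.chunk(2, dim=-1)` just halves the last axis into a gate half and a value half. A NumPy sketch with `ff_in` stubbed out as a hypothetical feed-forward layer:

```python
import numpy as np

d_model, n_heads = 36, 18
head_dim = d_model // n_heads
assert head_dim == 2            # "exactly 2D per head"

x = np.random.randn(5, d_model)             # 5 tokens
# per-head view: (tokens, heads, head_dim) -- each head sees a 2-D subspace
heads = x.reshape(5, n_heads, head_dim)
assert heads.shape == (5, 18, 2)

# the quoted `gate, val = ff_in(x).chunk(2, dim=-1)`:
# ff_in is assumed to project d_model -> 2*d_model; chunk halves the last axis
W = np.random.randn(d_model, 2 * d_model)   # stand-in for ff_in's weights
gate, val = np.split(x @ W, 2, axis=-1)
assert gate.shape == val.shape == (5, d_model)
```

So the snippet is consistent with a gated update per token, but nothing in the article says how the 2-D heads are actually used for retrieval.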