Is their convex hull attention mechanism new and generally usable? It substantially restricts the shape of the model, so it isn't a universal solution of course, but it does seem to overcome a pretty annoying limitation.
If you read the section "Richer attention mechanisms", you can see that, no, the mechanism is not generally usable (it requires significant modification to become differentiable). They later speculate:
While we do not yet know whether exact softmax attention
can be maintained with the same efficiency, it is easy to
approximate it with k-sparse softmax attention: retrieve
the top-k keys and perform the softmax only over those
But if you have played around with training models that use top-k or other hard-thresholding operations in, e.g., PyTorch (or just think about how many gradients become zero under such an operation), you know these tend to work only in extremely limited, specific cases, and they make training even more finicky than it already is.
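To make the gradient problem concrete, here is a minimal PyTorch sketch of the k-sparse softmax the paper speculates about (the function name and setup are my own, not from the paper): scores outside the top-k are masked to -inf, so they get zero attention weight and, crucially, zero gradient.

```python
import torch

torch.manual_seed(0)

def k_sparse_softmax(scores, k):
    # Keep only the top-k scores per row; mask the rest to -inf so
    # softmax assigns them exactly zero weight.
    topk = torch.topk(scores, k, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)
    return torch.softmax(masked, dim=-1)

scores = torch.randn(1, 8, requires_grad=True)
values = torch.arange(8.0)

weights = k_sparse_softmax(scores, k=2)
out = (weights * values).sum()
out.backward()

# Gradient only flows through the top-k entries; the other
# 6 of 8 positions receive exactly zero gradient.
print((scores.grad == 0).sum().item())
```

The masked-out keys get no learning signal at all on this step, which is exactly why hard thresholding tends to make training finicky: a key that starts outside the top-k can stay stuck there.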