
derangedHorse today at 4:32 PM

I initially agreed with a lot of the sentiment asking "why," but I've reframed my opinion. Instead of seeing this as a way to run programs via inference, I now see it as a way to bootstrap training. Think about the task of classification: if I have an expert system that classifies correctly 80% of the time, I can embed it in a model and train the model to push the success rate higher. The lower we can make the cost of training on various tasks, the more it levels the playing field of who can compete in the AI landscape.


Replies

yorwba today at 5:24 PM

The approach here is very bad for training, though: unlike softmax attention, average-hard attention is not differentiable with respect to the keys and queries. And if you try to fix that, e.g. with a straight-through estimator, the backward pass cannot be sped up in the same way as the forward pass.
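A toy sketch in plain Python (function names are mine, not from the thread) of the differentiability problem: average-hard attention puts uniform weight on the argmax positions, which is what softmax attention converges to as temperature goes to zero, but the hard output is piecewise constant in the scores, so gradients with respect to keys/queries vanish almost everywhere.

```python
import math

def softmax_attention(scores, values, temperature=1.0):
    # Standard softmax attention: a smooth function of the scores,
    # hence differentiable with respect to keys and queries.
    m = max(s / temperature for s in scores)
    w = [math.exp(s / temperature - m) for s in scores]
    z = sum(w)
    return sum(wi / z * v for wi, v in zip(w, values))

def average_hard_attention(scores, values):
    # Average-hard attention: uniform weight over all argmax positions.
    # The output is piecewise constant in the scores, so its gradient
    # with respect to the scores is zero almost everywhere -- the
    # non-differentiability problem described above.
    m = max(scores)
    picked = [v for s, v in zip(scores, values) if s == m]
    return sum(picked) / len(picked)

scores = [2.0, 5.0, 5.0, 1.0]   # two tied maximum scores
values = [1.0, 2.0, 4.0, 8.0]

print(average_hard_attention(scores, values))   # averages 2.0 and 4.0 -> 3.0
print(softmax_attention(scores, values, 1e-3))  # low temperature approaches 3.0
```

A straight-through estimator would use the hard output in the forward pass but backpropagate through the softmax, which is exactly why the backward pass loses the speedup available in the forward pass.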

refulgentis today at 5:55 PM

Training is ruled out (see the peer comment), but you may find this fascinating; it somewhat rhymes: https://arxiv.org/abs/2603.10055