Hacker News

FlashAttention-T: Towards Tensorized Attention

67 points by matt_d yesterday at 9:15 PM | 32 comments

Comments

jmward01 today at 12:06 AM

I built guided window attn (literally predict the position of the window) a while ago and that works great. Why are we still stuck on any form of attn that looks at the entire context in any meaningful way? Do humans work this way? Do I need a whole book to predict the next word? Who out there is working on really new unique ways to deal with infinite history, other than me of course :)
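For anyone curious what that looks like in practice, here's a rough sketch of window attention with a predicted window position; the NumPy framing, the window size, and the centers input are my own illustration, not the commenter's actual implementation:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def guided_window_attention(q, k, v, centers, window=64):
        # Each query attends only to a window of keys around a predicted center.
        # q, k, v: (n, d) arrays; centers: (n,) predicted key index per query
        # (in a trained model, centers would come from a small learned head).
        n, d = q.shape
        half = window // 2
        out = np.zeros_like(v)
        for i in range(n):
            lo = max(0, int(centers[i]) - half)
            hi = min(n, int(centers[i]) + half)
            scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # scores for the window only
            out[i] = softmax(scores) @ v[lo:hi]       # O(n * window), not O(n^2)
        return out

    # Toy usage: predict each query's own position as the window center.
    n, d = 256, 32
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    out = guided_window_attention(q, k, v, centers=np.arange(n), window=64)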

simianwords yesterday at 11:33 PM

OT, but instead of quadratic attention could we not have n^10 or something crazier? I feel like we are limiting the intelligence just to save cost. But I can imagine there might be some questions worth paying a higher cost for.

I feel like n^10 attention could capture patterns that lower-complexity attention may not. So it seems arbitrary that we have n^2 attention.
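To put rough numbers on the cost argument (my own back-of-the-envelope, nothing from the paper): the score tensor alone grows as n^order, so even order 3 is already out of reach at modern context lengths, never mind order 10:

    # Number of attention "score" entries for a context of n tokens at
    # different interaction orders (order 2 is standard pairwise attention;
    # higher orders would score triples, quadruples, ...).
    n = 4096
    for order in (2, 3, 10):
        print(f"n^{order} = {float(n) ** order:.2e} scores")
    # n^2  ~ 1.7e+07  -> a few tens of MB, fits easily on one GPU
    # n^3  ~ 6.9e+10  -> well over 100 GB just for fp16 scores
    # n^10 ~ 1.3e+36  -> far beyond any conceivable hardware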

sigbottle yesterday at 11:27 PM

Oh wow, there's still work being done on Ampere?

I was wondering - I've been thinking about switching to AI systems programming (I know, easy task), but from what I understand, industry cloud GPUs are the main winners, right? Nobody's going to pay me (assuming I even had the skills) to optimize for consumer GPUs?

From what I understand, it's not just core counts + memory capacity + raw performance, it's the actual core primitives. I don't think any of the "Blackwell" chips like the Grace one or the RTX 5090 have, for example, SM pairs in their ISA? And likewise there are similar fundamental differences between consumer and datacenter Hopper (where the majority of the perf comes from the datacenter ISA?)

So I guess I'm wondering whether I should buy a GPU myself or just rent in the cloud if I want to start getting some experience in this field. How do you even get experience in this normally, anyway? Do you get into really good schools and into their well-funded AI labs?

semiinfinitely yesterday at 10:28 PM

Tri Dao isn't on the paper; is it even allowed to call it "FlashAttention"???

saagarjha yesterday at 10:40 PM

Less annoying link directly to the paper: https://dl.acm.org/doi/pdf/10.1145/3774934.3786425?download=...

verytrivial today at 12:10 AM

TL;DR: a 5%-17% speedup from removing a bottleneck by juggling where on the GPU (which compute units) each part of the FlashAttention computation runs.
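For context, here's a rough NumPy illustration of the generic FlashAttention tiling loop (not the paper's kernel), annotated with which steps are the big matmuls that run on tensor cores and which are the softmax/rescaling bookkeeping whose placement is the kind of thing being juggled:

    import numpy as np

    def flash_attention_block(q, k, v, tile=128):
        # One query block of the FlashAttention online-softmax loop.
        # q: (bq, d) query block; k, v: (n, d) keys/values, streamed in tiles.
        n, d = k.shape
        m = np.full(q.shape[0], -np.inf)          # running row max
        l = np.zeros(q.shape[0])                  # running softmax denominator
        acc = np.zeros((q.shape[0], v.shape[1]))  # running output accumulator
        for start in range(0, n, tile):
            kt, vt = k[start:start + tile], v[start:start + tile]
            s = q @ kt.T / np.sqrt(d)             # matmul -> tensor cores
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])        # exp + rescaling: the
            scale = np.exp(m - m_new)             # non-matmul "bookkeeping"
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ vt   # matmul -> tensor cores
            m = m_new
        return acc / l[:, None]

    # Sanity check against naive softmax(Q K^T / sqrt(d)) V.
    rng = np.random.default_rng(0)
    q = rng.standard_normal((64, 32))
    k = rng.standard_normal((512, 32))
    v = rng.standard_normal((512, 32))
    s = q @ k.T / np.sqrt(32)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v
    assert np.allclose(flash_attention_block(q, k, v), ref)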

measurablefunc yesterday at 10:07 PM

[flagged]
