Hacker News

make3 today at 5:33 PM

I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally and not worth mentioning.


Replies

neosat today at 5:39 PM

Agreed. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.

mchinen today at 5:52 PM

Ah yeah, thinking further, it's probably just using some positional embedding based on sequence numbering, added in the LLM layers. For vision it needs the patch location as well.
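
For anyone following along, here is a rough sketch of what "positional embedding based on sequence numbering" and "patch location" could look like. It is not taken from the model under discussion: the fixed sinusoidal scheme, the function names (`sinusoidal_embedding`, `add_sequence_positions`, `patch_embedding`), and the choice to split channels between row and column are all assumptions for illustration. Real systems may use learned tables, rotary embeddings, or something else entirely.

```python
import math


def sinusoidal_embedding(position, dim):
    """Return a fixed sinusoidal embedding of length `dim` for one integer position.

    Hypothetical helper: interleaves sin/cos pairs at geometrically spaced
    frequencies, in the style of the original Transformer encoding.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    vec = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec


def add_sequence_positions(tokens):
    """Add a position embedding to each token vector, keyed by its sequence index.

    `tokens` is a list of equal-length lists of floats (e.g. audio frames).
    """
    dim = len(tokens[0])
    return [
        [t + p for t, p in zip(tok, sinusoidal_embedding(idx, dim))]
        for idx, tok in enumerate(tokens)
    ]


def patch_embedding(row, col, dim):
    """Encode a 2D patch location (assumed layout: first half row, second half column)."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return sinusoidal_embedding(row, half) + sinusoidal_embedding(col, half)


if __name__ == "__main__":
    frames = [[0.0] * 4 for _ in range(3)]  # three dummy audio frames
    print(add_sequence_positions(frames)[0])
    print(len(patch_embedding(1, 2, 8)))
```

The cost is one vector addition per token, which is negligible next to attention, and that fits the point above that it often goes unmentioned.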