>I don't think there is anything in a transformer I couldn't explain in the smallest detail now.
If you're up for it, I would love to know how and why positional encodings work.
Well, as I suggested, working through the implementation yourself will give you that intuition. That said, I think the simplest way to explain why positional encodings are useful is that they give the transformer just enough information about order to make attention meaningful, without getting in the way of the parallel, content-based comparisons attention is built on.
A vanilla self-attention layer treats its input as a set of token vectors, not a sequence. Without positional info, swapping two tokens just swaps the corresponding outputs; nothing else about what attention computes changes, so the layer can't tell "dog bites man" from "man bites dog". We can "fix" this problem by using positional encodings. Text that has meaning isn't just a bag of characters; the location and order of those characters are what provide meaning.
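If you want to see this for yourself, here's a minimal sketch, assuming NumPy, a single attention head with random weights, and the fixed sinusoidal encoding (only one of several schemes; learned and rotary encodings exist too). The function names, sizes, and seed are all made up for illustration, not taken from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no masking."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal encoding; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

rng = np.random.default_rng(0)  # arbitrary seed
seq_len, d_model = 5, 8         # toy sizes
x = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.array([1, 0, 2, 3, 4])  # swap the first two tokens

# Without positions: reordering the input only reorders the output rows.
out = self_attention(x, w_q, w_k, w_v)
out_swapped = self_attention(x[perm], w_q, w_k, w_v)
no_pos_equivariant = np.allclose(out_swapped, out[perm])

# With positions: the same token at a different index is a different vector,
# so the swap now changes what attention computes.
pe = sinusoidal_positions(seq_len, d_model)
out_pe = self_attention(x + pe, w_q, w_k, w_v)
out_pe_swapped = self_attention(x[perm] + pe, w_q, w_k, w_v)
with_pos_equivariant = np.allclose(out_pe_swapped, out_pe[perm])

print(no_pos_equivariant, with_pos_equivariant)
```

The first flag should come out true and the second false: without the encoding, attention only sees which vectors are present, and with it, each token's vector also carries where it sits. Note that the encoding is simply added to the embeddings before the layer, so the attention computation itself stays fully parallel.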
Learn about superposition and then you will see nobody really knows why this stuff works. It's actually a good interview question to set the bar.