Not OP, but I worked through Sebastian Raschka's "Build a Large Language Model (From Scratch)" [0] and Raj Abhijit Dandekar's "Build a DeepSeek Model (From Scratch)" [1].
I don't think there is anything in a transformer I couldn't explain in the smallest detail now.
[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
>I don't think there is anything in a transformer I couldn't explain in the smallest detail now.
If you're up for it, I would love to know how and why positional encodings work.