Could you perhaps cite the core papers for LLMs beyond "Attention Is All You Need"?

wuschel • today at 6:26 AM

Replies

"Attention Is All You Need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture, while modern LLMs are decoder-only. So it's an unnecessarily hard way to get into the subject.

"Language Models are Unsupervised Multitask Learners" (aka the GPT-2 paper) is probably what you are looking for. This was the first time LLMs really showed what is possible, i.e. that they can learn to generalize very well from unstructured data. That meant no more human labelling was necessary, which until then had been the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today, which highlights that there was more to it than just "scaling the transformer algorithm", as many people claim.

Most developments since then were about improving training data, until sparsely-gated mixture-of-experts layers (introduced in "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", 2017) drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference-time compute (whatever you want to call it nowadays) are actually all about training again; they use the exact same architecture.

blackbear_ • today at 6:55 AM

The GPT-3 paper is a good starting point:

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

I also enjoyed the DeepSeek and GLM papers for an overview of all the tricks you need to make these things work:

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471

sharma-arjun • today at 7:17 AM

Not a core paper, but I found Formal Algorithms for Transformers [1] (a DeepMind paper from 2022) to have a great pedagogical style.

[1] https://arxiv.org/abs/2207.09238
