LLMs are not deterministic, per my understanding. A program always produces the same output for the same input and instructions (ignoring FP accuracy for now). How is determinism achieved here?
LLMs produce a distribution of token probabilities which is then sampled. This sampling is the only random part of the system.
If you always take the most probable token (greedy decoding), the system becomes fully deterministic. We usually don't do this because the output becomes stiffer and less creative.
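To make the distinction concrete, here's a minimal sketch with a made-up three-token distribution (the probabilities are hypothetical, purely for illustration):

```python
import random

# Toy next-token distribution (hypothetical values for illustration)
probs = {"the": 0.55, "a": 0.30, "an": 0.15}

def greedy(dist):
    # Greedy decoding: always pick the argmax -> fully deterministic
    return max(dist, key=dist.get)

def sample(dist, rng):
    # Stochastic sampling: draw a token according to its probability
    # -> the output varies from run to run
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

rng = random.Random()
print(greedy(probs))       # always "the"
print(sample(probs, rng))  # "the" most often, but sometimes "a" or "an"
```

The sampling step is the only place randomness enters; everything before it (producing the distribution) is an ordinary computation.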
LLMs may be deterministic in practice for a subset of inputs: if one token's probability is significantly higher than the rest, sampling will almost always pick it. My understanding is that outputs diverge when the top probabilities are close.
LLMs (or at least transformer-based LLMs) are almost entirely deterministic in practice, with the randomness largely present due to (unnecessary) optimizations and other tweaks.
Temperature is not at all core to LLMs; it is something that makes the outputs more varied and generally more desirable for human consumption. It is trivial to set to zero for applications like this.
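A sketch of how temperature works, using arbitrary example logits (temperature divides the logits before the softmax; exactly T=0 would divide by zero, so in practice "temperature zero" is implemented as plain argmax):

```python
import math

def softmax_with_temperature(logits, T):
    # As T -> 0+, the distribution collapses onto the argmax (greedy);
    # higher T flattens it, making sampled outputs more varied.
    scaled = [l / T for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical logits for three tokens
print(softmax_with_temperature(logits, 1.0))  # moderately spread out
print(softmax_with_temperature(logits, 0.1))  # nearly one-hot on index 0
```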
On CPUs, the models are essentially fully deterministic, even accounting for FP accuracy, and most common kernels have reproducible (albeit slower) variants even on GPUs. Otherwise, yes, FP non-associativity on GPUs is the only real source of randomness in inference.
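A quick illustration of the non-associativity itself (Python doubles here rather than GPU float32, but the principle is identical):

```python
# Floating-point addition is not associative: regrouping the exact same
# operands changes how intermediate results get rounded.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
regrouped = 0.1 + (0.2 + 0.3)       # 0.6
print(left_to_right == regrouped)   # False
```

On a GPU, the grouping of additions inside a reduction depends on how the kernel parallelizes the work, which is why the "same" computation can return slightly different numbers from run to run.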
The other issue arises from a lack of batch invariance, but this is a problem that occurs only at scale, when serving multiple users means the batch composition itself is effectively random. You can (usually) trivially eliminate this by controlling what goes in the batch or fixing the batch size at one. There are also other, more clever mitigations for this, none of which are secrets.
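A toy analogue of why batch composition matters (the split function is my own illustration, not any real kernel): the same values summed with a different partitioning, as a kernel might choose for different batch sizes, give different floating-point results.

```python
def reduce_split(values, k):
    # Sum the first k elements, then the rest, then combine --
    # mimicking a reduction whose partitioning depends on the
    # batch size or kernel launch configuration.
    return sum(values[:k]) + sum(values[k:])

v = [0.1, 0.2, 0.3]
print(reduce_split(v, 1))  # 0.6
print(reduce_split(v, 2))  # 0.6000000000000001
```

A batch-invariant kernel fixes the reduction order regardless of batch size, so a given request always hits the same sequence of roundings.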
EDIT - Forgot reference: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...