I don't know if we are talking past each other, but I don't think this conversation is about absolute probabilities? The question is about relative uncertainty, and the softmax values are just fine for that.
It's too computationally expensive, which is why nobody does this for production inference. But there are alignment tools that researchers at the frontier labs use to extract these latent-space probabilities.
> The question is about relative uncertainty, and the softmax values are just fine for that.
They really aren't, especially in the chain-of-thought / recursive-application case. You can't even assume that a 0.1 difference in softmax values means the same relative difference from one input to another, or that a 0.9 always means "extremely confident". You really have no idea unless you test the calibration explicitly on calibration data.
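To make the "test the calibration explicitly" point concrete, here is a minimal sketch of such a check in Python. It assumes you've collected, from a labeled calibration set, a softmax confidence and a correctness flag per prediction; the `confidences` and `correct` arrays are hypothetical placeholders for that data:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted gap between stated confidence and empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # what the softmax claims
        accuracy = correct[mask].mean()      # what actually happened
        ece += mask.mean() * abs(avg_conf - accuracy)
    return ece
```

A well-calibrated model's 0.9-confidence predictions are right about 90% of the time, so ECE is near zero; a large ECE means a value like 0.9 can't be read as "extremely confident" for that model, which is exactly the problem with treating raw softmax values as uncertainty.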
> But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs
You can get embeddings. If you can get calibrated probabilities, you'll need to provide a citation, because that would be a huge deal for all sorts of applications.
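For contrast, the embeddings half of that claim is routine. A minimal sketch using the Hugging Face transformers library ("gpt2" is just a stand-in model; any causal LM works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Some input text", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Final-layer hidden state for each token: a point in latent space,
# not a calibrated probability of anything.
embeddings = outputs.hidden_states[-1]  # shape: (1, seq_len, hidden_dim)
```

Nothing in those hidden states is a probability; turning them into calibrated probabilities is the part that would need the citation.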