After the network has been trained, the softmax output is an estimate of the class probabilities in the training data; it is not those probabilities themselves.
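As a rough illustration of that point, here is a minimal NumPy sketch of softmax applied to a set of logits. The logit values and the true-class index are made up for the example and do not come from any trained model. The outputs are non-negative and sum to 1, so they look like probabilities, but they only reflect whatever the model's scores happen to be.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits for one input over three classes (assumed values).
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

# Negative log of the estimated probability of the true class
# (class index 0 is an assumption for this example).
true_class = 0
nll = -np.log(probs[true_class])

print(probs, probs.sum(), nll)
```

The quantity `nll` is the negative log softmax of the true class, which is the per-example loss that training minimizes when the log softmax is used as the objective.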
Which models are not trained with the log softmax as the loss function?