The properties that the universal approximation theorem establishes are not unique to neural networks.
Any model with an infinite-dimensional hypothesis space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, or gradient-boosted decision trees, has the same property (though it is proven via a different theorem, of course).
So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
Whenever people bring this up, I like to remind them that even linear interpolation is a universal function approximator.
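To make that concrete, here is a minimal sketch (pure Python, no ML libraries): piecewise-linear interpolation of sin on a grid of knots, where the worst-case error shrinks as knots are added. That is all "universal approximation" demands, which is why the property alone says so little. The function names here (`interp`, `max_error`) are just illustrative, not from any library.

```python
import math

def interp(xs, ys, x):
    """Piecewise-linear interpolation of the points (xs[i], ys[i]) at x."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return (1 - t) * ys[i] + t * ys[i + 1]
    raise ValueError("x outside interpolation range")

def max_error(f, n, a=0.0, b=math.pi, grid=1000):
    """Worst-case gap between f and its linear interpolant on n knots over [a, b]."""
    xs = [a + (b - a) * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    pts = [a + (b - a) * j / grid for j in range(grid + 1)]
    return max(abs(f(x) - interp(xs, ys, x)) for x in pts)

# Doubling the knot count keeps shrinking the error toward zero:
# universal approximation, achieved by the dumbest method imaginable.
for n in (5, 9, 17, 33):
    print(n, max_error(math.sin, n))
```

The error falls roughly quadratically in the knot spacing, so approximation power is trivially available; nothing here explains generalization.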
Universal approximation is like saying that a problem is computable: sure, that gives some relief, but it says nothing in practice, unlike, e.g., knowing which side of the P/NP divide the problem is on.
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which acts analogously to enforcing compression.