https://en.wikipedia.org/wiki/Universal_approximation_theore...
the better question is why does gradient descent work for them
Interestingly, there exist problems that provably can't be learned by neural networks via gradient descent.
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
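One wrinkle in the walking-uphill intuition: gradient descent finds *a* local optimum, but which one depends on where you start. A minimal sketch (plain Python, hypothetical 1-D function chosen for illustration):

```python
# Plain gradient descent on a non-convex 1-D function.
# "Walking downhill" always reaches *a* minimum, but which one
# depends on the starting point -- the hill-walking intuition alone
# doesn't guarantee you reach the best (global) optimum.
def grad_descent(df, x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)  # step against the gradient
    return x

# f(x) = x^4 - 3x^2 + x has two local minima (near -1.30 and +1.13);
# its derivative is f'(x) = 4x^3 - 6x + 1.
df = lambda x: 4 * x**3 - 6 * x + 1

print(grad_descent(df, x0=-2.0))  # settles near the left minimum
print(grad_descent(df, x0=+2.0))  # settles near the right minimum
```

Starting on either side of the central hump yields a different answer, so the interesting question is why, for neural networks in practice, the optima gradient descent reaches tend to be good ones.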
The properties that the universal approximation theorem proves are not unique to neural networks.
Any model operating in an infinite-dimensional Hilbert space, such as an SVM with an RBF or polynomial kernel, Gaussian process regression, or gradient-boosted decision trees, has the same property (though proven via a different theorem, of course).
So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.
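To make the point concrete, here's a hedged sketch (numpy only; the target function and hyperparameters are arbitrary choices for illustration) of kernel ridge regression with an RBF kernel, a non-neural-network model, approximating a smooth function to high accuracy:

```python
# Kernel ridge regression with an RBF kernel: a non-NN model that can
# approximate smooth functions arbitrarily well, just like a neural net.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X).ravel()  # smooth target to approximate

# Fit: solve (K + lam*I) alpha = y, so predictions are sums of RBF bumps
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
pred = rbf_kernel(X_test, X) @ alpha
max_err = np.max(np.abs(pred - np.sin(2 * X_test).ravel()))
print(f"max approximation error: {max_err:.4f}")
```

The fit here needs no gradient descent at all, it's a single linear solve, which underlines that expressive power and trainability are separate questions.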