It's been used, along with every other divergence and distance you can think of.
In practice, which divergence you use doesn't seem to matter much. The KL is the one with the strongest theoretical foundation though, i.e. it will work in the limit of infinite data. The important aspect seems to be that neural networks are Lipschitz-bounded, and that this bound is the main constraint preventing collapse.
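To make "divergence" a bit more concrete, here is a minimal sketch in plain Python comparing three common choices (KL, Jensen-Shannon, total variation) on two toy discrete distributions. The function names and the example distributions are my own, purely for illustration. It says nothing about training dynamics; it only shows that these are different ways of scoring the same mismatch.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf   # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence: symmetrised KL against the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tv(p, q):
    """Total variation distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Toy distributions (made up for this example).
p = [0.5, 0.3, 0.2]
q = [0.1, 0.3, 0.6]

print("KL(p||q):", kl(p, q))
print("KL(q||p):", kl(q, p))  # KL is not symmetric
print("JS(p,q): ", js(p, q))
print("TV(p,q): ", tv(p, q))
```

All three are zero when the two distributions are identical and positive otherwise; they differ in symmetry and in how they treat regions where one distribution has no mass (KL blows up there, JS and TV stay bounded).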