Why not use this instead of KL in reinforcement learning?
It's been used, along with every other divergence and distance you can think of.
In practice, which divergence you use doesn't seem to matter much. The KL has the most theoretical foundation, though, i.e. it is guaranteed to work in the limit of infinite data. What seems to matter more is that neural networks are Lipschitz-bounded, and that this is the main constraint preventing collapse.
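JSD is a symmetrized KL, but it isn't forward KL + reverse KL (that sum is the Jeffreys divergence). It's the average KL of each distribution to their mixture: JSD(P||Q) = 1/2 KL(P||M) + 1/2 KL(Q||M), with M = (P + Q)/2.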
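A small sketch comparing the Jensen-Shannon divergence with forward KL + reverse KL (the Jeffreys divergence) on two made-up discrete distributions. JSD averages the KL of each distribution to the mixture M = (P + Q)/2, so the two quantities are not the same thing; the distributions p and q are arbitrary examples.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(p, q):
    """Forward KL + reverse KL (the Jeffreys divergence)."""
    return kl(p, q) + kl(q, p)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]  # hypothetical distribution P
q = [0.1, 0.3, 0.6]  # hypothetical distribution Q

print(f"KL(p||q)   = {kl(p, q):.4f}")
print(f"KL(q||p)   = {kl(q, p):.4f}")
print(f"Jeffreys   = {jeffreys(p, q):.4f}")
print(f"JSD(p||q)  = {jsd(p, q):.4f}  (at most ln 2 = {math.log(2):.4f})")
```

Unlike the Jeffreys divergence, JSD stays finite even when one distribution puts zero mass where the other doesn't, because the mixture M is never zero where P or Q is positive.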
In reinforcement learning, what we usually want is to find the optimal action, i.e. the action that maximizes the reward. This translates to so-called "mode-seeking" optimization, which corresponds to the reverse KL.
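A toy sketch of the mode-seeking behaviour (my own illustration, not how any particular RL algorithm does it): fit a single Gaussian q to a bimodal target p by brute-force grid search, once minimizing forward KL(p||q) and once reverse KL(q||p). The target mixture, the integration grid and the search ranges are all arbitrary assumptions.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2 * math.pi)

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - LOG_SQRT_2PI

def log_target(x):
    """Bimodal target p = 0.5 * N(-3, 1) + 0.5 * N(3, 1), via log-sum-exp."""
    a = math.log(0.5) + log_normal(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal(x, 3.0, 1.0)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

DX = 0.05
XS = [-12.0 + DX * i for i in range(481)]  # integration grid on [-12, 12]
LOG_P = [log_target(x) for x in XS]

def divergences(mu, sigma):
    """Riemann-sum approximations of forward KL(p||q) and reverse KL(q||p)."""
    fwd = rev = 0.0
    for lp, x in zip(LOG_P, XS):
        lq = log_normal(x, mu, sigma)
        fwd += math.exp(lp) * (lp - lq) * DX
        rev += math.exp(lq) * (lq - lp) * DX
    return fwd, rev

def fit():
    """Grid search over (mu, sigma); returns the best (value, mu, sigma) per divergence."""
    best_fwd = best_rev = None
    for i in range(41):          # mu in [-5, 5], step 0.25
        mu = -5.0 + 0.25 * i
        for j in range(19):      # sigma in [0.5, 5], step 0.25
            sigma = 0.5 + 0.25 * j
            fwd, rev = divergences(mu, sigma)
            if best_fwd is None or fwd < best_fwd[0]:
                best_fwd = (fwd, mu, sigma)
            if best_rev is None or rev < best_rev[0]:
                best_rev = (rev, mu, sigma)
    return best_fwd, best_rev

(f_val, f_mu, f_sigma), (r_val, r_mu, r_sigma) = fit()
print(f"forward KL picks mu={f_mu:+.2f}, sigma={f_sigma:.2f}")
print(f"reverse KL picks mu={r_mu:+.2f}, sigma={r_sigma:.2f}")
```

Forward KL should land near mu = 0 with a wide sigma, spreading mass over both modes; reverse KL should lock onto a single mode (mu near +/-3, sigma near 1), which is the "mode-seeking" behaviour.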
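To minimise the (forward) KL, you just calculate the surprisal, i.e. the negative log-likelihood of your data under the model; the entropy of the data itself is a constant, so minimising the average surprisal minimises the KL. The integral can be approximated by averaging over your training data. It's a direct expression of the information loss between your real data and your fitted probability distribution.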
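A minimal sketch of the sampling idea, assuming a known discrete "true" distribution p (which you wouldn't have in practice; here it only lets us check the answer): draw samples from p, average the surprisal -log q(x) under the model, and subtract the data entropy to get KL(p||q). Only the average-surprisal term depends on q. The distributions and the sample size are made up.

```python
import math
import random

def kl_exact(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of p in nats (constant w.r.t. the model q)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def avg_surprisal(samples, q):
    """Average surprisal -log q(x) over the samples, i.e. the model's NLL per sample."""
    return sum(-math.log(q[x]) for x in samples) / len(samples)

rng = random.Random(0)
p = [0.5, 0.3, 0.2]  # hypothetical "true" data distribution (assumed known for the check)
q = [0.4, 0.4, 0.2]  # hypothetical fitted model
data = rng.choices(range(len(p)), weights=p, k=100_000)

# Monte Carlo estimate: cross-entropy estimate minus the constant data entropy
kl_mc = avg_surprisal(data, q) - entropy(p)
print(f"Monte Carlo KL(p||q) = {kl_mc:.4f}, exact = {kl_exact(p, q):.4f}")
```

In a real training loop you'd only ever compute `avg_surprisal` (the NLL loss); the entropy term is unknown but doesn't affect which q is best.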
Calculating the JSD can be more difficult, because the expression uses a mixture of the 'true' and 'fitted' distributions. You can still simulate this, but half the time you'd be fitting the model to itself, and I just don't see why that would be useful.
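Roughly what simulating the mixture looks like: a Monte Carlo estimate of JSD(P||Q) draws samples from both distributions (half from the "true" one, half from the model itself) and evaluates log P(x)/M(x) and log Q(x)/M(x). It needs probabilities under both P and Q, which for real data you normally don't have; the two toy distributions below are assumptions that just let us compare against the exact value.

```python
import math
import random

def kl(a, b):
    """KL(a || b) in nats for discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jsd_exact(p, q):
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_monte_carlo(p, q, n, rng):
    """Estimate JSD(p || q) from n samples of each distribution.

    Needs p[x] and q[x] at every sample, i.e. probabilities under both
    the 'true' and the fitted distribution.
    """
    support = range(len(p))
    from_p = rng.choices(support, weights=p, k=n)  # half the samples: "true" data
    from_q = rng.choices(support, weights=q, k=n)  # other half: the model's own samples
    term_p = sum(math.log(p[x] / (0.5 * (p[x] + q[x]))) for x in from_p) / n
    term_q = sum(math.log(q[x] / (0.5 * (p[x] + q[x]))) for x in from_q) / n
    return 0.5 * term_p + 0.5 * term_q

p = [0.5, 0.3, 0.2]  # hypothetical "true" distribution
q = [0.2, 0.3, 0.5]  # hypothetical fitted model
estimate = jsd_monte_carlo(p, q, 100_000, random.Random(0))
print(f"Monte Carlo JSD = {estimate:.4f}, exact = {jsd_exact(p, q):.4f}")
```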
I think the JSD is most useful when you need an actual metric (strictly, its square root is the metric), but as long as you have a fitted and a target distribution, the KL divergence is a natural fit since you can interpret the result as information loss.
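A quick numerical check of the metric point (a sketch on random discrete distributions, not a proof): the square root of the JSD is symmetric, bounded by sqrt(ln 2) in nats, and satisfies the triangle inequality, whereas the KL is asymmetric. The number of trials, the support size and the seed are arbitrary choices.

```python
import math
import random

def kl(p, q):
    """KL(p || q) in nats for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def random_dist(rng, k):
    w = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

def check_metric(trials=2000, k=4, seed=0):
    """Raise AssertionError if any sampled triple violates the metric properties."""
    rng = random.Random(seed)
    bound = math.sqrt(math.log(2))
    for _ in range(trials):
        p, q, r = (random_dist(rng, k) for _ in range(3))
        d_pq = js_distance(p, q)
        assert abs(d_pq - js_distance(q, p)) < 1e-12                  # symmetric
        assert d_pq <= bound + 1e-12                                  # bounded
        assert d_pq <= js_distance(p, r) + js_distance(r, q) + 1e-9   # triangle inequality
    return True

print("sqrt(JSD) passed all checks:", check_metric())
p, q = [0.9, 0.1], [0.5, 0.5]
print(f"KL(p||q) = {kl(p, q):.4f}, KL(q||p) = {kl(q, p):.4f}")  # not symmetric
```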