contravariant • today at 3:00 PM

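To minimise the KL divergence you just minimise the average surprisal (the negative log-likelihood) of your data under the fitted distribution; the entropy of the true distribution doesn't depend on the model, so it drops out. The integral can be approximated by averaging over your training data. It's a direct expression of the information lost when you use your fitted probability distribution in place of the real data distribution.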
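
As a rough sketch of that sampling approximation: the function below averages the surprisal of a Gaussian model over a set of samples. The Gaussian model, the synthetic `data` array, and the name `avg_surprisal` are all assumptions made for illustration, not anything from a particular library.

```python
import numpy as np

def avg_surprisal(data, mu, sigma):
    """Mean negative log-density (in nats) of a Gaussian model over the data.

    This is the sample estimate of the cross-entropy; it differs from the
    KL divergence only by the entropy of the data distribution, which does
    not depend on the model parameters.
    """
    z = (data - mu) / sigma
    return float(np.mean(0.5 * z**2 + np.log(sigma) + 0.5 * np.log(2 * np.pi)))

rng = np.random.default_rng(0)
# Hypothetical stand-in for real training data.
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

# For a Gaussian model the minimiser is the maximum-likelihood fit:
# the sample mean and the (biased) sample standard deviation.
mu_hat, sigma_hat = data.mean(), data.std()

best = avg_surprisal(data, mu_hat, sigma_hat)
shifted = avg_surprisal(data, mu_hat + 0.5, sigma_hat)
widened = avg_surprisal(data, mu_hat, sigma_hat * 1.3)
print(best, shifted, widened)
```

Any other parameter choice gives a larger average surprisal than the maximum-likelihood fit, which is why minimising this quantity and minimising the KL divergence pick the same model.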

Calculating the JSD could be more difficult, since the expression uses a mixture of the 'true' and 'fitted' distributions. You can still simulate this, but half the time you'd be fitting the model to itself, and I just don't see why that would be useful.
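
To make the mixture concrete, here is a minimal Monte Carlo sketch of the JSD between two Gaussians. It assumes both densities can be evaluated, which is usually not the case for real data (you only have samples from the 'true' side), and the helper names `log_normal_pdf` and `jsd_estimate` are made up for this example.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log-density of a normal distribution, elementwise."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def jsd_estimate(rng, p, q, n=20_000):
    """Monte Carlo estimate (in nats) of JSD(p, q) for Gaussians p and q.

    JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = 0.5 * (p + q).
    Half of the samples are drawn from q, i.e. from the model itself.
    """
    (mu_p, s_p), (mu_q, s_q) = p, q
    xp = rng.normal(mu_p, s_p, n)  # samples from the 'true' distribution
    xq = rng.normal(mu_q, s_q, n)  # samples from the 'fitted' distribution

    def log_m(x):
        # log of the equal-weight mixture density, computed stably
        return np.logaddexp(log_normal_pdf(x, mu_p, s_p),
                            log_normal_pdf(x, mu_q, s_q)) - np.log(2)

    kl_pm = np.mean(log_normal_pdf(xp, mu_p, s_p) - log_m(xp))
    kl_qm = np.mean(log_normal_pdf(xq, mu_q, s_q) - log_m(xq))
    return float(0.5 * kl_pm + 0.5 * kl_qm)

rng = np.random.default_rng(0)
jsd_different = jsd_estimate(rng, (0.0, 1.0), (1.0, 1.0))
jsd_same = jsd_estimate(rng, (0.0, 1.0), (0.0, 1.0))
print(jsd_different, jsd_same)
```

Note that the `kl_qm` term is computed entirely from the model's own samples, and that the estimator needs the true density inside `log_m`, neither of which the plain KL estimate requires.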

I think the JSD is most useful when you need an actual metric (strictly, it's the square root of the JSD that is a metric), but as long as you have a fitted and a target distribution, the KL divergence is a natural fit, since you can interpret the result as information loss.
