Jensen–Shannon Divergence

129 points • by teleforce • last Friday at 7:27 PM • 21 comments

Comments

Love me some JSD. Here is a problem most people don't consider with generative modeling (e.g., AI text, image, music, video models): basically all standard pre-training algorithms for generative models (i.e., cross entropy, basically all diffusion/flow formulations) are closer to a Forward KL divergence. In other words, given limited capacity the model will try to stretch itself to cover every mode. This gives you a jack of all trades (lots of knowledge and diversity), but a master of none (you get blurry images and text filled with nonsense).

The real magic in generative modeling comes from the post-training process, which usually (e.g., RLHF) approximates Reverse KL (given limited capacity, try to perfectly cover what you can, but it's fine to drop the rest entirely). This gives amazing results, but is also the cause of AI oddities like the "AI Image Pixar Look", many of the verbal tics of LLMs, and all AI music using the same small set of voices. Jensen–Shannon Divergence sits right in the middle of Forward and Reverse KL and is what many GANs are claimed to approximate. Ideally, it is a better trade-off between diversity and fidelity.
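
For readers who want the three quantities side by side, here are the standard discrete definitions. The symbols are assumptions for illustration and are not from the comment above: P stands for the data distribution, Q for the model, and M for their equal-weight mixture.

```latex
\begin{aligned}
\text{Forward KL:}\quad & D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x)\,\log\frac{P(x)}{Q(x)} \\
\text{Reverse KL:}\quad & D_{\mathrm{KL}}(Q \,\|\, P) = \sum_x Q(x)\,\log\frac{Q(x)}{P(x)} \\
\text{JSD:}\quad & \mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)
\end{aligned}
```

Forward KL blows up wherever Q(x) is near zero but P(x) is not, which pushes the model to cover every mode. Reverse KL blows up wherever Q puts mass that P does not, which pushes the model to drop modes it cannot fit well. JSD compares both distributions against the mixture M, so it is symmetric in P and Q.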

sansseriff • today at 12:23 AM

It has applications outside of machine learning too! I used symmetric Kullback–Leibler divergence for a project with photon-number-resolving single-photon detectors during my PhD. I used it with an adjacency matrix to split a Gaussian mixture model (modelling some data with multivariate Gaussians) into a series of clusters.

https://snsphd.online/chapter_04/section_05_results/#photon-...

jalospinoso • today at 8:05 AM

I've been working on a field guide with colleagues. I'm interested in whether it is helpful for folks wanting a more applied view:

https://lospino.so/statistics/jensen-shannon-divergence/

Feedback welcome both from initiates (on helpfulness) and experts (on correctness)!

imurray • today at 8:51 AM

For those wanting alternatives to KL-divergence, the KL and Jensen–Shannon divergences are both F-divergences: https://en.wikipedia.org/wiki/F-divergence

wilted-iris • yesterday at 8:59 PM

This looks interesting and I'm curious if anyone has more context for why it's on the frontpage today.

ernsheong • today at 1:22 AM

I thought Jensen Huang was getting a divorce :D

lasermatts • yesterday at 10:08 PM

The Hacker News hive mind is real!

I was just reading about JSD the other day after reading about KL divergence...seems like a nifty measurement device for things like sim-to-real evaluations in robots (the reason I was going down this rabbit hole.)

I think the appeal over raw KL is that JSD behaves a bit nicer when the simulated and real distributions don't perfectly overlap...which is basically always true in the real world!
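
A minimal sketch of that overlap point, using only the Python standard library. The two histograms `sim` and `real` are made-up illustrative values, not data from any robot. With completely disjoint support, KL is infinite in both directions while JSD stays finite and hits its upper bound of ln 2 (in nats).

```python
import math


def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions.

    Returns math.inf when p puts mass where q has none.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def jsd(p, q):
    """Jensen-Shannon divergence in nats: average KL to the mixture m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical histograms over four bins with no overlap at all.
sim = [0.5, 0.5, 0.0, 0.0]
real = [0.0, 0.0, 0.5, 0.5]

print(kl(sim, real))   # infinite
print(kl(real, sim))   # infinite
print(jsd(sim, real))  # finite, equal to ln 2 up to rounding
```

The mixture is what keeps JSD finite: every bin that has mass in either distribution also has mass in `m`, so neither KL term ever divides by zero.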

navs • today at 5:11 AM

Currently piloting the use of JSD for a synthetic audience survey application, measuring how closely the synthetic response distribution matches a human panel.
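
A rough sketch of what such a comparison might look like for one survey question, assuming answers are tallied into the same fixed set of options. The counts and the names `synthetic_counts` and `human_counts` are hypothetical, not taken from any real panel. Using log base 2 puts the score between 0 (identical distributions) and 1 (no shared answers).

```python
import math


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

    def kl_to_mixture(a):
        # m[i] > 0 whenever a[i] > 0, so the ratio is always defined.
        return sum(ai * math.log2(ai / mi) for ai, mi in zip(a, m) if ai > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical tallies for a 5-point Likert question.
synthetic_counts = [12, 30, 41, 10, 7]
human_counts = [20, 25, 30, 15, 10]

score = jsd_bits(normalize(synthetic_counts), normalize(human_counts))
print(f"JSD = {score:.4f} bits")
```

Because the counts are normalized first, the two panels do not need to be the same size, though small panels will give noisy estimates.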

Been knee-deep trying to understand this world, so seeing this on Hacker News today is kind of scary.

rappatic • today at 12:13 AM

There is so much I don't understand

mountainriver • yesterday at 11:19 PM

Why not use this instead of KL in reinforcement learning?

