So this is basically an “embedding of embeddings”: multiple embeddings approximated and compressed into a single one, to reduce dimensionality/increase performance.
All this tells me is that if you can represent the “multiple embeddings” with a single embedding, they are probably mostly overlapping, and the marginal value of each additional one is probably low.
Otherwise I don’t see how you could keep comparable performance without violating information theory.
> marginal value of each additional one is probably low
This is the point of the paper: single embedding vectors are sparse enough that you can pack information from the additional vectors into one and still improve retrieval performance.
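For concreteness, here's a minimal sketch of what "compressing multiple embeddings into one" means in a retrieval setting. This is not the paper's method; it uses the crudest possible compression (mean pooling) on toy random data, and all dimensions and names are illustrative assumptions. It just contrasts a multi-vector MaxSim score against a single-vector score:

```python
# Hypothetical sketch, NOT the paper's actual method: compress a
# multi-vector representation into one vector via mean pooling, then
# compare how well the compressed ranking tracks the multi-vector one.
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # L2-normalize along the last axis so dot products are cosine sims.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy "documents": each is 8 token-level embeddings in 64 dims.
# Real embeddings share far more structure than random vectors do,
# which is exactly what makes this kind of compression viable.
docs = [normalize(rng.normal(size=(8, 64))) for _ in range(100)]
query = normalize(rng.normal(size=(4, 64)))  # 4 query-token embeddings

def maxsim_score(q, d):
    # Multi-vector (ColBERT-style) late interaction: each query token
    # matches its best document token; sum over query tokens.
    return (q @ d.T).max(axis=1).sum()

def pooled_score(q, d):
    # Single-vector approximation: mean-pool each side into one vector.
    return normalize(q.mean(axis=0)) @ normalize(d.mean(axis=0))

multi = np.array([maxsim_score(query, d) for d in docs])
single = np.array([pooled_score(query, d) for d in docs])

# How much of the multi-vector ranking survives the compression?
top_multi = set(np.argsort(-multi)[:10])
top_single = set(np.argsort(-single)[:10])
print(f"top-10 overlap: {len(top_multi & top_single)}/10")
```

The interesting question the paper addresses is how much better a learned compression can do than naive pooling like this.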