One thing I considered some months ago that was very similar to what you guys have done, but at a higher abstraction layer:
1. Consult many models (or a single model with higher temp) with the same prompt
2. Intelligently chunk the outputs (by entity, concept, subject, etc.)
3. Put each chunk into a semantic bucket (similar chunks live in the same bucket)
4. Select winning buckets by vote count, i.e. the buckets where most outputs agree.
4a. Optionally push the undervoted chunks back into the model contexts for follow-up: "is this a good idea, does it fit with what you recommended?", etc.
4b. Do the whole chunk/vote pass again.
5. Fuse outputs. Mention outliers.
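For the curious, the steps above can be sketched roughly like this. The chunking, similarity, and fusion here are all placeholder stand-ins I made up for illustration (sentence splitting, token Jaccard overlap, greedy clustering); the real versions would use an LLM or embedding model for each of those decisions:

```python
def chunk(output):
    # Naive stand-in chunker: one chunk per sentence. The real step 2
    # would split by entity/concept, likely with another LLM call.
    return [s.strip() for s in output.split(".") if s.strip()]

def similarity(a, b):
    # Stand-in for embedding similarity: token-set Jaccard overlap.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def bucket(chunks, threshold=0.5):
    # Greedy single-pass clustering (step 3): each chunk joins the
    # first bucket whose representative is similar enough, otherwise
    # it opens a new bucket.
    buckets = []
    for c in chunks:
        for b in buckets:
            if similarity(c, b[0]) >= threshold:
                b.append(c)
                break
        else:
            buckets.append([c])
    return buckets

def fuse(outputs, quorum=2):
    # Steps 4-5: buckets meeting the quorum win; the rest are outliers
    # (candidates for the optional 4a follow-up round).
    all_chunks = [c for out in outputs for c in chunk(out)]
    buckets = bucket(all_chunks)
    winners = [b[0] for b in buckets if len(b) >= quorum]
    outliers = [b[0] for b in buckets if len(b) < quorum]
    return winners, outliers

# Pretend these came from three model calls (step 1):
outputs = [
    "Use a cache. Add retries.",
    "Use a cache. Log every request.",
    "Use a cache. Add retries.",
]
winners, outliers = fuse(outputs)
print(winners)   # chunks with broad agreement
print(outliers)  # mention these, or push them back for follow-up
```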
Token spend is heavy here, since we rely on LLMs to make the decisions instead of the underlying math you guys went with. IMO, the solution y'all have reached is far more elegant than my idea.
I like the direction you're going with this strategy. There are many approaches, nuances, edge cases, and clever tricks to each of these steps, even without taking into account token probability distributions. Very powerful to get it right.