I worked on it for a more specialized task (query rewriting). It’s blazing fast.
A lot of inference code is set up for autoregressive decoding now. Diffusion is less mature. Not sure if Ollama or llama.cpp supports it.
Did you publish anything you could link wrt. query rewriting?
How was the quality?