Isn't the standard to use continuous batching? If they are using continuous batching -- I'm curious why generated token length matters, and why they might be clustering them. If not -- I'm curious why they aren't and what the tradeoff is.
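To make the question concrete, here is a toy sketch of why output length should matter much less under continuous batching. It only counts decode steps and ignores prefill, memory limits, and scheduling overhead; the function names and request lengths are made up for illustration and are not taken from any real serving stack.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when fixed batches run until their longest request ends."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # Every slot in the batch is held until the longest request finishes.
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(lengths)  # remaining tokens per waiting request (each >= 1)
    active = []
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical output lengths, in tokens.
lengths = [10, 500, 20, 480]
unsorted_static = static_batching_steps(lengths, batch_size=2)
sorted_static = static_batching_steps(sorted(lengths), batch_size=2)
continuous = continuous_batching_steps(lengths, batch_size=2)
```

In this toy model, clustering by length nearly halves the step count for static batching, while continuous batching gets about the same result without any clustering. That is why clustering by generated length looks odd if continuous batching is already in place.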
This "~512 batching" makes me think of things like diffusion or prefill.
If they managed to put together some dirty hack that lets them generate about 512 tokens' worth of reasoning in parallel instead of in sequence, that would explain it.