I think this is on point. I've really started to think about LLMs in terms of attention budget more than tokens. There are only so many things they can do at once, so which ones are most important to you?
Outputting "filler" tokens also basically doesn't require much "thinking" for an LLM, so the "attention budget" can be used to compute something else during the forward pass that produces each of those tokens. So besides the additional constraints imposed, you're also removing one of the ways in which it thinks. Explicit CoT helps mitigate some of this, but if you want to squeeze out every drop of computational budget you can get, I'd think it beneficial to keep the filler as-is.
If you really wanted to, you could just have a separate model summarize the output to remove the filler.
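For what it's worth, here's a rough sketch of that two-pass idea. Everything in it is made up for illustration: `Model`, `answer_then_condense`, `reasoner`, and `summarizer` are hypothetical names, and the two callables stand in for whatever client you actually use to call a model. The prompt wording is just a guess at something reasonable, not a tested recipe.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns model text.
Model = Callable[[str], str]


def answer_then_condense(question: str, reasoner: Model, summarizer: Model) -> str:
    """Let one model answer unconstrained, then have a second call trim the filler."""
    # Pass 1: no brevity constraints, so the model keeps its usual verbose style.
    verbose = reasoner(question)

    # Pass 2: a separate call whose only job is compression.
    prompt = (
        "Rewrite the following answer concisely. Keep every fact and conclusion; "
        "drop filler and repetition.\n\n" + verbose
    )
    return summarizer(prompt)
```

The point of splitting it this way is that the brevity constraint only ever applies to the second call, so the first model's output is left alone. The obvious cost is an extra model call per answer.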