I think you’ll have to do multi-shot generation to correct this, since each diffusion pass is going to represent a single “thought”.
Though with the speed it’s running, that’s not necessarily a deal breaker. I suspect diffusion models will need different harnesses to be effective.
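
A rough sketch of what such a harness could look like, assuming each call to the model is one complete diffusion pass (one "thought") and that some checker can say what is wrong with a draft. The names `multi_shot`, `generate`, `critique`, and `max_shots` are all hypothetical and not taken from any real library:

```python
from typing import Callable, Optional


def multi_shot(
    generate: Callable[[str], str],            # hypothetical: one full diffusion pass
    critique: Callable[[str], Optional[str]],  # hypothetical: returns None if the draft is fine
    prompt: str,
    max_shots: int = 4,
) -> str:
    """Run up to max_shots complete generations, feeding feedback into each retry."""
    draft = generate(prompt)
    for _ in range(max_shots - 1):
        feedback = critique(draft)
        if feedback is None:
            break  # checker is satisfied, stop early
        # The model cannot revise mid-pass, so the correction goes into a new pass.
        draft = generate(
            f"{prompt}\n\nPrevious attempt:\n{draft}\n\nFix: {feedback}"
        )
    return draft
```

If a single pass is fast, a few extra shots like this stay cheap, which is why the retry loop lives in the harness rather than in the model.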