I would think it’s due to the non determinism. Leaking context would be an unacceptable flaw since many users rely on the same instance.
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.
Can you explain the "same instance" and user isolation? Can context be leaked since it is (secretly?) shared? Explain pls, genuinely curious