I would respectfully disagree on this. The way I write tests right now: I ask Claude/Codex to create an eval, and it just spins up a background LLM agent worker which verifies the tests in the sandbox/internally.
So I would say that atm in-house testing is easier than external testing for us.