If you have particular performance requirements like that, then include them. Test for them. You still don’t have to actually look at the code. Either the software meets expectations or it doesn’t, and keep having AI work at it until you’re satisfied.
How deep do you want to go? Because reasonable person wouldn't have expected to hand hold AI(ntelligence) to that level. Of course after pointing it out, it has corrected itself. But that involved looking at the code and knowing the code is poor. If you don't look at the code how would you know to state this requirement? Somehow you have to assess the level of intelligence you are dealing with.