That’s an interesting way to think about it. While tests don’t satisfy mathematicians’ standards for rigor, one could instead look at interactive proofs from complexity theory. These are of interest when a problem doesn’t allow for short proofs, i.e. when the problem is not in NP [1]. In your scenario, an adapted AI-assisted theorem prover would be the prover, and a mathematician the verifier.
Thank you for explaining my point more logically and coherently. I'll read it over.