> But more importantly, LLMs aren’t stateless across executions, which is a huge no-no.
They are, actually. A "fresh chat" with an LLM is non-deterministic but also stateless. Of course, agentic workflows add memory, possibly RAG, etc., but that memory is stored somewhere in plain English; you can just go and look at it. Such a workflow may not be stateless, but its state is fully known.
Using the managed runtime analogy, what you are saying is that, if I wanted to benchmark LLMs like I would with runtimes, I would need to account for the delta between versions, plus the delta between whatever memory each of them may have. I don’t see how that helps with reproducibility.
Perhaps more importantly, how would I quantify such “memory”? In other words, how could I verify that two memory inputs are the same, and how could I formally characterize the full set of such inputs that lead to the same outputs?
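
For the narrower of those two questions, exact equality, here is a rough sketch of what a check might look like, assuming the memory really is an ordered list of plain-text entries as claimed above. The function name `memory_fingerprint` and the sample entries are hypothetical, not taken from any real agent framework. It only detects literal sameness (after trivial whitespace and Unicode normalization); it says nothing about whether two differently worded memories would lead to the same outputs.

```python
import hashlib
import json
import unicodedata


def memory_fingerprint(entries):
    """Return a SHA-256 hex digest for an ordered list of plain-text memory entries.

    Hypothetical helper: assumes memory is just a list of strings.
    """
    # Normalize Unicode and trim surrounding whitespace so purely cosmetic
    # differences do not change the fingerprint.
    normalized = [unicodedata.normalize("NFC", e).strip() for e in entries]
    # JSON keeps entry boundaries unambiguous: ["ab", "c"] and ["a", "bc"]
    # serialize differently, unlike naive string concatenation.
    payload = json.dumps(normalized, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up memory contents, for illustration only.
a = ["User prefers metric units.", "Project uses Python 3.12."]
b = ["User prefers metric units. ", "Project uses Python 3.12."]  # trailing space
c = ["User prefers imperial units.", "Project uses Python 3.12."]

same = memory_fingerprint(a) == memory_fingerprint(b)
different = memory_fingerprint(a) != memory_fingerprint(c)
```

Under these assumptions `same` and `different` are both true, so two runs could at least be labeled with the fingerprint of the memory they started from. The harder half of the question, which distinct memories are equivalent with respect to outputs, is not something a hash can answer.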