
manuelabeledo • yesterday at 11:31 AM

Using the managed runtime analogy, what you are saying is that if I wanted to benchmark LLMs the way I would benchmark runtimes, I would need to account for the delta between model versions, plus the delta between whatever memory each one may have. I don’t see how that helps with reproducibility.

Perhaps more importantly, how would I quantify such “memory”? In other words, how could I verify that two memory inputs are the same, and how could I formalize the complete set of such inputs that yield the same outputs?
