I don't even look at benchmarks anymore. I just try different models as they're released on our large, proprietary, systems software codebases in real, shipping products or projects that will ship eventually. It's pretty clear which models help me do my job better or faster. I'm fortunate enough to have the token budget to use basically as much as I need, for now.
No need for benchmarks, evals, marketing, system cards or anything like that. I read the web for tips, practices and release announcements. My colleagues and I share our experiences with each other but beyond that, everything else is just noise.
This is the way. Not that big of a budget here. But if there’s something promising, I just try it for a month or so. But even then… at this moment I’m using z.ai models and those do the job. No need for anything else. So I’m staying until there is something new that’s equally affordable but a lot better. (Using a coding plan)
I am not against AI, but I do wonder how you guys handle the fact that this leaks all your code, which then gets stored forever on servers belonging to God knows who?
I “trust” OpenAI and Anthropic (somewhat), but to be honest I still only feel safe using them on code without any secret sauce whatsoever. Luckily that’s a lot of code, but still. I wonder how others are looking at this?
(FYI I feel the same about GitHub and we also don’t store our code there)