For real production I find the switching cost is not as trivial as you portray. Even going to a new model version in the same model family, say GPT-4o to GPT-5.2, a transition I just finished on a not too complicated application, requires extensive retesting and tweaking of prompts, guardrails and parameters.
Maybe OP meant switching in a coding harness way? Not an application using AI? I had similar issues like you in the latter case, but in the former it's trivial.
if you’re building on LLMs you gotta have an eval and prompt iteration pipeline, and you ought to be evaling every model release — your competitors will do this, and your users will want the latest and greatest (for frontier tasks) and the cheapest/fastest. So you should already be paying this cost anyways. i guess it depends on your team size and scale but not building this muscle seems like not having continuous delivery for regular code or even like not having tests and ci to merge to main.
SOTA models are typically used for interactive coding and other human in the loop work
> say GPT-4o to GPT-5.2, a transition I just finished on a not too complicated application
Neither of which is close to SOTA, because tasks like these are typically built on a cost conscious manner which tries to keep token costs in check.
I’m primarily responding to all of the commenters who are acting like nobody is going to use American SOTA models for anything because the government interfered with them for a couple weeks. It’s obviously not true, and I expect these models to be oversubscribed instead of avoided like some are claiming.
Vendor diversity is a longstanding risk management principle. For it to work you need to invest in it as you build, not when the rug is pulled.
I second this; even switching between minor versions of a model, you need to adjust prompts: the new model is better by implying a bunch of things that, when included in the prompt, will overdo that thing.
Assessing quality of output is often not trivial, either. Typically, problems that are solved by offloading something to an LLM are super subjective, and customers “feel” something is different is vulnerable.
We try to quantify output differences by many different similarity metrics. But a lot of energy goes into subjectively evaluating if something still works.