I believe the lack of quick, evident profit increases is partly a failure of imagination, or a failure to understand that AI agents are different from people: more impressive or faster in some ways, but much, much less reliable in others.
The evolution of harnesses like Claude Code or OpenCode, and meta-harnesses like Ralph loops, Gas Town, claws, etc., will progressively allow for better results and abilities even if models stopped evolving. And if the Mythos eval numbers are to be believed, there is no hard ceiling in sight yet.
At the same time, small models like Qwen that can run in a PC's VRAM or unified RAM are becoming more useful.
I predict that having more and more loops within loops within loops, and layers of cloud/local models of different capabilities, will solve a great many limitations of today's LLMs... at the cost of speed and token count.
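To make the loops-within-loops idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `Tier`, `run_tiered`, and the two stub models are names I made up for illustration, and the "models" are plain functions rather than real API calls. The inner loop retries one model with feedback, the outer loop escalates from a cheap local model to a more capable cloud one, and the token counter shows where the cost goes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: a "model" is any function from prompt to (answer, tokens_used).
Model = Callable[[str], Tuple[str, int]]


@dataclass
class Tier:
    name: str
    model: Model
    max_attempts: int


def run_tiered(
    prompt: str, tiers: List[Tier], verify: Callable[[str], bool]
) -> Tuple[Optional[str], Optional[str], int]:
    """Return (answer, tier_name, tokens_spent); answer is None if every tier fails."""
    tokens_spent = 0
    for tier in tiers:  # outer loop: escalate to the next, more capable tier
        feedback = ""
        for _ in range(tier.max_attempts):  # inner loop: retry with feedback
            answer, used = tier.model(prompt + feedback)
            tokens_spent += used
            if verify(answer):
                return answer, tier.name, tokens_spent
            feedback = f"\nA previous attempt was rejected: {answer!r}"
    return None, None, tokens_spent


# Stub models standing in for a small local model and a larger cloud model.
def local_stub(prompt: str) -> Tuple[str, int]:
    return "41", 10  # cheap, but wrong in this toy example


def cloud_stub(prompt: str) -> Tuple[str, int]:
    return "42", 100  # expensive, but passes verification


tiers = [Tier("local", local_stub, 3), Tier("cloud", cloud_stub, 2)]
result = run_tiered("What is 6 * 7?", tiers, verify=lambda a: a == "42")
print(result)  # ('42', 'cloud', 130)
```

In this toy run the local tier burns three attempts before the cloud tier succeeds, so the right answer costs 130 tokens instead of 100. That is the speed and token trade-off in miniature.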
We've never before had a tool that is at once as unreliable and as complicated as GenAI. It will take us a minute to figure out how to use it properly.
Actually I think the opposite - we will learn that the most important thing is the ability to manage context & steer these models, not to build a Rube Goldberg machine around them. Some of the top-performing agent harnesses on Terminal Bench provide literally one tool, tmux, and still outperform Claude Code et al. To me, the most important thing by far for getting reasonable output from these machines is what you put into them.
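For a sense of how small a one-tool harness can be, here is a rough sketch in Python. It is not the code of any actual Terminal Bench entry; the function names are hypothetical, and the only real things in it are the tmux subcommands `new-session`, `send-keys`, and `capture-pane`. The model gets a single action: type something into a terminal and read back what is on the screen.

```python
import subprocess
from typing import List


def new_session_cmd(session: str) -> List[str]:
    # Start a detached tmux session for the agent to work in.
    return ["tmux", "new-session", "-d", "-s", session]


def send_keys_cmd(session: str, keys: str) -> List[str]:
    # Type the text into the session, then press Enter.
    return ["tmux", "send-keys", "-t", session, keys, "Enter"]


def capture_cmd(session: str) -> List[str]:
    # Print the visible contents of the pane to stdout.
    return ["tmux", "capture-pane", "-p", "-t", session]


def tmux_tool(session: str, keys: str) -> str:
    """The single tool: send keystrokes, return what the screen shows.

    Assumption: a real harness would wait or poll before capturing,
    since the command may not have finished when we read the pane.
    """
    subprocess.run(send_keys_cmd(session, keys), check=True)
    shown = subprocess.run(
        capture_cmd(session), check=True, capture_output=True, text=True
    )
    return shown.stdout
```

Everything else, such as which commands to run and how much of the screen to keep in context, is left to the prompt and the model, which is the point being made here.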