Yeah, you're definitely right about the shifting goalposts ("it's a stochastic parrot" -> "it hallucinates all the time, it can't even get APIs right" -> "it can generate functions but can't reason about the codebase" -> "the bottleneck was never shipping code")
At the same time, humans can move up the abstraction ladder faster than the LLMs can. At least, some humans. Agents can produce lots of code. They can also do the entirely wrong thing. The impact of wrong decisions has been massively write-amplified by more and more intelligent LLMs. With earlier ones, if it got a sentence or a function wrong, you reprompted; the cost of a mistake was 10 seconds. Now you can burn hours or even days of work on the entirely wrong thing without a competent human operator stepping in and course-correcting.
The trajectory of agents has been bigger and bigger context windows and more autonomy, but at the same time, a bigger blast radius. In this context, I don't think human experts will be out of their jobs any time soon.
> At the same time, humans can move up the abstraction ladder faster than the LLMs can
This was kind of the point: it's only true for now. I agree with you that this kind of stuff will take longer. I don't think there's good training data for it right now. Handling abstractions and course-correcting is probably the job now, and it also happens to be exactly the data we'll be typing into our prompts. They'll train on it, or something like it.