I think it might be a good idea to build a local-first harness designed to fully saturate local hardware by churning through experiments on Gemma 4 (or another local model) 24/7, only occasionally calling Claude Opus for big architectural decisions and hard-to-fix bugs.
Something like:
* Human + Claude Opus set up the project direction and identify research experiments that a local model can perform
* Gemma 4 on local hardware autonomously performs smaller research experiments / POCs, including autonomous testing and validation steps that burn a lot of tokens but can convincingly prove that a POC works. These runs are automatically scheduled to fully utilize the local hardware; there could even be a prioritization system so that POC experiments only run when no more urgent request is pending. The local model has the option to call Opus if it is truly stuck on a task.
* Once an approach is proven through experimentation, the human works with Opus to implement it in the main project from scratch
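The scheduling-plus-escalation loop above could be sketched roughly like this. Everything here is hypothetical (the `Harness`/`Task` names and the stubbed `escalate` callback are illustrative, not from any real project); in practice `escalate` would be an actual Claude Opus API call and `run` would drive the local model:

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch: POC experiments sit in a priority queue so they only
# run when nothing more urgent is waiting, and a task that reports "stuck"
# is escalated to a (stubbed) Opus call.

@dataclass(order=True)
class Task:
    priority: int                                   # lower number = more urgent
    name: str = field(compare=False)
    run: Callable[[], str] = field(compare=False)   # returns "done" or "stuck"

class Harness:
    def __init__(self, escalate: Callable[[str], str]):
        self.queue: list[Task] = []
        self.escalate = escalate    # in practice: a Claude Opus API call
        self.log: list[str] = []

    def submit(self, task: Task) -> None:
        heapq.heappush(self.queue, task)

    def drain(self) -> None:
        # Keep the hardware saturated: always run the most urgent task next.
        while self.queue:
            task = heapq.heappop(self.queue)
            result = task.run()
            if result == "stuck":
                result = self.escalate(task.name)
            self.log.append(f"{task.name}: {result}")

harness = Harness(escalate=lambda name: f"opus-unblocked {name}")
harness.submit(Task(priority=5, name="poc-cache-eviction", run=lambda: "stuck"))
harness.submit(Task(priority=1, name="urgent-user-request", run=lambda: "done"))
harness.drain()
print(harness.log)
# → ['urgent-user-request: done', 'poc-cache-eviction: opus-unblocked poc-cache-eviction']
```

A real version would run this loop continuously and refill the queue from a backlog of planned experiments, but the priority ordering and the escalation hook are the essential pieces.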
If you can get a complex harness to work on models in this weight class paired with the right local hardware (maybe your old gaming GPU plus 32 GB of RAM), you can churn through millions of output tokens a day (and probably on the order of 100 million input tokens, though the vast majority are cached). The main cost advantage over cloud models is actually that you have total control over prompt caching locally, which makes repeated input basically free, whereas most API providers for small LLMs charge full price for input tokens even when the prompt is repeated verbatim across every request.
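A back-of-envelope illustration of why that caching control matters: agentic requests share a long static prefix (system prompt plus project context), so with a local prefix cache only the new suffix needs prefill. The numbers and helper names below are made up for illustration:

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    # Length of the shared leading token run between two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefill_cost(requests: list[list[str]]) -> tuple[int, int]:
    """Return (tokens prefilled with no cache, with a single-slot prefix cache)."""
    no_cache = sum(len(r) for r in requests)
    with_cache = 0
    prev: list[str] = []
    for r in requests:
        reuse = common_prefix_len(prev, r)   # tokens already in the KV cache
        with_cache += len(r) - reuse         # only the fresh suffix is prefilled
        prev = r
    return no_cache, with_cache

prefix = ["ctx"] * 9000                               # ~9k tokens of shared context
reqs = [prefix + [f"q{i}"] * 100 for i in range(50)]  # 50 requests, 100 new tokens each
total, cached = prefill_cost(reqs)
print(total, cached)
# → 455000 14000
```

Here a cache cuts prefill work by roughly 30x; a cloud API that bills all 455k input tokens at full price erases exactly that advantage.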
Nico Bailon (author of many extensions for Mario's Pi coding agent) has something that might be a good starting point for this:
https://x.com/i/status/2043054831947198640
https://github.com/nicobailon/pi-model-switch