My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.
The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.
As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.
And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.
I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?
Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.