How does Pi+Qwen (local) compare to Anthropic's offerings? Surely you're not getting the same breadth and quality of output using Qwen? How is the performance?
It’s a toy compared to Opus or Sonnet. Obviously a multi-trillion-parameter model running on $$$$ hardware is going to outperform a local model.
So far I've mostly just set things up and done some benchmarking on several local models (Qwen 1.7b, 4b, 9b & 35b a3b), using a set of capability prompts created and evaluated by Claude, plus HumanEval and MBPP (I haven't completed the latter two). On the capability set, 1.7b got 6/8 correct at ~14.7 tok/s, up to 35b at 8/8 and ~4.5 tok/s; I can share full results if anyone's interested. I've also set up llama-swap so I can dynamically select between them. Next I need to decide which of my projects to really test them on, with the awareness that I'll have to be much more involved.
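For anyone curious about the llama-swap setup: it's a proxy that launches whichever llama.cpp server matches the `model` field of an incoming OpenAI-style request, driven by a YAML config. A minimal sketch of what mine looks like (model names, paths, and flags here are illustrative, not my exact config):

```yaml
# llama-swap config sketch: one entry per model; llama-swap starts the
# matching cmd on demand and kills it when a different model is requested.
models:
  "qwen-1.7b":
    cmd: llama-server --port ${PORT} -m /models/qwen-1.7b.gguf -ngl 99
  "qwen-4b":
    cmd: llama-server --port ${PORT} -m /models/qwen-4b.gguf -ngl 99
```

Then clients just point at the llama-swap port and pick a model per request, e.g. `"model": "qwen-4b"` in the chat completions body, and the swap happens transparently.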