
SwellJoe · yesterday at 4:59 PM

Most of the common ways to run local LLMs include a chat interface. llama.cpp's `llama-server` stands up a chat interface on port 8080, along with an OpenAI-compatible API. LM Studio is a desktop app that offers a chat interface and an API as well. Unsloth Studio, too.
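As a concrete sketch of what that looks like, here's a plain-requests call against `llama-server`'s OpenAI-compatible endpoint, assuming the server's defaults (port 8080, one loaded model whose name the server mostly ignores); adjust the port and prompt to taste:

```python
# Minimal sketch: query llama-server's OpenAI-compatible chat endpoint.
# Assumes the server is already running on its default port, e.g.:
#   llama-server -m ./your-model.gguf
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server serves one model; the name is mostly ignored
        "messages": [
            {"role": "user", "content": "Explain what an OpenAI-compatible API is."}
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The same request shape works against LM Studio's API server, just with a different port.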

LM Studio is nice in that it makes it easy to add tools, like search. Qwen 3.6 is such a small model that it lacks a lot of knowledge of the world (so it hallucinates at an uncomfortable rate, a common failure mode of very small models), but it can use tools, so being able to search lets it research before answering. Its reasoning and tool calling are pretty good, so it's actually pretty effective.

I've been comparing Gemma 4 (31B at 8-bit, also very good with tools and reasoning for its size) and Qwen 3.6 (27B at 8-bit) against Claude Opus and Gemini Pro lately. Obviously the frontier models are better, but most of the time I find the tiny models are fine. I'm still not quite at the point where I'd be willing to code with local models: the time wasted on hallucinations, logic bugs, and sloppy coding practices is much higher, as is the cost of security bugs that make it past review.
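For reference, the tool-calling loop these servers speak is the standard OpenAI one. Here's a minimal sketch against a local endpoint; the `web_search` helper is hypothetical (you'd wire it to whatever search backend you actually use), and tool support varies by model and server:

```python
# Sketch of the OpenAI-style tool-calling loop against a local server
# (llama-server and LM Studio both expose this API shape).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def web_search(query: str) -> str:
    """Hypothetical helper: call your search backend, return snippets as text."""
    return f"(results for {query!r} would go here)"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What's new in the latest llama.cpp release?"}]
resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
msg = resp.choices[0].message

# If the model chose to call the tool, run it and feed the result back
# so the model can answer with the search results in context.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(**args),
        })
    resp = client.chat.completions.create(model="local", messages=messages, tools=tools)

print(resp.choices[0].message.content)
```

This is the "research before answering" flow: the small model doesn't need the facts baked into its weights, it just needs to be reliable at deciding when to call the tool and at reading the results back.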