Did you try an MLX version of this model? In theory it should run a bit faster. I'm hesitant to download multiple versions though.
Is there a reliable way to run MLX models? On my M1 Max, LM Studio sometimes outputs garbage through its API server even when the LM Studio chat with the same model is perfectly fine. llama.cpp variants almost always just work.
Haven't tried. I'm too used to llama.cpp at this point to switch to something else. I like being able to just run a model and automatically get:
- OpenAI completions endpoint
- Anthropic messages endpoint
- OpenAI responses endpoint
- A slick-looking web UI
Without having to install anything else.
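For reference, here's what hitting that OpenAI-style completions endpoint looks like. This is a minimal sketch, assuming llama-server is running locally on its default port (8080); the `model` field is a placeholder, since llama-server serves whatever model it was launched with:

```python
import json
import urllib.request

# Assumes a local llama-server started with something like:
#   llama-server -m model.gguf
# 8080 is llama-server's default port; adjust if you changed it.
payload = {
    "model": "local",  # placeholder; the server uses the model it was launched with
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```

Any OpenAI client library works the same way if you point its base URL at `http://localhost:8080/v1`.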