Hacker News

accountofthaha · today at 10:36 AM

Does the 'zero CUDA dependency' also apply to running it on my own device? I have an older AMD card. Would love to have a small version of this running for coding purposes.

Really nice to see the Chinese are competing this strongly with the rest of the world. Competition is always nice for the end-consumer.


Replies

adrian_b · today at 1:55 PM

The model is open weights, so you can download it from the link given at the top.

Then you can run it with an inference backend such as llama.cpp, on any hardware that backend supports.
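A rough sketch of what that looks like with llama.cpp (the model and repository names here are placeholders, not from the thread; a GGUF conversion of the weights would need to exist, typically published by community quantizers on Hugging Face):

```shell
# Download a 4-bit quantized GGUF conversion of the weights
# (repo name is hypothetical):
huggingface-cli download some-org/some-model-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run it with llama.cpp's CLI. A Vulkan or ROCm build of llama.cpp
# needs no CUDA at all, so it works on AMD cards; -ngl offloads
# as many layers as fit onto the GPU.
./llama-cli -m ./models/some-model-Q4_K_M.gguf -ngl 99 \
  -p "Write a quicksort in Python."
```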

However, this is a big model, so even quantized it needs a lot of memory to run.
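To get a rough sense of the numbers, here is a back-of-the-envelope estimate of weight memory at different quantization levels (the 100B parameter count is an illustrative placeholder, not a figure from the thread; KV cache and runtime overhead are ignored):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory needed just for the weights, in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A hypothetical 100B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(100, bits):.0f} GB")
# 16-bit: ~200 GB, 8-bit: ~100 GB, 4-bit: ~50 GB
```

Even at 4 bits per weight, a model of that size is far beyond a single consumer GPU's VRAM, which is why quantization alone does not make it fit on an older AMD card.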

The alternative is to run it much more slowly, by keeping the weights on an SSD and streaming them in during inference. Results on optimizing inference for this setup have already been published, and I expect it will become more common in the future.
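The mechanism behind this is memory-mapping: the engine maps the weight file into its address space and lets the OS page tensor data in from the SSD on demand, instead of loading everything into RAM up front. A toy sketch of the idea, using only the standard library (real engines such as llama.cpp mmap a GGUF file; this just mmaps a flat array of float32 values):

```python
import mmap
import os
import struct
import tempfile

def write_weights(path: str, values: list[float]) -> None:
    """Write a flat float32 'weight file' to disk."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}f", *values))

def dot_from_disk(path: str, x: list[float]) -> float:
    """Dot product that reads weights through mmap, one value at a time,
    so only the pages actually touched are brought into memory."""
    total = 0.0
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i, xi in enumerate(x):
            (wi,) = struct.unpack_from("f", mm, i * 4)
            total += wi * xi
    return total

path = os.path.join(tempfile.gettempdir(), "toy_weights.bin")
write_weights(path, [1.0, 2.0, 3.0])
print(dot_from_disk(path, [4.0, 5.0, 6.0]))  # 1*4 + 2*5 + 3*6 = 32.0
```

The throughput bottleneck then becomes SSD read bandwidth rather than RAM size, which is where the slowdown comes from.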

There are cases where running a better model slowly is still preferable to quickly running a model that gives poor results, especially when you use it not conversationally but for agentic work.