I don't think llama.cpp supports any of the LongCat models, actually.
They haven't posted weights/inference solutions for LongCat-2.0 [1], but LongCat-Next had transformers support, which I assume means it works with vLLM/SGLang.
Given it's 1.6T parameters, "common hardware" is probably out of the question: even at 2bpw the weights alone come to about 400GB, and that's before considering the bandwidth requirements for 48B active parameters. I haven't read the LongCat-2.0 architecture docs, but if you're not running GLM-5.2, you're probably not running this either.
[1] https://huggingface.co/meituan-longcat/LongCat-2.0: "Model weights coming soon — stay tuned!"
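For anyone who wants to check the 400GB figure, here is a rough back-of-the-envelope sketch. It assumes the footprint is simply parameters times bits per weight, in decimal GB, and ignores KV cache, activations, and quantization metadata, so real files would be somewhat larger. The function name is made up for illustration.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB.

    Assumption: size = params * bits / 8, with no overhead for
    quantization scales, KV cache, or activations.
    """
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 1.6e12  # 1.6T total parameters, as quoted in the comment

if __name__ == "__main__":
    # Print the estimate for a few common quantization widths.
    for bpw in (2, 4, 8, 16):
        gb = weight_footprint_gb(TOTAL_PARAMS, bpw)
        print(f"{bpw:>2} bpw: {gb:,.0f} GB")
```

At 2 bpw this works out to 400 GB, and every doubling of the bit width doubles the total.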
Ah yes, but because it’s a MoE model with 48B active parameters, it might be possible to run it locally on specialised setups, such as a machine with 256GB of unified memory.
Many MoE models seem (?) to only require enough memory to load the active experts.
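A rough way to sanity-check the 256GB idea, using the 1.6T total and 48B active figures from this thread. This sketch assumes all expert weights have to sit in memory (no streaming from disk) and that decode speed is limited purely by reading the active weights once per token. The 500 GB/s bandwidth is a hypothetical number, not a spec for any particular machine, and the function names are made up.

```python
def max_bpw_that_fits(total_params: float, memory_gb: float) -> float:
    """Largest average bits-per-weight at which all weights fit in memory.

    Assumption: every expert is resident; no room is reserved for
    KV cache, activations, or the OS.
    """
    return memory_gb * 1e9 * 8 / total_params


def decode_tokens_per_s_upper_bound(
    active_params: float, bits_per_weight: float, bandwidth_gb_s: float
) -> float:
    """Bandwidth-only ceiling on decode speed.

    Assumption: each token reads every active weight exactly once,
    and nothing else limits throughput.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    total, active = 1.6e12, 48e9  # figures quoted in the thread
    print(f"max bpw in 256 GB: {max_bpw_that_fits(total, 256):.2f}")
    # 500 GB/s is a hypothetical memory bandwidth, for illustration only.
    bound = decode_tokens_per_s_upper_bound(active, 2, 500)
    print(f"tokens/s ceiling at 2 bpw, 500 GB/s: {bound:.1f}")
```

Under these assumptions the whole model would need to average about 1.28 bits per weight to fit in 256GB, so the active parameter count mostly helps with speed per token rather than with how much memory you need.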
Yeah, to me it seems like an "if you have to ask, you can't run it" type of question.
In general, the TL;DR is that anything above 35B needs hardware you'd basically buy only to run large LLMs, and if you have that hardware, you don't need to ask the question.