https://arxiv.org/pdf/2310.11453 The original paper [fig 1, bottom-right] seems to say it needs about 4-5x the parameters of an fp16 model. You can build it and run some models, but the selection is limited because it has to be trained from scratch. I imagine inference speed is faster compared with modern PTQ (4- and 8-bit quants), though.
> bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).
One bit or one trit? I am confused!
But there is no trained 100B-param model? "Can run a 100B BitNet" is about the inference implementation, not about the existence of any such model.
It's good to see this getting some continued development. I looked into it last year[1] and I thought it showed a lot of promise so I've been very disappointed that I never saw a newer model.
I wonder when we'll begin to see the dividends of all these NPU PCs come into play. AMD has been doing some good work with their NPU/iGPU hybrid inference kernels. If these larger models could be scaled down to run on NPUs, you'd see much better power advantages compared to running them on the CPU.
The energy numbers are the real story here: a 70-82% reduction on CPU inference. If 1-bit models ever get good enough, running them on commodity hardware with no GPU budget changes who can deploy LLMs. That's more interesting than the speed benchmarks imo.
I'm curious if 1-bit params can be compared to 4- or 8-bit params. I imagine that 100B is equivalent to something like a 30B model? I guess only evals can say. Still, being able to run a 30B model at good speed on a CPU would be amazing.
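A quick back-of-envelope on the memory footprint (my own arithmetic, not numbers from the repo or the paper):

```python
# Rough model-size math (my own back-of-envelope, not figures from bitnet.cpp).
# fp16 is 16 bits/param; int8 is 8; int4 is 4; packed ternary is ~1.58 bits/param.
def model_gb(params_billion, bits_per_param):
    """Weight storage in decimal GB for a model of the given size."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("ternary ~1.58", 1.58)]:
    print(f"100B @ {label:>13}: {model_gb(100, bits):6.1f} GB")
# A 100B ternary model lands around 20 GB of weights, i.e. roughly the
# footprint of a 4-bit-quantized ~40B model -- so it fits in desktop RAM.
```

Of course that only compares storage; whether 100B ternary params carry the *quality* of a 30B fp16 model is exactly the eval question.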
One of the things I often wonder is: what will be the minimally viable LLM that can work from just enough baked-in information that, if it googles the rest, it can provide reasonable answers? I'm surprised something like Encyclopaedia Britannica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLM companies and validating outputs for them; it would make a night-and-day difference in some areas, I would think. Wikipedia is nice, but there's so much room for human error and bias there.
Headline: 100B. Falcon 3 family: 10B. An order of magnitude off
That's amazing. I'm developing sub-tools for LLMs as a hobby on an RTX 3050 (4GB), but I can only run lightweight models like 1B and 2B. Is it possible to use this tool to offload some of what would otherwise sit in VRAM onto the CPU?
If they had a big result, like native 1.58-bit quality clearly matching top peers, they would be saying so prominently in the repo.
The engineering/optimization work is nice, but it's not what people have been waiting for, which is: can the BitNet idea that seemed so promising actually deliver in a competitive way?
The output from this model is horrible! It's GPT-2 level babble and repeats entire paragraphs verbatim. It also reuses the same fake citation `(Jenkins, 2010)` over and over again. From the start of their video (which scrolls by fast enough that you don't see the slop clearly...)
```
Ecosystem Services and their impact on the Ecosystem

Ecosystem services refer to the services provided by ecosystems to the human society. These services include water, air, energy, nutrients, and soil (Jenkins, 2010). For instance, water is the most important service provided by an ecosystem and it helps in the conservation of water, irrigation and sanitation (Jenkins, 2010). On the other hand, air provides the oxygen needed for life.

The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.

The water cycle is a significant ecosystem service because it involves the cycling of water among the different parts of an ecosystem. It also involves the movement of water through the atmosphere, from one place to another. It is also the process of evaporation and condensation of water from the atmosphere. It also involves the movement of water from the air to the soil and water into the oceans.
```
Headline says a hundred billion parameters, yet none of the official models are over 10 billion parameters. Curious.
steve jobs would have loved the microsoft repo with demo on mac
They have a demo video in the readme. I think they are trying to convey that BitNet is fast, which it is. But it is worth taking a moment to pause and actually see what the thing is doing so quickly.
It seems to keep repeating that the water cycle is the main source of energy for all living things on the planet and then citing Jenkins 2010. There are also a ton of sentences beginning with "It also…"
I don’t even think it’s correct. The sun is the main source of energy for most living things but there’s also life near hydrothermal vents etc.
I don’t know who Jenkins is, but this model appears to be very fond of them and the particular fact about water.
I suppose fast and inaccurate is better than slow and inaccurate.
> A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2
With how much RAM? How much storage does it require?
What’s the lower limit on the number of bits per parameter? If you use CSR-style sparse matrices to store the weights can it be less than 1?
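For what it's worth, the 1.58 figure is just log2(3) bits for a uniform ternary distribution, and information-theoretically you can go below 1 bit per parameter once the weights are sparse enough, which is where a CSR-style format would pay off. A toy entropy calculation (my own sketch, not anything from the paper):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per weight for a ternary distribution
    (p_neg, p_zero, p_pos)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform ternary weights: the familiar log2(3) ~= 1.585 bits/weight.
print(entropy_bits([1/3, 1/3, 1/3]))

# If 90% of weights are zero, the entropy drops well under 1 bit/weight,
# so a sparse encoding (e.g. CSR over the nonzeros plus one sign bit each)
# can in principle beat 1 bit per parameter.
print(entropy_bits([0.05, 0.90, 0.05]))
```

The catch is that CSR index arrays cost extra bits per *nonzero*, so the sparsity has to be high before the sparse format actually wins over dense 1.58-bit packing.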
Misleading title, but this is pretty exciting. Interesting how this is based on llama.cpp. It's nice to see some momentum since they released the paper in 2023.
Anyone know how hard it would be to create a 1-bit variant of one of the recent Qwen 3.5 models?
Why would they film a demo video of it spewing out barely-coherent rambling repetitive drivel? If your model sucks at writing essays, maybe just tell us that, and film a demo of it doing something it IS good at?
I think the README [1] for the new CPU feature is of more interest, showing linear speedups with number of threads. Up to 73 tokens/sec with 8 threads (64 toks/s for their recommended Q6 quant):
https://github-production-user-asset-6210df.s3.amazonaws.com...
demo shows a huge love for water, this AI knows its home
It might interest you to know that one or two months ago, I had Claude port BitNet from the reference implementation to WebGPU, so that it runs right in your browser as a local model. After some debugging the port seemed to work, but the model didn't function as well as the reference implementation, so I'll have to work on it for a while. You can see a debugging session livestreamed here[1]. The released model file was about a gigabyte, so it fits in most people's GPUs. We were also able to successfully fine-tune it right in the browser.
There's a lot that you can do when the model size is that small, yet still powerful.
Our next step is to put up a content distribution network for it, where people can also share the diffs for their own fine-tuned models. I'll post the project if we finish all the parts.
[1] https://www.youtube.com/live/x791YvPIhFo?is=NfuDFTm9HjvA3nzN
No 100b model.
My disappointment is immeasurable and my day is ruined.
The title is misleading: there's no trained 100B model, just an inference framework that claims to handle one. But the engineering is worth paying attention to. I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck. The 1.58-bit approach is interesting because ternary weights turn matmuls into additions, a fundamentally different compute profile on commodity CPUs. If 5-7 tok/s on a single CPU for 100B-class models is reproducible, that's a real milestone for on-device inference. The framework is ready; now we need someone to actually train the model.
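To illustrate the matmul-to-additions point, here's a toy sketch of the idea (my own illustration, not bitnet.cpp's actual kernel, which uses packed weights and lookup tables):

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product where W has entries in {-1, 0, +1}.

    With ternary weights a dot product needs no multiplies at all:
    add the activations where w = +1, subtract them where w = -1,
    skip them where w = 0.
    """
    pos = (x * (W == 1)).sum(axis=1)   # sum of activations under +1 weights
    neg = (x * (W == -1)).sum(axis=1)  # sum of activations under -1 weights
    return pos - neg

# Sanity check against an ordinary matmul.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix
x = rng.standard_normal(8)             # fp activations
assert np.allclose(ternary_matvec(W, x), W @ x)
```

In a real kernel the adds/subtracts are done on packed 2-bit (or table-encoded) weights with SIMD, plus a per-tensor scale factor, but the core trade is the same: you exchange multiply-accumulate units for plain accumulation, and the weight traffic through memory shrinks by roughly 10x versus fp16.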