I was a little confused by this part:
"This is what's happening to the parameters of models when they're quantized down to sizes that are possible to run on your laptop. Instead of floats, small integers are what get stored and loaded into memory. When the time comes to use the quantized values, to generate an answer to a question for example, the values are dequantized on the fly. You might think this sounds slower, but we'll see later on that this actually ends up being faster as well as smaller."
I thought that most GPUs supported floating point math in these quantized formats, i.e. they can natively do math on a float4 number (maybe packed, 2 float4s into a single byte, or more probably 16 float4s in an 8-byte array, or maybe something even bigger).
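(For what it's worth, the "2 float4s into a single byte" packing is straightforward to sketch in software; the helper names here are my own, and the 4-bit values are just raw codes, not any particular float4 format:)

```python
import numpy as np

def pack_nibbles(codes):
    # Two 4-bit codes (values 0..15) packed into each uint8:
    # the first code goes in the high nibble, the second in the low nibble.
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0 and codes.max() < 16
    return (codes[0::2] << 4) | codes[1::2]

def unpack_nibbles(packed):
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.array([1, 15, 0, 7], dtype=np.uint8)
packed = pack_nibbles(codes)  # 2 bytes instead of 4
assert (unpack_nibbles(packed) == codes).all()
```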
Am I getting this wrong - is it instead that the GPU pulls in the quantized numbers and then converts them back into 32-bit or 64-bit floats to actually run through the ALUs on the GPU? (And the memory bandwidth savings make up for the extra work of converting them back into 32-bit numbers once they're on the GPU?)
Or is it some weird hybrid, like there is native support for float8 and Bfloat16, but if you want to use float2 you have to convert it to float4 or something the hardware can work with.
I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.
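(To make the question concrete, the software path the book's quote describes - store small integers, dequantize on the fly before the math - looks roughly like this. A toy absmax int8 scheme, not the author's actual code:)

```python
import numpy as np

def quantize_absmax_int8(weights):
    # Store one float scale per tensor plus int8 values.
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # The "on the fly" reconstruction before the values are used.
    return q.astype(np.float32) * scale

w = np.array([0.02, -1.3, 0.7, 0.0], dtype=np.float32)
q, s = quantize_absmax_int8(w)
w_hat = dequantize(q, s)
# w_hat is close to w (rounding error is at most half a scale step),
# but q is stored in a quarter of the memory, plus one scale factor.
```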
> I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.
I might be wrong, but I think LLMs are largely about comparing distances between tokens. You can tell that -255 and +255 are very far apart, but you are also aware that -8 and +8 are very far apart.
Microsoft Bitnet and Google TurboQuant show that, in the extreme, you can use just -1, 0, +1.
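(A toy sketch of that extreme case - ternary weights via absmean scaling, in the spirit of BitNet but my own simplified version, not the actual code:)

```python
import numpy as np

def quantize_ternary(weights):
    # Scale by the mean absolute value, then round every
    # weight to -1, 0, or +1.
    scale = np.abs(weights).mean() + 1e-8
    q = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return q, scale

w = np.array([0.4, -0.05, 1.2, -0.9], dtype=np.float32)
q, s = quantize_ternary(w)
# q holds only values from {-1, 0, +1}: a matmul against it needs
# no multiplications at all, just adds, subtracts, and skips.
assert set(np.unique(q)) <= {-1, 0, 1}
```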
Very old GPUs had support only down to FP16, which is useful in graphics applications.
Then support for Bfloat16 and for INT8 was added, which are not useful for much besides AI/ML applications. Then support for FP8 was added. Even smaller formats are supported only on some very recent GPUs.
If you have a recent enough GPU, it might support something like float2 or float4, but on an older GPU you must convert the short format to the next bigger format that is supported before performing any operations.
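(That widening conversion is just bit manipulation. A sketch of decoding an FP8 E4M3 value - 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits - to a Python float, handling normal numbers only and ignoring subnormals and NaN for brevity:)

```python
def fp8_e4m3_to_float(bits):
    # 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    assert exp != 0, "subnormals not handled in this sketch"
    # (1 + mant/8) is the implicit-leading-one significand.
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

# 0b0_0111_000: sign 0, biased exp 7 (unbiased 0), mantissa 0 -> 1.0
assert fp8_e4m3_to_float(0b00111000) == 1.0
# 0b1_1000_100: sign 1, biased exp 8 (unbiased 1), mantissa 4 -> -1.5 * 2 = -3.0
assert fp8_e4m3_to_float(0b11000100) == -3.0
```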
Hardware support will vary widely, as will speed on these smaller FP formats, which is sometimes intentionally nerfed in consumer cards.
Lots of devices with embedded "AI accelerators" will also only do things like INT8, and for some reason INT8 is generally worse than the same size FP8 (maybe that could be fixed with smarter quantization).
> which I ran both on a MacBook Pro M1 Max and a rented H100 SXM GPU

Your understanding is correct. The key detail is that the author used an M1 Max and an H100 for their testing.

M1 Max: FP16 hardware support; FP8 and Bfloat16 emulated in software (via dequantization)

H100: FP16 and FP8 hardware support