I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:
• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41 s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55 s
These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill (the arithmetic is sketched below).
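For what it's worth, the arithmetic itself is just input tokens divided by prefill throughput. Here is a minimal Python sketch of that division, using the throughput ranges GPT quoted above, which are assumptions rather than measurements:

```python
# Prefill time estimate: time = input_tokens / prefill_throughput.
# The throughput ranges below are the unverified figures quoted above,
# not benchmarked numbers; treat the outputs as illustrative only.

INPUT_TOKENS = 8_192

# (label, low tokens/sec, high tokens/sec) -- assumed, not measured
configs = [
    ("16x H100", 20_000, 80_000),
    ("2x Mac Studio (M3 Max)", 150, 700),
]

for label, low_tps, high_tps in configs:
    worst = INPUT_TOKENS / low_tps    # slowest assumed throughput
    best = INPUT_TOKENS / high_tps    # fastest assumed throughput
    print(f"{label}: {best:.2f}s to {worst:.2f}s")
```

Running it reproduces the ranges above (≈0.10 to 0.41 s and ≈12 to 55 s); whether those throughput figures reflect reality is exactly what the replies below dispute.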
You do realize that's entirely made up, right?
It could happen to be true, or it could be fake; the only thing we can be sure of is that it was made up with no basis in reality.
This is not how you use LLMs effectively; it's how you give everyone who uses them a bad name by association.