Hacker News

embedding-shape, today at 10:12 AM

I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.


Replies

zozbot234, today at 10:19 AM

Prompt processing (prefill) can most likely even get some speedup from using the local NPU: when you're ultimately limited by thermal or power throttling, having more efficient compute available means more headroom.

Barathkanna, today at 10:20 AM

I asked GPT for a rough estimate of prompt prefill time on an 8,192-token input.

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41 s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55 s

These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
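
For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch. It assumes prefill time is simply prompt tokens divided by prefill throughput, and it reuses the throughput ranges quoted above, which are GPT-supplied guesses rather than measurements. The function and variable names are made up for illustration.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_sec: float) -> float:
    """Prefill time under the simple model: tokens / throughput."""
    return prompt_tokens / tokens_per_sec


def time_range(prompt_tokens: int, low_tps: float, high_tps: float) -> tuple[float, float]:
    """Return (best, worst) prefill time for a throughput range."""
    # The best (shortest) time comes from the highest throughput.
    return (
        prefill_seconds(prompt_tokens, high_tps),
        prefill_seconds(prompt_tokens, low_tps),
    )


def speedup_range(fast: tuple[float, float], slow: tuple[float, float]) -> tuple[float, float]:
    """Return (min, max) speedup of `fast` over `slow`, each given as (low_tps, high_tps)."""
    # Smallest gap: slowest "fast" system vs. fastest "slow" system, and vice versa.
    return fast[0] / slow[1], fast[1] / slow[0]


PROMPT_TOKENS = 8192

# Assumed throughput ranges in tokens/sec, taken from the estimate above (not measured).
THROUGHPUT = {
    "16x H100": (20_000, 80_000),
    "2x Mac Studio": (150, 700),
}

if __name__ == "__main__":
    for name, (low, high) in THROUGHPUT.items():
        best, worst = time_range(PROMPT_TOKENS, low, high)
        print(f"{name}: {best:.2f} to {worst:.2f} s")

    lo, hi = speedup_range(THROUGHPUT["16x H100"], THROUGHPUT["2x Mac Studio"])
    print(f"speedup: {lo:.0f}x to {hi:.0f}x")
```

Comparing the extremes of the two ranges gives a fairly wide speedup band, so "~100×" should be read as an order-of-magnitude figure rather than a precise ratio.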
