speedgoose • yesterday at 9:14 PM • 1 reply

Time to first token is a very important performance metric, as I found out using a Mac Studio M3 Ultra (which is quite slow in this respect).

But 32 GB at a TDP of 230 W is perhaps not that interesting, especially since you probably want more than one card. That's a lot of heat. You could use the cards to heat a building, but heat pumps exist.

Replies

bigyabai • yesterday at 9:28 PM

A lot of the TDP is reserved for running the shader units at full power. My RTX 3070 Ti only pulls ~110 W of its 320 W running CUDA inference on Gemma 26b and E4B.
