A lot of the TDP budget is there for running the shader units at full power. My RTX 3070 Ti only pulls ~110 W of its 320 W running CUDA inference on Gemma 26b and E4B.
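For what it's worth, you can watch the reported draw against the card's enforced power limit straight from NVML. A minimal sketch, assuming the pynvml bindings (pip install nvidia-ml-py) and GPU index 0:

```python
# Quick check of live board power vs. the enforced power limit.
# Assumption: pynvml is installed and the card of interest is GPU index 0.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # NVML reports milliwatts
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # current power cap

print(f"power draw: {draw_w:.0f} W of {limit_w:.0f} W limit")
pynvml.nvmlShutdown()
```

Run that while the model is generating and you'll see whether the card is anywhere near its cap.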
The B70 idles at 30 W, while the RTX PRO 4500 idles at 9 W (measured at 5 W at the wall).
The B70 runs at 1/3 the token output rate of the RTX PRO 4500 and consumes 3x the idle power when doing nothing.
My 4070 Super and 5070 Super both max out their TDP when I use them with Ollama; is your usage different?
My 5090 runs at full TDP (pretty much exactly 575 W) when running inference through LM Studio.
It's not that it's reserving power; rather, you hit some bottleneck on a 3070 Ti before running into thermal limits. It's likely limited by either tensor core saturation or memory bandwidth. Running the workload under Nvidia's profiling tools should make the bottleneck obvious.
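Short of a full Nsight run, even sampling the NVML utilization counters while the model is generating usually tells you which side is pegged. A rough sketch, again assuming pynvml and GPU index 0 (not a substitute for a real profiler):

```python
# Rough bottleneck sniff test: sample SM utilization, memory-controller
# utilization, and power once a second while inference runs in another process.
# Assumption: pynvml is installed and the card of interest is GPU index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(30):  # ~30 seconds of samples
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    print(f"SM {util.gpu:3d}%  mem {util.memory:3d}%  {power_w:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```

If the memory-controller utilization sits near 100% while SM utilization stays well below it, you're bandwidth-bound, which would also explain the low power draw.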