When it was released, the Intel Arc B70 could only produce about a third of the token rate of an RTX PRO 4500. Then again, it also cost about a third as much.
It lacked software support for its primary target application: running LLMs. The officially supported vLLM fork is six versions behind mainline. It didn't run the latest hot new open models on Hugging Face. Running two B70s in parallel reduced the token rate instead of improving it. In short, the software behind the B70 is simply far behind.
There are nonlinearities to exploit in that calculus. If the card has enough VRAM to host the larger model you're targeting, model size alone can push you past the usability threshold at a much better price.
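Here is a minimal sketch of that VRAM arithmetic in Python (the 20% overhead factor and the example model sizes are my own illustrative assumptions, not measurements):

    # Back-of-the-envelope VRAM sizing. The overhead factor is an
    # assumed allowance for KV cache and activations.
    def vram_needed_gb(params_billions, bits_per_weight, overhead=1.2):
        weight_bytes = params_billions * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 1e9

    print(f"{vram_needed_gb(70, 4):.0f} GB")  # 70B at 4-bit: ~42 GB
    print(f"{vram_needed_gb(70, 8):.0f} GB")  # 70B at 8-bit: ~84 GB

The point: at the same budget, a card with more VRAM can hold the quantized 70B model entirely on-device, while a faster card with less VRAM has to offload and falls off a performance cliff.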
What you say is not consistent with TFA.
The parent article shows that the B70 is faster than the RTX 4000.
The RTX 4500 is faster than the RTX 4000, but it cannot be 3 times faster; it is not even 2 times faster.
The parent article is therefore consistent with the RTX 4500 being faster than the B70 for ML inference, but by a much smaller ratio, e.g. less than 50% faster.
If you know otherwise, please point to the source.
If you have run a benchmark yourself, please describe the exact conditions.
In the llama.cpp benchmarks shown at Phoronix, relative performance was extremely variable across LLMs: for some LLMs the B70 was faster than an RTX 4000, but for others it was significantly slower.
Your 3x performance ratio may be true for a particular LLM with a certain quantization, but false for other LLMs or other quantizations.
This performance variability may be caused by immature software support for the B70. For instance, instead of using matrix operations (the XMX engines), non-optimized software might fall back to traditional vector operations, which are slower.
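As an analogy only (this is NumPy on a CPU, not Intel GPU code), the kind of gap between a tuned matrix kernel and a naive per-element path looks like this:

    import time
    import numpy as np

    n = 512
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    # Naive path: one explicit dot product per output element.
    t0 = time.perf_counter()
    c_naive = np.empty((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            c_naive[i, j] = a[i, :] @ b[:, j]
    naive_s = time.perf_counter() - t0

    # Optimized path: one call into a tuned GEMM kernel.
    t0 = time.perf_counter()
    c_fast = a @ b
    fast_s = time.perf_counter() - t0

    print(f"naive {naive_s:.2f}s vs tuned {fast_s:.4f}s"
          f" ({naive_s / fast_s:.0f}x slower)")

On a GPU, a similar gap opens up between plain vector ALU code and kernels that actually feed the XMX engines.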
It is also possible that, for optimum performance with a certain LLM, one may need to choose a different quantization for the B70 than for NVIDIA, because for sub-16-bit number formats Intel supports only integers.
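For reference, integer quantization of the kind such an int-only path relies on is just a scale-and-round round trip; a minimal NumPy sketch (toy tensor and numbers of my own choosing):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weights

    scale = np.abs(w).max() / 127.0             # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_hat = q.astype(np.float32) * scale        # dequantize

    print(f"max err {np.abs(w - w_hat).max():.2e}")

Unlike a floating-point format, the int8 path has no per-value exponent, which is one reason the best quantization choice on one vendor's hardware may not be the best on another's.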