
adrian_b · yesterday at 7:55 PM

Vector or matrix instructions do not improve single-thread speed in the strict sense of the term, because they cannot improve the speed of a program that executes a sequence of dependent operations.

Their purpose is to provide parallel execution at a lower cost in die area and with better energy efficiency than multiplying the number of cores. For instance, 16 cores with 8-wide vector execution units provide the same throughput as 128 scalar cores, but at much lower power consumption and in a much smaller die area. However, both structures need groups of 128 independent operations every clock cycle to keep all execution units busy.

The terms "single-thread" performance vs. "multi-threaded" performance are not really correct.

What matter are the two performance values that characterize a CPU: its performance when executing a set of independent operations vs. a set of operations that are functionally dependent, i.e. where the result of each operation is an operand of the next operation.

When executing a chain of dependent operations, the performance is determined by the sum of the latencies of the operations, and it is very difficult to improve it other than by raising the clock frequency.

On the other hand, when the operations are independent, they can be executed concurrently and with enough execution units the performance may be limited only by the operation with the longest duration, no matter how many other operations are executed in parallel.

For parallel execution there are many implementation methods, which are used together because most of them have limits on the maximum multiplication factor, caused by constraints such as the lengths of the interconnect traces on the silicon die.

So some of the concurrently executed operations are executed in different stages of an execution pipeline, others in different execution pipelines (superscalar execution), others in different SIMD lanes of a vector execution pipeline, others in different CPU cores of the same CPU complex, others in CPU cores located on separate dies in the same package, others in CPU cores located in a different socket on the same motherboard, others in CPU cores located in other chassis in the same rack, and so on.

Instead of the terms "single-thread performance" and "multi-threaded performance" it would have been better to talk about performance for dependent operations and performance for independent operations.

There is little, if anything, that a programmer can do to improve the performance of a chain of dependent instructions; that is determined by the design and fabrication of the CPU.

On the other hand, either the compiler or the programmer must ensure that the possibility of executing operations in parallel is exploited to the maximum extent possible. The means for this include creating multiple threads, which will be scheduled on different CPU cores; using the available SIMD instructions; and interleaving any chains of dependent instructions, so that adjacent instructions are independent and can execute in different pipeline stages or in different execution pipelines. Most modern CPUs use out-of-order execution, so the exact order of interleaved dependent instructions is not critical, because the CPU will reorder them; but some interleaving done by the compiler or the programmer is still necessary, because the hardware has a limited instruction window within which reordering is possible.


Replies

nurettin · today at 4:50 AM

Of course I know that bulk operations aren't the same as clock speed and they sit in a weird place. My point was that their existence is evidence of needing performance on a single core.

There is also an argument to be made for re-architecting code to require fewer or zero branching instructions and cramming everything into vector operations.