Asooka • yesterday at 9:59 PM

My one small nitpick is that computing a vector's length is usually just 2 instructions with SSE4.1:

    dpps xmm0, xmm0, 0x71 ; dot product of lanes 0-2 (high nibble 0x7), write lane 0 (low nibble 0x1)
    sqrtss xmm0, xmm0
    ret

It is also considerably faster than the fancy version, mainly because Intel still hasn't given us a horizontal-max vector instruction! ARM is a bit better in that regard with its fancy vmaxvq_f32 and vmaxnmvq_f32 intrinsics...
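For anyone who would rather not write the assembly by hand, here is a rough C sketch of the same two-instruction idea using intrinsics. It assumes GCC or Clang (for the `target` attribute) and a CPU that actually supports SSE4.1. The names `length3_dpps` and `length3_selfcheck` are made up for illustration, and the non-x86 branch is only a plain scalar fallback, not an equivalent of the ARM intrinsics mentioned above.

```c
#include <assert.h>
#include <math.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: length of (x, y, z) via dpps + sqrtss.
 * Assumes the CPU supports SSE4.1; there is no runtime check here. */
__attribute__((target("sse4.1")))
float length3_dpps(float x, float y, float z)
{
    /* _mm_set_ps takes lanes high-to-low, so lane 0 = x, 1 = y, 2 = z, 3 = 0. */
    __m128 v = _mm_set_ps(0.0f, z, y, x);

    /* imm8 = 0x71: the high nibble (0x7) selects lanes 0-2 for the products,
     * the low nibble (0x1) writes the sum to lane 0 and zeroes the rest. */
    __m128 d = _mm_dp_ps(v, v, 0x71);

    /* Square root of lane 0 only, then extract it as a scalar float. */
    return _mm_cvtss_f32(_mm_sqrt_ss(d));
}
#else
/* Scalar fallback for non-x86 targets (an assumption, not a tuned version). */
float length3_dpps(float x, float y, float z)
{
    return sqrtf(x * x + y * y + z * z);
}
#endif

/* Usage sketch: 3*3 + 4*4 + 12*12 = 169, and sqrt(169) = 13 exactly. */
void length3_selfcheck(void)
{
    assert(length3_dpps(3.0f, 4.0f, 12.0f) == 13.0f);
}
```

Whether the compiler really emits just `dpps` and `sqrtss` depends on how the vector reaches the function; if the three floats are passed as scalars, as they are here, expect a few extra shuffles to pack them into one register first.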