
Asooka, yesterday at 9:59 PM

My one small nitpick is that vector length is usually 2 instructions with SSE4.1:

    dpps xmm0, xmm0, 0x71 ; dot product of lanes 0-2, result written to lane 0
    sqrtss xmm0, xmm0
    ret
It is also considerably faster than the fancy version, mainly because Intel still hasn't given us a horizontal-max vector instruction! ARM is a bit better in that regard with its vmaxvq_f32 and vmaxnmvq_f32 intrinsics...
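
For anyone unsure how the dpps immediate is laid out, here is a rough scalar model in C. It is only a sketch: `dpps_model` and `length_squared3` are made-up names, and the real instruction may add the products in a different order. The high nibble of the immediate picks which lane products go into the sum, the low nibble picks which output lanes receive it, so "sum lanes 0-2, write lane 0" comes out as 0x71.

```c
#include <assert.h>

/* Scalar model of the SSE4.1 dpps immediate (hypothetical helper, not the
   real instruction). Bits 4..7 of imm8 choose which lane products enter the
   sum; bits 0..3 choose which output lanes receive the sum. Unselected
   output lanes are set to zero. */
void dpps_model(const float a[4], const float b[4], unsigned imm8, float out[4])
{
    assert(imm8 <= 0xFFu); /* the immediate is a single byte */

    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (imm8 & (0x10u << i)) {
            sum += a[i] * b[i];
        }
    }
    for (int i = 0; i < 4; ++i) {
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
}

/* Squared length of the xyz part of v, i.e. what lane 0 would hold after a
   dpps with immediate 0x71 and before the sqrtss. The w lane is ignored. */
float length_squared3(const float v[4])
{
    float out[4];
    dpps_model(v, v, 0x71u, out);
    return out[0];
}
```

With v = (1, 2, 2, 100), the 0x71 mask gives 9 in lane 0 and ignores w, while swapping the nibbles (0x17) only squares lane 0 and copies that into lanes 0-2.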