Hacker News

Writing string.h functions using string instructions in asm x86-64 (2025)

66 points by thaisstein, last Friday at 4:22 PM | 7 comments

Comments

ack_complete today at 2:25 AM

The REP MOVS family of instructions has an interesting history, because the advantages and disadvantages of microcode have shifted relative to hand-written code with each CPU generation. It has long been great for large aligned copies, since the microcode has access to cache-line-wide copies, but until recently it struggled with small copies. Apparently, one of the reasons is a lack of branch prediction in microcode:

https://stackoverflow.com/questions/33902068/what-setup-does...
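For readers who have not used the instruction directly, here is a minimal sketch of a copy built on REP MOVSB. The helper name `copy_rep_movsb` is hypothetical and not from the article. It assumes GCC or Clang extended asm on x86-64, non-overlapping buffers, and a clear direction flag (which the System V ABI guarantees at function entry); on other targets it simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the article): forward copy of n bytes.
 * RDI = destination, RSI = source, RCX = byte count, DF assumed clear. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* All three registers are modified by the instruction, hence "+". */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);   /* portable fallback, assumption for non-x86 builds */
#endif
    return ret;
}
```

Whether this beats a vectorised loop depends on the copy size and the CPU generation, which is the point of the comment above; it should be measured rather than assumed.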

Non-temporal stores are tricky performance-wise. They can be dramatically faster than normal stores (~3x), they may be faster on some generations of CPUs than on others, they may be slower if subsequent code needs the destination in the CPU cache, and even for GPUs they may not be ideal if an iGPU is sharing part of the cache hierarchy with the CPU. But the worst issue is that occasionally a specific CPU will have some random pathological behavior with them. IIRC, masked non-temporal stores were horrifically slow on some AMD APUs, on the order of hundreds to thousands of cycles per instruction. I find it hard to recommend them much anymore.
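To make the term concrete, this is a rough sketch of a copy loop that uses non-temporal (streaming) stores through the SSE2 intrinsics. The function name `copy_nontemporal` and the 16-byte alignment requirement on the destination are assumptions chosen for the example; nothing here is taken from the article, and on non-x86-64 targets the sketch falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 (pulls in _mm_sfence) */
#endif

/* Hypothetical helper: copy n bytes to a 16-byte-aligned destination,
 * bypassing the cache for the bulk of the writes. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0);   /* MOVNTDQ needs an aligned destination */
#if defined(__x86_64__)
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_stream_si128((__m128i *)(d + i), v);   /* non-temporal 16-byte store */
    }
    _mm_sfence();                    /* order the streaming stores before later stores */
    memcpy(d + i, s + i, n - i);     /* tail of fewer than 16 bytes, ordinary stores */
#else
    memcpy(dst, src, n);             /* portable fallback, assumption for non-x86 builds */
#endif
}
```

If the destination is read again soon after the copy, ordinary stores are likely the better choice, as the comment above notes.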

jamesfinlayson today at 12:21 AM

Not sure what Visual Studio has done over the years, but I remember decompiling Gearbox's utilities .dll in James Bond 007 Nightfire (2002), and it appeared to have a bunch of string manipulation functions written using these instructions.

themafia yesterday at 8:53 PM

    vpcmpestri xmm2, xmm3, BYTEWISE_CMP
    test cx, 0x10    ; if(rcx != 16)

I see this test/cmp all the time after the instruction and I don't understand it. pcmpestri will set ZF if edx < 16, and it will set SF if eax < 16. It is already giving you the necessary status. Also, testing sub-words of the larger register is very slow and is a pipeline hazard.

You've got this monster of an instruction and then people place all this paranoid slowness around it. Am I reading the x86 manual wrong?
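The flag behaviour described above can be sketched as a small software model. This is not the article's code: `model_pcmpestri_ne` and `struct pcmpestri_result` are hypothetical names, and the model assumes that `BYTEWISE_CMP` selects unsigned bytes, "equal each" aggregation, negative polarity, and least-significant index (the article's actual definition is not shown in this thread). Under that assumption ECX is the index of the first mismatching byte, or 16 if there is none.

```c
#include <assert.h>
#include <stdint.h>

struct pcmpestri_result {
    int ecx;  /* index of the first mismatch, or 16 if none */
    int cf;   /* 1 if any mismatch was found */
    int zf;   /* 1 if |EDX| < 16 (second operand ends inside the block) */
    int sf;   /* 1 if |EAX| < 16 (first operand ends inside the block) */
};

/* Explicit lengths use the absolute value, saturated to 16 bytes. */
static int clamp_len(int32_t len)
{
    int64_t a = len < 0 ? -(int64_t)len : (int64_t)len;
    return a > 16 ? 16 : (int)a;
}

/* Model only; eax is the length of a, edx the length of b. */
static struct pcmpestri_result model_pcmpestri_ne(const uint8_t a[16], int32_t eax,
                                                  const uint8_t b[16], int32_t edx)
{
    int la = clamp_len(eax), lb = clamp_len(edx);
    struct pcmpestri_result r = { 16, 0, lb < 16, la < 16 };
    for (int i = 0; i < 16; i++) {
        int va = i < la, vb = i < lb;
        /* "Equal each": two invalid bytes count as equal, one invalid byte as unequal. */
        int equal = (va && vb) ? (a[i] == b[i]) : (!va && !vb);
        if (!equal) { r.ecx = i; r.cf = 1; break; }
    }
    return r;
}
```

In this model a loop can branch on CF for "mismatch found" and on ZF or SF for "a string ended in this block", without a separate test of ECX.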

userbinator today at 2:31 AM

I do wish Intel would make the other string instructions faster, just like they did with MOVS, because the alternatives are so insanely bloated.

> it is never used with a prefix (the value would be overwritten for each repetition)

...which is still useful for extreme size-optimisation; I remember seeing "rep lodsb" in a demo, as a slower-but-tiny (2 bytes) way of [1] adding cx to si, [2] zeroing cx, [3] putting the byte at [cx + si - 1] into al, and [4] conditionally leaving al and si unchanged if cx is 0, all effectively as a single instruction. Not something any optimising compiler I know of would be able to do, but perhaps within the possibility of an LLM these days.
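The four effects listed above can be checked against a small model of the instruction. This is a sketch under stated assumptions: 16-bit registers, direction flag clear, and a flat 64 KiB memory with segmentation ignored. `struct regs16` and `model_rep_lodsb` are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Only the registers that REP LODSB touches. */
struct regs16 {
    uint16_t cx;
    uint16_t si;
    uint8_t  al;
};

/* Model of 16-bit `rep lodsb` with DF = 0; mem must cover 65536 bytes. */
static void model_rep_lodsb(struct regs16 *r, const uint8_t *mem)
{
    assert(r != 0 && mem != 0);
    while (r->cx != 0) {               /* cx == 0: nothing executes, al and si untouched */
        r->al = mem[r->si];            /* each load overwrites al; only the last one survives */
        r->si = (uint16_t)(r->si + 1); /* si advances by one per iteration */
        r->cx = (uint16_t)(r->cx - 1); /* cx counts down to zero */
    }
}
```

After the loop, si has grown by the original cx, cx is zero, and al holds the byte at the original [cx + si - 1], which matches the description in the comment.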
