I do wish Intel would make the other string instructions faster, just like they did with MOVS, because the alternatives are so insanely bloated.
it is never used with a prefix (the value would be overwritten for each repetition)
...which is still useful for extreme size-optimisation; I remember seeing "rep lodsb" in a demo, as a slower-but-tiny (2 bytes) way of [1] adding cx to si, [2] zeroing cx, [3] putting the byte at [cx + si - 1] into al, and [4] conditionally leaving al and si unchanged if cx is 0, all effectively as a single instruction. Not something any optimising compiler I know of would be able to do, but perhaps within the possibility of an LLM these days.
Just a single-instruction AVX-512 loop outperforms movsb by 10x on my computer.
I’m on a Ryzen 7600X. It’s just an example showing that a fast copy doesn’t need to trash the instruction cache or have 10 loops behind conditionals.