
kstenerud · today at 1:21 AM

Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...
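For contrast, here is a minimal sketch of a typical unsigned LEB128 decoder (not any particular library's code) showing the data-dependent branch taken on every byte:

```c
#include <stdint.h>
#include <stddef.h>

/* Classic unsigned LEB128 decode: each byte carries 7 payload bits, and its
   top bit says whether another byte follows. The loop's continue/stop
   decision depends on that bit, so there is one unpredictable branch per
   byte of the encoded value. */
static uint64_t leb128_decode(const uint8_t *p, size_t *len_out) {
    uint64_t result = 0;
    unsigned shift = 0;
    size_t i = 0;
    uint8_t byte;
    do {
        byte = p[i++];
        result |= (uint64_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);   /* the per-byte decision point */
    *len_out = i;
    return result;
}
```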

  | Header     | Total Bytes | Payload Bits |
  | ---------- | ----------- | ------------ |
  | `.......1` |      1      |       7      |
  | `......10` |      2      |      14      |
  | `.....100` |      3      |      21      |
  | `....1000` |      4      |      28      |
  | `...10000` |      5      |      35      |
  | `..100000` |      6      |      42      |
  | `.1000000` |      7      |      49      |
  | `10000000` |      8      |      56      |
  | `00000000` |      9      |      64      |
The full value is stored little-endian, so you simply read the first byte (the low byte) in the stream to get the total length, and it has the exact same compactness as VLQ/LEB128 (7 payload bits per byte).
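As a sketch of the encoding side (hypothetical helper, not the actual ksbonjson code): for an n-byte field, shift the payload left by n and OR in the unary header, then write the result little-endian. Assumes a little-endian host and n between 1 and 8.

```c
#include <stdint.h>
#include <string.h>

/* Write `value` as an n-byte length field (1 <= n <= 8; value must fit in
   7*n bits). The low n bits hold the unary header `1 << (n-1)`, and the
   payload sits above it. Assumes a little-endian host; only the first n
   bytes of `out` are meaningful. */
static void encode_length_field(uint64_t value, unsigned n, uint8_t out[8]) {
    uint64_t word = (value << n) | (1ULL << (n - 1));
    memcpy(out, &word, 8);
}
```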

Even better: modern chips have a count-trailing-zeros instruction that extracts the total byte count in one shot (callable via a builtin):

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

    static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
        /* Caller must handle header == 0 separately: __builtin_ctz(0) is undefined. */
        return (size_t)__builtin_ctz(header) + 1;
    }
After running this builtin, you simply re-read the memory location for the specified number of bytes, cast to a little-endian integer, then shift right by that same byte count (the header costs one bit per byte) to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56-bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).
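That branchless path can be sketched as follows (hypothetical helper name, not the actual ksbonjson code). It assumes GCC/Clang builtins, a little-endian host, at least 8 readable bytes at `buf`, and a nonzero header byte (i.e. 1-8 total bytes, up to 56 payload bits):

```c
#include <stdint.h>
#include <string.h>

/* Branchless decode for headers 0x01..0xFF (the all-zero 9-byte header is
   excluded). Reads 8 bytes unconditionally, masks off the bytes beyond the
   field, then drops the unary header bits. */
static uint64_t decode_length_field(const uint8_t *buf) {
    unsigned n = (unsigned)__builtin_ctz(buf[0]) + 1; /* total bytes, 1..8 */
    uint64_t word;
    memcpy(&word, buf, 8);   /* compiles to a single unaligned load */
    word <<= 64 - 8 * n;     /* discard bytes past the field... */
    word >>= 64 - 8 * n;     /* ...(shift by 0 is fine when n == 8) */
    return word >> n;        /* drop the n unary header bits */
}
```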

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.

https://github.com/kstenerud/bonjson

If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you'd lose a number of the speed benefits of a little-endian-friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world, where everything is out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.


Replies

sparkie · today at 2:39 AM

We can do better than one branch per byte - we can get it down to one branch per 8 bytes, at least.

We'd use `vpmovb2m`[1] on a ZMM register (64 bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.

Then process the mask register 1 byte at a time, using it as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, and finish with a conditional branch to the next entry in the jump table or to the next 64 bytes. Any trailing ones in each mask byte would simply be added to a carry, to be consumed up to the most significant zero in the next eightbyte.
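The MSB-gathering step can be sketched portably, one eightbyte at a time, with a SWAR multiply (hypothetical helper, little-endian host assumed); `vpmovb2m` (intrinsic `_mm512_movepi8_mask`) does the same for all 64 bytes in a single instruction:

```c
#include <stdint.h>
#include <string.h>

/* Gather the MSB of each of 8 bytes into an 8-bit mask (bit i = MSB of
   byte i). Each set MSB sits at bit position 7 + 8i; multiplying by
   0x0002040810204081 shifts bit 7+8i up to bit 56+i with no colliding
   carries, so the top byte of the product is the mask. */
static uint8_t msb_mask8(const uint8_t *p) {
    uint64_t w;
    memcpy(&w, p, 8);                    /* little-endian host assumed */
    w &= 0x8080808080808080ULL;          /* keep only the MSBs */
    return (uint8_t)((w * 0x0002040810204081ULL) >> 56);
}
```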

[1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
