Hacker News

cornholio today at 12:07 AM

You can have a universal variable length field, for example 2 bytes for strings < 32768, then four bytes, 8 bytes etc. On the critical short string path, it costs just a single bit test. The glyph vs byte issues need to be dealt with in both formats.
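
A minimal sketch of such a variable-width length header, to make the "single bit test" concrete. The exact layout is an assumption (the comment does not specify one): big-endian, with the top bit of the first byte clear for the 2-byte form, and the top two bits selecting the 4-byte or 8-byte form otherwise. The names `vstr_len` and `vstr_put_len` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (big-endian, not taken from the comment above):
 *   0xxxxxxx xxxxxxxx        -> 15-bit length, 2-byte header
 *   10xxxxxx + 3 more bytes  -> 30-bit length, 4-byte header
 *   11xxxxxx + 7 more bytes  -> 62-bit length, 8-byte header
 */

/* Reads the length header at p and stores the header size in *hdr. */
static uint64_t vstr_len(const unsigned char *p, size_t *hdr)
{
    if ((p[0] & 0x80) == 0) {          /* short-string path: one bit test */
        *hdr = 2;
        return ((uint64_t)p[0] << 8) | p[1];
    }
    size_t n = (p[0] & 0x40) ? 8 : 4;  /* second bit picks 4 or 8 bytes */
    uint64_t len = p[0] & 0x3F;
    for (size_t i = 1; i < n; i++)
        len = (len << 8) | p[i];
    *hdr = n;
    return len;
}

/* Writes the smallest header that fits len (assumes len < 2^62).
 * Returns the header size in bytes. */
static size_t vstr_put_len(unsigned char *p, uint64_t len)
{
    size_t n;
    unsigned char tag;
    if (len < 0x8000)            { n = 2; tag = 0x00; }
    else if (len < (1ULL << 30)) { n = 4; tag = 0x80; }
    else                         { n = 8; tag = 0xC0; }
    for (size_t i = n; i-- > 0; ) {    /* fill from the low byte upward */
        p[i] = (unsigned char)(len & 0xFF);
        len >>= 8;
    }
    p[0] |= tag;
    return n;
}
```

The string bytes would start at `p + hdr`. Only strings of 32768 bytes or more ever take the second branch.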

The subdivision issue is a good perspective, but I would argue the performance impact of cloning substrings is dwarfed by the redundant full-string reads needed to find the length.


Replies

lelanthran today at 7:20 AM

> You can have a universal variable length field, for example 2 bytes for strings < 32768, then four bytes, 8 bytes etc.

To hold the length of a string, I'd do something similar to UTF-8:

7 bits for size + 1 bit for continuation, then 15 bits for size + 1 bit for continuation, then 23 bits for size + 1 bit for continuation, etc.

Or maybe even borrow the prefix scheme UTF-8 uses in its lead byte, so the first byte alone says how wide the field is:

    0XXX XXXX -> length of string is in those 7 bits
    10XX XXXX  XXXX XXXX -> length of string is in those 6+8 bits
    110X XXXX  XXXX XXXX  XXXX XXXX -> length of string is in those 5+8+8 bits
    ...
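
A rough sketch of how a prefix-coded length like this could be read and written. It assumes the prefixes are `0`, `10` and `110`, chosen so the first byte alone identifies the width; that leaves 7, 6+8 and 5+8+8 payload bits. The function names `len_decode` and `len_encode` are hypothetical, and only the first three widths are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed prefixes (first byte only, no UTF-8-style continuation markers):
 *   0xxxxxxx                    ->  7-bit length, 1 byte
 *   10xxxxxx xxxxxxxx           -> 14-bit length, 2 bytes
 *   110xxxxx xxxxxxxx xxxxxxxx  -> 21-bit length, 3 bytes
 */

/* Decodes a length at p. Returns bytes consumed, or 0 if the
 * prefix is one of the longer forms this sketch does not handle. */
static size_t len_decode(const unsigned char *p, uint32_t *len)
{
    if ((p[0] & 0x80) == 0x00) {
        *len = p[0];
        return 1;
    }
    if ((p[0] & 0xC0) == 0x80) {
        *len = ((uint32_t)(p[0] & 0x3F) << 8) | p[1];
        return 2;
    }
    if ((p[0] & 0xE0) == 0xC0) {
        *len = ((uint32_t)(p[0] & 0x1F) << 16) | ((uint32_t)p[1] << 8) | p[2];
        return 3;
    }
    return 0;
}

/* Encodes len in the shortest form. Returns bytes written,
 * or 0 if len needs more than 21 bits. */
static size_t len_encode(unsigned char *p, uint32_t len)
{
    if (len < (1u << 7)) {
        p[0] = (unsigned char)len;
        return 1;
    }
    if (len < (1u << 14)) {
        p[0] = (unsigned char)(0x80 | (len >> 8));
        p[1] = (unsigned char)(len & 0xFF);
        return 2;
    }
    if (len < (1u << 21)) {
        p[0] = (unsigned char)(0xC0 | (len >> 16));
        p[1] = (unsigned char)((len >> 8) & 0xFF);
        p[2] = (unsigned char)(len & 0xFF);
        return 3;
    }
    return 0;
}
```

Strings under 128 bytes pay one byte of overhead, the same as a terminator, at the price of up to three mask-and-compare tests on decode.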

> On the critical short string path, it costs just a single bit test.

A few more clock cycles compared to NUL-termination, although my alternatives above require even more clock cycles.

If the hardware had instructions for sentinel values, things would be easier (Like how DOS calls used '$' termination for strings) and safer.

Load a sentinel byte into a register and have dedicated copy and compare instructions that each take two addresses (src and dst) and copy (or compare) src/dst until the terminator is reached (with copy copying the sentinel as well).
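
For illustration, here is roughly what those two operations would do, written as plain C rather than as instructions. `copy_until` and `compare_until` are hypothetical names, and like the proposal itself the sketch has no length bound, so the caller must guarantee the sentinel is present and the destination is large enough.

```c
#include <stddef.h>

/* Copies src to dst up to and including the sentinel byte.
 * Returns the number of bytes written (sentinel included). */
static size_t copy_until(unsigned char *dst, const unsigned char *src,
                         unsigned char sentinel)
{
    size_t i = 0;
    do {
        dst[i] = src[i];
    } while (src[i++] != sentinel);
    return i;
}

/* Compares a and b byte by byte until they differ or both reach
 * the sentinel. Returns 0 if equal up to the sentinel, otherwise the
 * difference of the first mismatching bytes (raw byte order). */
static int compare_until(const unsigned char *a, const unsigned char *b,
                         unsigned char sentinel)
{
    size_t i = 0;
    while (a[i] == b[i]) {
        if (a[i] == sentinel)
            return 0;
        i++;
    }
    return (int)a[i] - (int)b[i];
}
```

With `'$'` as the sentinel, `copy_until(out, "Hello$junk", '$')` writes the six bytes `Hello$`. POSIX `memccpy` offers a bounded software version of the copy half.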

Considering that sentinel values are needed so often, and are so useful, it's surprising that this is not in any ISA. What we have now is kludgy workarounds in the HLL for this. It's hard to blame the HLL, because some workaround has to be implemented.

Parodper today at 1:03 PM

You could do 0xffff as a special case, and put another length+string/pointer after the 255th byte.