
ashirviskas, last Friday at 11:01 PM

Yep, and I find that this really hurts LLM performance. For example, `Ben,Alice` gets tokenized as `Ben|,A|lice`, and having to connect `lice` back to the name `Alice` doesn't make things any easier for the model. Formatting it as `Ben, Alice` instead tokenizes as `Ben|,| Alice`. I've found that just formatting the data a bit differently like this is a useful way to improve performance.
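A rough sketch of that reformatting step, in case it helps. The function name `pad_commas` is made up for illustration, and the code only normalizes the spacing. It does not run a tokenizer, so whether the split actually improves depends on the model's vocabulary and is worth checking against the tokenizer you use.

```python
import re


def pad_commas(text: str) -> str:
    """Put exactly one space after each comma that is followed by more data.

    Hypothetical helper: commas at the end of a line are left alone so no
    trailing whitespace is introduced.
    """
    # Match a comma plus any spaces/tabs after it, but only when a
    # non-whitespace character follows on the same line.
    return re.sub(r",[ \t]*(?=\S)", ", ", text)


if __name__ == "__main__":
    print(pad_commas("Ben,Alice"))
```

Note that this is a plain text substitution, so commas inside quoted CSV fields would get padded too. For real CSV input it would probably be safer to parse the rows first and join the fields with `", "`.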

I actually just started working on a data formatter that applies principles like these to drastically reduce the number of tokens without degrading performance the way other formats do (looking at you, tson).