Hacker News

notathrowaway51, today at 3:01 PM

Fun fact: under Unicode Normalization Form D (canonical decomposition, NFD), 8 of the 9 Polish letters with diacritics (ż, ó, ć, ę, ś, ą, ź, ń) break down into a base letter plus a combining diacritical mark, but ł stays intact. That means you can't rely on the remove_diacritics option of SQLite's unicode61 tokenizer to normalize Polish text for FTS.
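For anyone who wants to check the decomposition claim, here is a minimal sketch using Python's standard unicodedata module. It only looks at NFD itself, not at SQLite's tokenizer. The helper names (strip_combining_marks, fold_for_search) are made up for illustration, and the extra ł to l mapping is an assumption about what a search index might want, not something Unicode or SQLite prescribes.

```python
import unicodedata

# The nine Polish letters that carry diacritics.
POLISH_LETTERS = "ąćęłńóśźż"


def strip_combining_marks(text: str) -> str:
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_for_search(text: str) -> str:
    """Hypothetical helper: additionally map ł/Ł to l/L.

    NFD leaves ł untouched, so this mapping has to be explicit.
    Whether it is desirable for Polish search is a separate question.
    """
    return strip_combining_marks(text).translate(str.maketrans("łŁ", "lL"))


# Letters whose NFD form is still a single code point, i.e. no decomposition.
UNDECOMPOSED = [
    letter
    for letter in POLISH_LETTERS
    if len(unicodedata.normalize("NFD", letter)) == 1
]
```

With these definitions, UNDECOMPOSED contains only ł, strip_combining_marks("łódź") gives "łodz", and fold_for_search("łódź") gives "lodz".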


Replies

ks2048, today at 4:10 PM

When a Polish speaker searches for something with “ł”, do they expect to also see “l”?
