Fun fact: when treated with unicode Normalization Form Canonical Decomposition, 8 out of 9 polish le...

notathrowaway51 • today at 3:01 PM

Fun fact: under Unicode Normalization Form D (NFD, canonical decomposition), 8 of the 9 Polish letters with diacritics (ż, ó, ć, ę, ś, ą, ź, ń) break down into a base letter plus a combining diacritical mark, but ł stays intact. That means you can't rely on the remove_diacritics option of SQLite's unicode61 tokenizer to normalize Polish text for FTS.
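
For anyone who wants to check this, here is a minimal sketch using Python's standard unicodedata module. It only demonstrates the NFD behavior and does not touch SQLite. The helper names strip_marks and fold_polish are hypothetical, and the explicit ł -> l mapping is just one possible workaround, assuming that folding is what you want for search.

```python
import unicodedata

POLISH = "żóćęśąźńł"

def strip_marks(text):
    """Decompose to NFD, then drop combining marks (generic diacritic stripping)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold_polish(text):
    """Hypothetical helper: strip marks, then map ł/Ł by hand since NFD leaves them alone."""
    return strip_marks(text).replace("ł", "l").replace("Ł", "L")

# Show the code points each letter decomposes into under NFD.
for ch in POLISH:
    nfd = unicodedata.normalize("NFD", ch)
    print(ch, [f"U+{ord(c):04X}" for c in nfd])

print(strip_marks("żółć"))  # ł survives mark stripping
print(fold_polish("żółć"))  # ł folded by the explicit mapping
```

Every letter except ł prints two code points (base letter plus combining mark), while ł prints only U+0142. If you go the pre-folding route, the same function would have to be applied to both the indexed text and the query.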

Replies

ks2048 • today at 4:10 PM

When a Polish speaker searches for something with “ł”, do they expect to also see “l”?
