Maybe not at super large font sizes. But even lowercase i and l are easy enough to confuse at a glance mid-word in most sans-serif fonts, not to mention uppercase I and lowercase l. You don’t even need “confusable” glyphs to create a domain name that will stand up to a casual visual confirmation from a busy user in a phishing context.
This other article from the same author is more interesting: https://paultendo.github.io/posts/unicode-confusables-nfkc-c...
But what about 'Ы'? It looks like 'bl', doesn't it? 'Ы' is one codepoint and one glyph, whereas 'bl' is a sequence of two letters. I believe the method described will miss such things. Cyrillic also has 'Ю'; I suppose it is possible to design a font that makes it look like 'lO'? Are there any fonts like this in the wild?
Thanks for the effort!
I'm always intrigued by how the German FE-Schrift ("fälschungserschwerende Schrift", "more-difficult-to-forge font") chooses shapes for characters that make it hard for them to be turned into one another (like a 3 into an 8 or so):
> 82 pairs are pixel-identical
> a string like “аpple.com” with Cyrillic а (U+0430) is pixel-identical to “apple.com” in 40+ fonts. The user, the browser’s address bar, and any visual review process all see the same pixels. This is not theoretical. It is a measured property of the font files shipping on every Mac.
Current implementations of "Computer Use" Agentic AI tools mostly use visuals -- screenshotting of a computer screen and interpreting it.
These pixel-identical character pairs will be a straight failure mode for those automations and could possibly be a threat vector if crafted well.
I'm not an expert, I've just been "vibe-R&D"-ing computer vision for a bit now, but I'll guarantee you SSIM is not suitable for this purpose. I've been dabbling in basically this area (comparing small, potentially low-resolution images) and SSIM produces a lot of false negatives and some false positives.
I would recommend template matching using normalized cross-correlation (TM_CCOEFF_NORMED in OpenCV).
Also this paper from Nvidia critically scrutinizing SSIM may be relevant: https://research.nvidia.com/publication/2020-07_Understandin...
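To make the suggestion above concrete, here is a minimal sketch of zero-mean normalized cross-correlation on two equal-size glyph bitmaps. For a template and image of the same size, this is the single score that OpenCV's `cv2.matchTemplate(..., cv2.TM_CCOEFF_NORMED)` would report. The toy 5x3 bitmaps and the names `ncc`, `GLYPH_BAR` and `GLYPH_T` are made up for illustration; real use would rasterize glyphs from the font files.

```python
import math

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size grayscale bitmaps.

    Returns a value in [-1, 1]; 1.0 means the bitmaps are identical up to
    brightness/contrast. Returns 0.0 if either bitmap is completely flat.
    """
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("bitmaps must have the same size")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return num / den if den else 0.0

# Hypothetical 5x3 bitmaps: a plain vertical bar and a bar with a top stroke.
GLYPH_BAR = [[0, 1, 0]] * 5
GLYPH_T = [[1, 1, 1]] + [[0, 1, 0]] * 4

if __name__ == "__main__":
    print(ncc(GLYPH_BAR, GLYPH_BAR))  # identical glyphs score highest
    print(ncc(GLYPH_BAR, GLYPH_T))    # similar but distinguishable glyphs score lower
```

Where to put the "confusable" threshold is a separate question that this sketch does not answer.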
An interesting attempt, Claude. However, your prompt is missing an important step to measure effectiveness against humans: wait 40-60 years for your vision to degrade naturally, and check the confusables again, preferably on a small phone screen. Bonus points if you can find someone with visual disabilities from birth. Obviously most attacks aren't pixel-perfect, but that's not the point: all you need to confuse are human eyes.
Things like the Fraktur characters are obvious mismatches in any font I know, so I do wonder why they're on the list.
That's super interesting, but at the same time, I think the primary concern is not whether they are literally the same but whether a user is likely to confuse them in a small font you don't have control over, in a place they are not likely to pay attention to (e.g. the address bar).
Even if the two characters look quite different, it is a problem if they both look like the same letter in different fonts. It doesn't matter if you can tell the difference between the glyphs in a side-by-side comparison. What matters is what letter the user interprets the glyph as.
Making 0 and O, or l and I, look the same within a single font is a crime of modern typography.
Also, I remember that the 8x16 VGA font that came with KeyRus had some slight differences between Cyrillic and Latin lookalikes, which brought a strange sense of comfort when reading, and especially when typing the letter c, because its Cyrillic lookalike is located on the same key.
I think we'll have to start configuring our client tools (e.g. browser, email client, etc) to render domain names with annotations for different character classes. E.g. our native character set is a standard color (blue/black) and then other character sets would have to stand out (purple background?).
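A rough sketch of what such an annotation could look like, using bracket markers in place of colors. It guesses the script from the first word of the Unicode character name, which is only an approximation of the real Unicode Script property; `script_of` and `annotate` are hypothetical names, not part of any browser or mail client.

```python
import unicodedata

def script_of(ch):
    """Approximate a character's script from its Unicode name (heuristic only)."""
    if ch.isascii():
        return "LATIN" if ch.isalpha() else "COMMON"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def annotate(domain, native="LATIN"):
    """Mark every character whose script is not the user's native one."""
    out = []
    for ch in domain:
        script = script_of(ch)
        if script in (native, "COMMON"):
            out.append(ch)
        else:
            out.append(f"[{script}:{ch}]")
    return "".join(out)

if __name__ == "__main__":
    # \u0430 is CYRILLIC SMALL LETTER A, written as an escape so it is
    # visibly different from Latin "a" in the source.
    print(annotate("apple.com"))
    print(annotate("\u0430pple.com"))
```

A real client would presumably style the flagged characters rather than insert text, but the classification step would be similar.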
Hmm, is SSIM a good metric for comparing fonts? I'd imagine it isn't ideal, as fonts are mostly textureless and SSIM has no concept of glyph identity or typographic intent.
> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical.
that is very interesting.
I imagine the browser could take some context clues and switch rendering to Punycode if the user's locale is nowhere near a Cyrillic region. But that is only going to patch some edge cases and miss others.
Ideally, the solution is password managers everywhere, which don't have this vulnerability, instead of relying on human eyes to visually recognize web URLs, which is what makes users vulnerable.
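As a sketch of the locale idea above: show a label in its Punycode form whenever it contains characters from a script the user is not expected to read. The `allowed_scripts` parameter and the name-based script guess are assumptions for illustration, not how any browser actually decides, and the sketch skips the normalization that real IDNA processing performs.

```python
import unicodedata

def scripts_in(label):
    """Set of scripts in a label, guessed from Unicode character names (heuristic)."""
    found = set()
    for ch in label:
        if ch.isascii():
            if ch.isalpha():
                found.add("LATIN")
            continue
        name = unicodedata.name(ch, "")
        found.add(name.split(" ")[0] if name else "UNKNOWN")
    return found

def display_form(domain, allowed_scripts):
    """Return the domain with any label using a non-allowed script shown as Punycode."""
    shown = []
    for label in domain.split("."):
        if scripts_in(label) <= set(allowed_scripts):
            shown.append(label)
        else:
            # Python's built-in "punycode" codec encodes one label; "xn--" is the IDNA prefix.
            shown.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(shown)

if __name__ == "__main__":
    spoof = "\u0430pple.com"  # first letter is Cyrillic U+0430
    print(display_form(spoof, {"LATIN"}))              # xn--pple-43d.com
    print(display_form(spoof, {"LATIN", "CYRILLIC"}))  # left as typed
```

As the comment says, this only patches some cases: a user whose allowed set includes Cyrillic sees the spoof unchanged.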
Good read (as is the next article in the series), but you can tell it hasn't been proofread due to "paypa׀.com" being described as a danger. Maybe in a different font than the website's, but in that case, maybe this should have been rendered out.
Was it a demo site? The font looks very wonky, not sure if I should copy-paste from it.
This is really cool. I loved the technical breakdown and side by side comparisons. Surprised to hear that Microsoft and MacOS default fonts didn't score so well!
This seems misguided. The fact that 'ρ' isn't a pixel-for-pixel match for 'p' doesn't mean they're not confusable. The threat model is not being unable to solve a spot-the-difference puzzle. Unless you are familiar with every pixel of your system fonts, and carefully scrutinize every character on your screen, the lack of an exact match in jρmorgan[.]com in a URL is going to do very little for you. There are many English characters that have multiple totally distinct ways to write them, so you can have two 'a' variants that are distinct but equally 'normal' looking. I guess if you get an LLM to write your blog posts they don't have to make much sense to begin with.
Oof, I couldn't get far in this; the font is giving me motion sickness somehow.
Was that the intention?
well, you didn't really do anything, did you? Claude Code rendered these things and wrote the blog post haha
> "This is not theoretical. It is a measured property of the font files shipping on every Mac."
Some patterns of speech are so recognizably LLM that I am convinced the AI detection startups have a very strong chance of succeeding on text.
Why are all the descending letters truncated in the titles? Not sure if it's a CSS glitch or a terrible font choice. A bit ironic in an article about fonts.
This is very cool, impressive piece of work Paul.
About 20 years ago I used Cyrillic confusables to watermark internal documentation that was being leaked by a disgruntled customer service employee. The document would render dynamically and include the employee ID encoded as bits in the text. It survived copy/paste to plain text well.
I did run into some issues in early versions when characters in Linux commands or visible web addresses were replaced. Fortunately the source docs were HTML, and it was easy to exclude code or pre nodes when rendering.
I thought this was so clever, but the leaker was never caught using it, to the best of my knowledge.
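For anyone curious how such a watermark might work, here is a minimal sketch, not the original implementation. Each Latin letter that has a Cyrillic lookalike carries one bit: left as Latin for 0, swapped for 1. The letter table, the 16-bit ID width and the function names are all assumptions, and it treats the input as plain text that contains no Cyrillic to begin with. A real version would apply it only to HTML text nodes outside code or pre, as described above.

```python
# Latin letter -> visually similar Cyrillic letter (a small illustrative subset).
CONFUSABLE = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435",
    "o": "\u043e", "p": "\u0440", "x": "\u0445",
}
REVERSE = {cyr: lat for lat, cyr in CONFUSABLE.items()}

def embed(text, employee_id, bits=16):
    """Hide employee_id (least significant bit first) in the first `bits` carrier letters."""
    out, used = [], 0
    for ch in text:
        if ch in CONFUSABLE and used < bits:
            bit = (employee_id >> used) & 1
            out.append(CONFUSABLE[ch] if bit else ch)
            used += 1
        else:
            out.append(ch)
    if used < bits:
        raise ValueError("text has too few carrier letters for the ID")
    return "".join(out)

def extract(text, bits=16):
    """Recover the ID from a leaked copy by reading the first `bits` carrier letters."""
    value, seen = 0, 0
    for ch in text:
        if seen == bits:
            break
        if ch in CONFUSABLE:      # Latin carrier: bit 0
            seen += 1
        elif ch in REVERSE:       # Cyrillic lookalike: bit 1
            value |= 1 << seen
            seen += 1
    return value

if __name__ == "__main__":
    doc = "Please open the account page, choose export, and copy the code once complete."
    marked = embed(doc, 0xBEEF)
    print(marked == doc, extract(marked) == 0xBEEF)
```

Because the swap is one code point for one code point, the marked text keeps its length and survives copy/paste as long as nothing normalizes or transliterates it.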