wvenable • yesterday at 11:24 PM • 5 replies

This really doesn't have anything to do with C#. This is your classic nvarchar vs varchar issue (or unicode vs ASCII). The same thing happens if you mix collations.

I'm not sure why anyone would choose varchar for a column in 2026 unless you have some sort of ancient backwards compatibility situation.

Replies

dspillett • today at 12:45 AM

> I'm not sure why anyone would choose varchar for a column in 2026

The same string takes roughly half the storage space, meaning more rows per page and therefore a smaller working set needed in memory for the same queries and less IO. Any indexes on those columns will be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set[0], stick with [VAR]CHARs[1], otherwise use N[VAR]CHARs.

Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.

Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages on every access, because they remain compressed in memory as well as on disk.

--------

[0] Codes with fixed ranges, etc.

[1] Some would put that the other way around, “use NVARCHAR if you think there might be any non-ASCII characters”, but defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.
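
To make the UTF8 versus UTF16 size trade-off above concrete, here is a minimal Python sketch that compares encoded byte counts for a few sample strings. The sample values and the `encoded_sizes` helper are made up for illustration, and this measures only the raw encoded text, not SQL Server's actual on-disk format (row overhead, length prefixes, and compression are ignored).

```python
# Illustrative only: compare the raw encoded size of the same text
# in UTF-8 and UTF-16. The sample strings are hypothetical.
SAMPLES = {
    "ascii code": "ORD-2026-0001",
    "western european": "café résumé",
    "cjk": "日本語のテキスト",
}


def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, encoded_sizes(text))
```

Pure ASCII comes out at half the size in UTF8, while text dominated by CJK characters comes out larger in UTF8 than in UTF16, which is the case the comment warns about.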

beart • yesterday at 11:31 PM

I agree with your first point. I've seen this same issue crop up in several other ORMs.

As to your second point: VARCHAR uses N + 2 bytes, whereas NVARCHAR uses N*2 + 2 bytes for storage (at least on SQL Server). The vast majority of character fields in databases I've worked with do not need to store Unicode values.
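
A rough worked example of those two formulas, as a Python sketch. The function names, the 20-character value, and the row count are all assumptions chosen for illustration; the arithmetic assumes a single-byte code page for VARCHAR and no supplementary characters for NVARCHAR, and it ignores row overhead and compression.

```python
# Illustrative arithmetic for the N + 2 vs N*2 + 2 formulas quoted above.
# Function names and the example numbers are hypothetical.

def varchar_bytes(n_chars):
    """Bytes for a VARCHAR value: one byte per character plus 2 bytes of length."""
    return n_chars + 2


def nvarchar_bytes(n_chars):
    """Bytes for an NVARCHAR value: two bytes per character plus 2 bytes of length."""
    return n_chars * 2 + 2


if __name__ == "__main__":
    n_chars = 20          # assumed average value length
    rows = 1_000_000      # assumed row count
    v = varchar_bytes(n_chars) * rows
    nv = nvarchar_bytes(n_chars) * rows
    print(f"VARCHAR:  {v:,} bytes")
    print(f"NVARCHAR: {nv:,} bytes")
    print(f"ratio:    {v / nv:.2f}")
```

For values of this length the VARCHAR column needs a little over half the bytes of the NVARCHAR one, in line with the "roughly half" figure given earlier in the thread.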

SigmundA • yesterday at 11:57 PM

Yes I have run into this regardless of client language and I consider it a defect in the optimizer.

applfanboysbgon • today at 12:08 AM

I think this is a rather pertinent showcase of the danger of outsourcing your thinking to LLMs. This article strongly indicates to me that it is LLM-written, and it's likely the LLM diagnosed the issue as being a C# issue. When you don't understand the systems you're building with, all you can do is take the plausible-sounding generated text about what went wrong for granted, and then I suppose regurgitate it on your LLM-generated portfolio website in an ostensible show of your profound architectural knowledge.

paulsutter • today at 1:05 AM

Utf8 solved this completely. It works with any length unicode and on average takes up almost as little storage as ascii.

Utf16 is brain dead and an embarrassment
