logoalt Hacker News

homebreweryesterday at 8:09 AM2 repliesview on HN

Because it's far more reliable to use proper parsers instead of a bunch of regular expressions. Most languages cannot be properly parsed with regexes.

Those files are compiled tree-sitter grammars, read up on why it exists and where it is used instead of me poorly regurgitating official documentation:

https://tree-sitter.github.io/tree-sitter


Replies

f311ayesterday at 8:22 AM

Funny enough, they are less than 10MB when compressed. I guess they could use something like upx to compress these binaries.

The whole Linux release is 15mb, but it uncompresses to 16MB binary and 200MB grammars on disk.

Why do we need to have 40MB of Verilog grammars on disk when 99% of people don't use them?

show 2 replies
Panzerschrekyesterday at 12:31 PM

For real parsing a proper compiler codebase (via a language server implementation) should be used. Writing something manually can't work properly, especially with languages like C++ and Rust with complex includes/imports and macros. Newer LSP editions support syntax-based highlighting/colorizing, but if some LSP implementation doesn't support it, using regexp-based fallback is mostly fine.