Hacker News

Beagle, a source code management system that stores AST trees

93 points by strogonoff, yesterday at 1:28 PM | 46 comments

Comments

nzoschke · yesterday at 6:08 PM

In https://replicated.wiki/blog/partII this part is very interesting to me:

> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.

I now have a bunch of layers of text / markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or outright replacements of these in every repo or subproject.

Then we want to do things like update the "root" system prompt and have that applied everywhere.

There are analogies in git, CMS templating systems, software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.

Any other approaches to this problem? Or are Beagle and ASTs and CRDTs really onto something here?
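One way to sketch the "root prompt with per-repo overrides" idea is a layered lookup, where the nearest layer wins and updating the root propagates everywhere it isn't overridden. This is only an illustrative sketch with hypothetical layer contents, not how Beagle actually models overlays:

```python
# Sketch of layered prompt resolution; layer names and contents are
# hypothetical, not Beagle's actual design.
from collections import ChainMap

root_prompts = {"system": "You are a helpful coding agent.", "style": "Be terse."}
repo_prompts = {"style": "Follow this repo's CONTRIBUTING.md."}
subproject_prompts = {"system": "You are a database migration expert."}

# Nearest layer wins; editing root_prompts["system"] propagates to every
# repo/subproject that does not override it.
resolved = ChainMap(subproject_prompts, repo_prompts, root_prompts)

print(resolved["system"])  # subproject override
print(resolved["style"])   # repo override
```

A CRDT overlay would add mergeable concurrent edits on top of this, but the resolution order is the same idea.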

a-dub · yesterday at 4:39 PM

Mmm, interesting and fun concept, but it seems to me that text is actually the right layer for storing and expressing changes, since that is what gets read, changed, and reasoned about. Why does it make more sense to use ASTs here?

Are these ASTs fully normalized, or do (x) and ((x)) produce different trees while still expressing the same thing?

Why change what is being stored and tracked when the language-aware metadata for each change can be generated after the fact (or alongside the changes)? Adding transform layers between what appears and what gets stored/tracked seems like it could get confusing.
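For what it's worth, the (x) vs ((x)) question has a concrete answer for at least one parser: Python's stdlib `ast` normalizes redundant parentheses away, so both expressions produce the identical tree. (This says nothing about the parsers Beagle uses, only that some ASTs are normalized this way.)

```python
import ast

# Redundant parentheses do not survive parsing in Python's ast module:
# "(x)" and "((x))" yield the same tree.
t1 = ast.dump(ast.parse("(x)", mode="eval"))
t2 = ast.dump(ast.parse("((x))", mode="eval"))
print(t1 == t2)  # True
```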

majkinetor · yesterday at 6:26 PM

A somewhat similar project is Unison:

https://www.unison-lang.org/docs/the-big-idea

ValentineC · yesterday at 7:11 PM

Mildly pedantic, but AST already stands for Abstract Syntax Tree, so the post title when unabbreviated is Abstract Syntax Tree trees.

MadxX79 · yesterday at 3:50 PM

Can it store my PIN numbers and my map of ATM machines also?

omoikane · yesterday at 6:20 PM

The linked page looks like a subsystem of some specific library; I am not sure it is intended for general use.

If it were intended as a general replacement for general-purpose version control systems, I am not sure how storing ASTs is better than storing the original plain-text files, since the transformation from text to AST might be lossy. I might want to store files with no AST (e.g. plain text files), files with multiple ASTs (e.g. polyglots), multiple files with the same AST (e.g. files testing different code layouts), or files with broken ASTs (e.g. data files used as test cases). These use cases are trivially supported by storing the original file as-is, whereas storing any processed form of the file would require extra work.
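The lossiness concern is easy to demonstrate with Python's stdlib `ast` (used here purely as an example parser, not anything Beagle does): comments, redundant parentheses, and spacing do not survive a text → AST → text round trip.

```python
import ast

source = "x = (1 +  2)  # important note"

# Parse to an AST and regenerate source: the comment, parentheses,
# and extra spacing are all gone.
round_tripped = ast.unparse(ast.parse(source))
print(round_tripped)  # x = 1 + 2
```

An AST-based store has to carry this kind of trivia out of band (or in the tree) to reproduce files byte-for-byte.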

xedrac · yesterday at 4:35 PM

This sounds good in theory, but it means Beagle needs to understand how to parse every language and keep up with how they evolve. That sounds like a ton of work, and a regression could be a disaster. It'll be interesting to see how this progresses, though.

ktpsns · yesterday at 3:45 PM

Glad to see this. We can do better than git.

BlueHotDog2 · yesterday at 9:20 PM

What bothers me is: while CRDTs converge, the question is to what. In this case, it seems like there's a last-write-wins semantic, which is very problematic as an implicit assumption for code (or anything where this isn't the explicit invariant).
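A minimal last-write-wins register makes the concern concrete (this is a generic illustrative sketch, not Beagle's actual CRDT): both replicas converge, but one concurrent edit is silently dropped rather than surfaced as a conflict.

```python
# Minimal LWW register sketch. Ties between equal clocks are broken by
# replica id, so convergence is deterministic but arbitrary.
from dataclasses import dataclass, field

@dataclass
class LWWRegister:
    value: str = ""
    stamp: tuple = field(default=(0, ""))  # (logical clock, replica id)

    def write(self, value, clock, replica_id):
        if (clock, replica_id) > self.stamp:
            self.value, self.stamp = value, (clock, replica_id)

    def merge(self, other):
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp

a, b = LWWRegister(), LWWRegister()
a.write("fix bug in parser", clock=1, replica_id="a")
b.write("refactor parser", clock=1, replica_id="b")  # concurrent write
a.merge(b); b.merge(a)
print(a.value == b.value)  # True: converged...
print(a.value)             # ...but replica a's edit was silently lost
```

For code, "converged but lost an edit" is usually worse than a merge conflict, which is exactly the implicit-assumption problem raised above.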

westurner · yesterday at 5:15 PM

It makes a lot of sense for math-focused LLMs to work with higher order symbols - or context-dependent chunking - than tokens. The same is probably true for software.

From "Large Language Models for Mathematicians (2023)" (2025) https://news.ycombinator.com/item?id=42899805 :

> It makes sense for LLMs to work with testable code for symbolic mathematics; CAS Computer Algebra System code instead of LaTeX which only roughly corresponds.

> Are LLMs training on the AST parses of the symbolic expressions, or token co-occurrence? What about training on the relations between code and tests?

There are already token-occurrence relations between test functions and the functions under test that they call. What additional information would it be useful to parse, extract, and graph-rewrite onto source code before training, looking up embeddings, and agent reasoning?