I noticed the same thing, having also written an interpreter for the Wolfram Language that focused on the core rule/rewriting/pattern language. At its heart it's more or less a Lisp-like language where the core can be quite small and a lot of the functionality is built via pattern matching and rewriting atop that. Aside from the sheer scale of WL, I ended up setting aside my experiments replicating it when I did performance comparisons and realized how challenging it would be to match WL not just in functionality but also in performance.
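To make the "small core" idea concrete, here is a minimal sketch in Python of a pattern-matching rewriter. It is only an illustration, not how WL, Woxi, or my own interpreter actually work: expressions are assumed to be nested tuples of the form `(head, arg1, ...)`, and the names `Var`, `match`, `substitute`, `rewrite_once`, `normalize`, and `RULES` are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Named pattern variable, loosely analogous to x_ in WL (hypothetical)."""
    name: str


def match(pattern, expr, env):
    """Return bindings extended so that pattern matches expr, or None."""
    if isinstance(pattern, Var):
        if pattern.name in env:
            # A repeated variable must match the same subexpression.
            return env if env[pattern.name] == expr else None
        return {**env, pattern.name: expr}
    if isinstance(pattern, tuple):
        if not isinstance(expr, tuple) or len(pattern) != len(expr):
            return None
        for p, e in zip(pattern, expr):
            env = match(p, e, env)
            if env is None:
                return None
        return env
    # Atoms (symbols as strings, numbers) match by equality.
    return env if pattern == expr else None


def substitute(template, env):
    """Replace pattern variables in template with their bound values."""
    if isinstance(template, Var):
        return env[template.name]
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return template


def rewrite_once(expr, rules):
    """Rewrite subexpressions first, then apply the first matching rule at the root."""
    if isinstance(expr, tuple):
        expr = tuple(rewrite_once(e, rules) for e in expr)
    for lhs, rhs in rules:
        env = match(lhs, expr, {})
        if env is not None:
            return substitute(rhs, env)
    return expr


def normalize(expr, rules, max_steps=100):
    """Apply rewrite passes until nothing changes (a fixed point)."""
    for _ in range(max_steps):
        new = rewrite_once(expr, rules)
        if new == expr:
            return expr
        expr = new
    raise RuntimeError("no fixed point within max_steps")


x = Var("x")
# The "library" lives in rules, not in the core.
RULES = [
    (("Plus", x, 0), x),
    (("Times", x, 1), x),
    (("Times", x, 0), 0),
]

result = normalize(("Plus", ("Times", "a", 1), ("Times", "b", 0)), RULES)
print(result)  # a
```

The point of the sketch is that the core (matching, substitution, a fixed-point loop) stays tiny while behavior is added as rules. It ignores everything that makes the real thing hard, such as sequence patterns, attributes like `Orderless`/`Flat`, evaluation control, and any kind of rule indexing for performance.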
Woxi reminds me of some experiments I did to see how far vibe coding could get me on similar math and symbolic reasoning tools. It seems like unless you explicitly and very actively force a design with a small core, the models tend towards building out a lot of complex, hard-coded logic that ultimately is hard to tune, maintain, or reason about in terms of correctness.
Woxi is an interesting exercise in terms of what vibe coding can produce. I'm not sure about it as a WL implementation, though.
(For context, I write compiler/interpreter tools for a living, and have for a couple of decades.)