Sounds like OP did some pretty cool engineering to make this run performantly. Definitely not your run-of-the-mill AI slop.
why? at least the chart in the docs suggests otherwise.
eh.
performance is easy. you can craft a test suite that will allow a ralph loop to iterate until it hits the metrics.
the hard part is style/feel/usability. LLMs still suck at that stuff, and crafting tests to produce those metrics is nigh impossible.
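rough sketch of the "iterate until it hits the metrics" idea, to make it concrete. everything here is a stand-in: `ralph_loop`, `agent_step` and `benchmark` are hypothetical names, not anything from OP's project or a real tool. in practice `agent_step` would be an LLM agent run and `benchmark` would be your perf suite.

```python
from typing import Callable, Tuple

def ralph_loop(
    agent_step: Callable[[float, float], float],  # hypothetical: one agent pass over the candidate
    benchmark: Callable[[float], float],          # hypothetical: returns a metric, lower is better
    candidate: float,
    target: float,
    max_iters: int = 100,
) -> Tuple[float, float, int]:
    """Re-run the agent until the benchmark metric meets the target.

    Returns (candidate, metric, iterations_used).
    """
    metric = benchmark(candidate)
    for i in range(max_iters):
        if metric <= target:
            return candidate, metric, i
        # Feed the last measurement back in and try again.
        candidate = agent_step(candidate, metric)
        metric = benchmark(candidate)
    # Gave up: return the last attempt, whether or not it met the target.
    return candidate, metric, max_iters

# Toy stand-ins: the "candidate" is just a latency number in ms,
# and each "agent pass" shaves 10% off it.
def toy_agent(candidate: float, metric: float) -> float:
    return candidate * 0.9

def toy_benchmark(candidate: float) -> float:
    return candidate
```

this only works because `benchmark` hands back a number you can compare against a target. there's no equivalent function for "feels good to use", which is the whole problem.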
It doesn't run at all. If you can get it running, let me know.