> Antigravity was the only autonomous agent that implemented the Pantheon’s signature interior ceiling pattern: repeated square coffers visible through the oculus.
That is seriously really impressive. I looked at the 3D model and didn't even think to LOOK INSIDE the building before reading this.
Here's [1] the 3D model with `show_cutaway` enabled.
[1] https://modelrift.com/models/pantheon-benchmark-antigravity-...
I've had such a bad time trying to do this myself. You might get a half-way decent draft on the first try and then you start to "debug" this and after a very frustrating session you realize that the model can't properly "see" the results. That is, you just can't iterate on it, at all.
I'm guessing that most harnesses/tools will resize an image before processing and in doing so will lose enough detail to make it much harder to reason about, especially for wireframe images.
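To make the resizing guess concrete, here is a toy sketch of what happens to one-pixel wireframe lines under a plain box-filter downscale. It assumes the harness averages pixel blocks; real tools may resample differently, and the image size, grid spacing, and scale factor are made-up numbers for illustration.

```python
def make_wireframe(size, spacing):
    """Return a size x size grid: 1.0 on one-pixel grid lines, 0.0 elsewhere."""
    return [
        [1.0 if (x % spacing == 0 or y % spacing == 0) else 0.0 for x in range(size)]
        for y in range(size)
    ]


def box_downscale(img, factor):
    """Average each factor x factor block into one pixel (simple box filter)."""
    n = len(img) // factor
    out = []
    for by in range(n):
        row = []
        for bx in range(n):
            block = [
                img[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def max_contrast(img):
    """Difference between the brightest and darkest pixel."""
    flat = [p for row in img for p in row]
    return max(flat) - min(flat)


# Hypothetical numbers: a 512 px render with lines every 16 px, shrunk 8x.
original = make_wireframe(512, 16)
small = box_downscale(original, 8)
print(max_contrast(original), max_contrast(small))
```

In this toy case the lines go from full contrast to under a quarter of it, which is at least consistent with thin edges becoming hard to reason about after resizing.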
I'm sure I'm holding it wrong, but this test didn't really test this. It was just a one-off. That breaks down pretty quickly, especially if you don't have reference pictures of what you are trying to create.
Antigravity may well Top the whatever benchmark but:
My Antigravity (forced) replacement for Gemini CLI requires me to log on via browser every time I use it, and my Antigravity IDE won't update at all, so:
If it's ok I'd prefer they just work on reaching a baseline acceptable rollout before worrying about being Top in anything.
Ps actual title:
OpenSCAD LLM Benchmark: Building the Pantheon
I've run tons of benchmarks for OpenSCAD for all kinds of models and setups, and what I realised is:
- Models are very jagged (might excel in one type of 3d model, but not another)
- Gemini models are the least jagged in my experience and have the best image understanding
- Gemini models are also the most creative (which may be undesirable if you want a precise CAD part)
- Overall this benchmark doesn't prove much because one 3D model (and one attempt) is just not enough. I usually test on at least a dozen models, each generated 3 times. I should really do much more, but it's too pricey for a solo dev.
Still, thanks for publishing this. Will definitely run Flash 3.5 soon to see how it performs.
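For anyone wanting to copy the "dozen models, three attempts each" setup, a minimal sketch of the bookkeeping might look like this. The LLM names, part names, and scores are invented for illustration, and using the spread of per-part averages as a "jaggedness" number is my own assumption, not an established metric.

```python
from statistics import mean, stdev


def summarize(scores):
    """scores: {llm_name: {part_name: [score for each attempt]}}.

    Returns, per LLM, the mean of per-part averages and their standard
    deviation, used here as a rough proxy for how jagged the LLM is.
    """
    summary = {}
    for llm, per_part in scores.items():
        part_means = [mean(attempts) for attempts in per_part.values()]
        summary[llm] = {
            "overall": mean(part_means),
            "jaggedness": stdev(part_means) if len(part_means) > 1 else 0.0,
        }
    return summary


# Made-up scores (0 to 10), three attempts per part.
example_scores = {
    "llm_a": {"bracket": [8, 7, 9], "gear": [2, 3, 1], "enclosure": [8, 8, 8]},
    "llm_b": {"bracket": [6, 6, 6], "gear": [5, 6, 7], "enclosure": [6, 5, 7]},
}

summary = summarize(example_scores)
for llm, stats in summary.items():
    print(llm, stats)
```

Both invented LLMs average the same overall score here, but only one of them does so consistently, which a single-model, single-attempt benchmark cannot show.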
Creating a single real-world object and declaring it a benchmark? No, it doesn't work that way for a robust tool. You need to do something like Iron Chef, with a Greek architecture theme and a panel or judge that declares the winner. This is just seeing which tool subjectively makes the best looking Pantheon.
I'm unconvinced, this is one of the most iconic historical buildings with tomes written about it and plenty of existing photographs and public models to train on.
I would be more interested in benchmarking the modeling of an anonymous structure based on provided references alone. It kind of feels like the shallow magic of watching an LLM one-shot a to-do app.
I tried Claude code designing a snap fit, vase mode printed box. Ultimately didn't work out, it couldn't get the tolerances right and kept designing features that wouldn't print in vase mode.
Scad needs unit tests. It would be powerful to assert that a profile doesn't have a slope greater than 45°, that the intersection of two objects is null, or that a part has a specific volume.
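As a rough sketch of what such checks could look like outside OpenSCAD, here are the three assertions applied to the design parameters instead of the rendered mesh. Everything here is an assumption for illustration: the profile is a list of (radius, height) points, the angle is measured from vertical, and the parts are simplified to axis-aligned boxes.

```python
import math


def max_overhang_deg(profile):
    """profile: list of (r, z) points with strictly increasing z.

    Returns the largest angle from vertical between consecutive points.
    """
    worst = 0.0
    for (r0, z0), (r1, z1) in zip(profile, profile[1:]):
        if z1 <= z0:
            raise ValueError("z must strictly increase along the profile")
        angle = math.degrees(math.atan2(abs(r1 - r0), z1 - z0))
        worst = max(worst, angle)
    return worst


def boxes_overlap(a, b):
    """a, b: boxes as ((xmin, ymin, zmin), (xmax, ymax, zmax)).

    True only if the interiors intersect; touching faces do not count.
    """
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))


def box_volume(box):
    """Volume of an axis-aligned box."""
    (x0, y0, z0), (x1, y1, z1) = box
    return (x1 - x0) * (y1 - y0) * (z1 - z0)


# Hypothetical vase-mode box: wall profile plus a body and a lid.
wall_profile = [(10, 0), (12, 5), (12, 10)]
body = ((0, 0, 0), (20, 20, 10))
lid = ((0, 0, 10), (20, 20, 12))

assert max_overhang_deg(wall_profile) <= 45
assert not boxes_overlap(body, lid)
assert box_volume(body) == 4000
```

Checks on real geometry would need a mesh library or support in the CAD tool itself; this only shows the shape of the assertions.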
It also needs cutaway views. I got okay results using boxes to remove everything except a sliver, to view a slice and internal details. But without hash marks, texture, or outlines it can be hard to tell the forms apart.
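One way to script the sliver trick is to generate the OpenSCAD wrapper from a small helper. This sketch keeps a thin slab by intersecting the model with a box, which should be equivalent to subtracting boxes on either side. The module name `pantheon()`, the slab axis, and the default extent are hypothetical.

```python
def slice_view_scad(model_call, y_center, thickness, extent=1000):
    """Return OpenSCAD source that keeps only a thin slab of the model.

    The slab is centred on y = y_center, is `thickness` thick, and spans
    `extent` units in x and z (assumed large enough to cover the model).
    """
    return (
        "intersection() {\n"
        f"    {model_call};\n"
        f"    translate([{-extent / 2}, {y_center - thickness / 2}, {-extent / 2}])\n"
        f"        cube([{extent}, {thickness}, {extent}]);\n"
        "}\n"
    )


# pantheon() is a placeholder for whatever module builds the model.
src = slice_view_scad("pantheon()", y_center=0, thickness=2)
print(src)
```

Stepping `y_center` through the model and rendering each slab gives a stack of section views, though it does not solve the missing hatching or outlines.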
Still a long way from shorting Autodesk.
As a side note Autodesk released an agentic assistant back in December for Fusion. Six months later it is still quite bad.
Isn't CadQuery more professional than OpenSCAD, and closer to traditional CAD / mechanical engineering workflows? Not sure which model (ChatGPT, Gemini, or Claude Code) is better for CadQuery code generation.
I've been trying out MCP servers for FreeCAD to mixed results.
One area I had near magic was providing a land survey which includes details in writing of the plat. It took those directions and beautifully reconstructed the boundaries to exact precision in CAD.
Where I ran into trouble was creating good constraints on sketches without being overly explicit. I kept running into it creating distance constraints from an arbitrary point instead of using other elements in the diagram that a human drafter would think to do by default.
I have been using GPT 5.5 to build a video game. Benchmark sounds about right. It generates assets and sprites that are good enough, if not close to AAA-level games. Will check Antigravity now.
That's actually a reason for me to try it again. My past attempts to use LLMs for OpenSCAD have greatly improved my own OpenSCAD skills.
I've been using LLMs to do my OpenSCAD work for over two years now. It's always where I start (and end).
This is a really important project. Preserving humanity’s knowledge and making it openly accessible, including in formats usable by AI systems, feels like one of the most valuable things happening right now. Thank you for the clear technical instructions and the bulk download options.
Projects like Anna’s Archive make it much easier for researchers and builders to work responsibly with large datasets.
That's curious, I've been trying to do some parametric modeling with Claude - and its performance was abysmal.
Why aren't specialized CAD-making LLMs showing up? In the future, are we going to have the same model for everything, from programming to creative writing to CAD?
Claude Code 2.1 / Opus 4.7 looks best to me: the dome and ceiling structure are more correct than the others.
Why is this medium ranked, and not on par with the best two?
This would be the same Antigravity 2.0 that "surprise, no longer an IDE, did I forget to mention that? Lolol."
To be brutally honest, I'm disappointed with antiGravity. It feels incredibly unGoogle-like. The AI billing models are fragmented, and the AntiGravity IDE is currently tripping over something as trivial as a basic Electron deployment config bug.
Don't get me wrong, I don't think AI coding is a bad thing. For East Asians like myself, it levels the playing field with Westerners, so as long as you rigorously review the AI's output, it's a perfectly viable tool.
However, the absolute farce we just witnessed with the antiGravity2.0 update really raises doubts about whether 'vibe coding' can actually be trusted. If even a behemoth like Google is dropping the ball like this, it says a lot.
The only thing moving faster than AI these days is the goalposts. Three years ago we would have been amazed if models were able to produce anything; now we have the luxury of nitpicking. Even the worst entries in the benchmark are quite impressive.
Going to try it. just downloaded. will see how it is compared to Claude Code
So, does it mean Antigravity is better than Claude code with opus model? Given this benchmark. I once tried Antigravity and it was just disappointing.
And yet 300+140=460. A very jagged surface indeed. https://gemini.google.com/share/c2a187275e26
Why Codex GPT-5.5 High instead of Extra High, I wonder?
It's crazy how I can see articles like this, but in my practical everyday use Antigravity is a horrible consumer experience. The TUI is broken. You cannot type input while the model is outputting text, otherwise both get messed up and the TUI renders a sickly blob of text. There are no keyboard shortcuts to switch between planning and execution mode, or a way to directly load skills.
The usage limits are too aggressive, too. I tried to generate a quick Deno Fresh website to act as a redirect to my GitHub from socials (literally the simplest possible thing I could have asked of it) and it chewed through my five hour limit in tokens from scaffolding.
To me, as a developer of CLI developer tooling, it's obvious not a lot of thought or testing went into this product, but as Google has said before: "the models are the product".
Next month they'll be beaten again.
And next year Google will probably sunset Antigravity.
If it doesn't make Google billions, don't trust them.
Why are half of the comments on Hackernews stereotypical AI-bros whose lives revolve around tech, and the other half sceptical commentators whose lives also revolve around tech but they are disappointed with its performance?!
Where are the normal people :/
google..no thanks
I’ve literally never wanted to use OpenSCAD to convert a photo into a model. Usually I have a functional requirement, such as making an enclosure with a spec sheet to work from on the enclosed device.
Claude 4.6 before the lobotomy in Claude code was able to take a PSU spec sheet and my requirements for glands and ports, use YAPP and openscad MCPs to iteratively and unassisted build end to end a printable enclosure that was perfectly suited for the PSU with right dimensions and screw holes, mountings, grills, gland ports, everything, placed for optimal printing. This was the moment I felt like LLMs had really arrived.
A photo of a building? Why. That’s a mesh problem and is about fidelity. A technical spec sheet and diagrams to functional print with intelligent choices about the functional part baked in? That’s useful.
Last weekend I bought my wife a bike off marketplace. It was in good condition but was missing one of the internal cable routing grommets. I gave Claude pictures of the pill-shaped hole by itself and with my digital calipers in the long and short directions.
Gave it a short prompt and it gave me an OpenSCAD model with everything parametrized. I printed with no changes in TPU and it was nearly perfect on the first try. Claude put in a 0.3mm subtraction in the x/y dimensions and I lowered it to 0.1 and it's perfect.
Much easier shape than ancient Roman architecture but still very cool how easy it was.
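The clearance tweak is the kind of thing a parametrized model makes trivial. A minimal sketch of the arithmetic, assuming the subtraction is the total amount removed from each of the long and short dimensions (the comment doesn't say whether it is per side), with made-up hole measurements:

```python
def plug_size(hole_long_mm, hole_short_mm, clearance_mm):
    """Shrink the measured hole by a total clearance on each axis."""
    if clearance_mm < 0 or clearance_mm >= hole_short_mm:
        raise ValueError("clearance must be non-negative and smaller than the hole")
    return (hole_long_mm - clearance_mm, hole_short_mm - clearance_mm)


# Hypothetical caliper readings for the pill-shaped hole, in mm.
HOLE_LONG, HOLE_SHORT = 20.0, 8.0

first_try = plug_size(HOLE_LONG, HOLE_SHORT, 0.3)  # clearance Claude chose
tuned = plug_size(HOLE_LONG, HOLE_SHORT, 0.1)      # clearance after one test print
print(first_try, tuned)
```

With the clearance as a single parameter, going from a loose fit to a snug one is a one-number change and a reprint.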