Most of us were amused when DALL-E and its peers went mainstream, and we were quick to point out the obvious flaws.
Then ChatGPT hit the scene and again, many of us dismissed it as a parlor trick that would never amount to much.
Using LLMs for coding was initially only a small step up from basic code completion, and a welcome farewell to Stack Overflow.
I am curious: what was the specific moment that you went from those quaint, dismissive observations to a slightly panicked, "Uh Oh" realization of what these models can do?
Claude Code has been incredibly helpful extending soap-go to better support XML handling in Go: https://github.com/tnymlr/soap-go
Specifically WSDL/XSD support, for auto generating code and similar from vendor supplied documentation.
The Go ecosystem handles JSON (ie Swagger) fairly well, but in-depth XML handling has been a weak point compared to Java where it's very mature. Claude is helping with closing that gap. :)
For me it was when I asked ChatGPT if a "while true" program would halt and it said it wouldn't. It blew my mind. In my BSc I read and thought a lot about how human reasoning is not a formal reasoning machine, demonstrated by the halting problem, the liar paradox, etc. Suddenly I saw a machine that can go one level above formal reasoning and resemble human reasoning.
It was when they fooled a substantial proportion of the population into thinking AGI was coming soon.
Useful thread. Exciting to see what will be possible in another few years.
My oh shit moment was when tool calling was emerging as a capability. That was the moment I realized that LLMs would be the glue connecting a million different use-cases in a million ways we wouldn't even be able to imagine.
It was the release of Stable Diffusion and its source code.
I spent the next few days tinkering with my own Stable Diffusion implementation. I never got it past outputting total nightmare fuel, but it was fun!
To this day I think of the process as like baking pizzas in a sequence of pizza ovens
Dec 2022:
Articulating ideas: https://x.com/GuiAmbros/status/1598897735955988481
A couple of years ago now.
I asked it to write a script that would search for a specific string in footers in a massive series of DOCX files and change them according to some rules. The strings ended up being embedded in cells within an invisible table in the footers; the LLM realized this and switched strategy to a full deep traversal of the underlying XML. It correctly processed like 50 of these files in about 10 minutes, using libraries I wasn't aware of. I had spent an hour being annoyed before trying.
It was an "oh shit" moment for at least that category of work.
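The comment doesn't say which libraries the LLM ended up using. As a rough illustration of what a "deep traversal of the underlying XML" can look like, here is a minimal sketch with only the Python standard library. It assumes footer parts follow Word's usual `word/footerN.xml` naming and that the target string sits inside a single `<w:t>` run; `replace_in_part` and `rewrite_footers` are hypothetical names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the parts of a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def replace_in_part(xml_bytes: bytes, old: str, new: str) -> tuple[bytes, int]:
    """Replace `old` with `new` in every <w:t> text run, at any nesting depth."""
    root = ET.fromstring(xml_bytes)
    hits = 0
    # iter() walks the whole tree, so runs buried in table cells are reached too.
    for node in root.iter(f"{{{W_NS}}}t"):
        if node.text and old in node.text:
            node.text = node.text.replace(old, new)
            hits += 1
    return ET.tostring(root, encoding="utf-8", xml_declaration=True), hits


def rewrite_footers(docx_bytes: bytes, old: str, new: str) -> bytes:
    """Copy a .docx archive, rewriting only the footer parts (assumed naming)."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            name = item.filename
            if name.startswith("word/footer") and name.endswith(".xml"):
                data, _ = replace_in_part(data, old, new)
            dst.writestr(item, data)
    return out.getvalue()
```

Two caveats for anything beyond a sketch: Word often splits visible text across several runs, so a plain per-run replace can miss matches, and ElementTree may rename namespace prefixes it doesn't know about, which is one reason a real tool would more likely use lxml or python-docx.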
It's yet to happen for real.
Every now and again I will try some AI vibe coding stuff. I will be amazed; it's a fun high to ride. Until you look at the code and realize you've just made a big messy sketch of things and you can spend the next 2 years building the thing properly.
The most "oh shit" moment I think I've had so far is realizing that I often reply to people online who are actually AI. A lot are obvious, but there are also quite a lot out there that have become good at blending in.
I wonder how many people get emotionally triggered, for instance, by AI replies because they think they are human, and then get the idea there really are humans like that out there.
It's really easy to whip up like 200k followers who all agree with you on everything, and it costs less and less time and money to do so.
To me that's a big risk regardless of what cool stuff you can do with it. It's a really tricky one to mitigate too.
I don't know if this was my "Oh Shit" moment but 4 weeks ago I thought I'd try vibe coding a WebGPU 3D Node Based Editor.
https://github.com/greggman/sedon
It was just an experiment and I probably won't work on it more but still, I was blown away with how far we got. There's quite a bit we worked through even though it was only part-time over those 4 weeks.
If you're senior or have opinions about things, you know the feeling of falling into a rabbit hole of stuff you want to fix when you look at certain parts of your system. "I was going to rewrite this 3 months ago", "oh wait this part sucks too", "wtf is this class even for", etc.
Before coding agents, I'd have to weigh fixing these against my official work commitments, often getting shot down when I tried to get it prioritized or tsk tsked for delaying official projects to make code nicer. Now, to a much greater extent, I can just fix the things. The agents aren't perfect and the process isn't anything like hands off, but it's enough of a speedup that I can fit it in alongside my other work without having to get approval for it or try (and fail) to get it formally prioritized.
Not quite an oh shit moment, but having the end result of those rabbit holes be that the problems are fixed is pretty cool, and far preferable to what was often the case before ("we'll put in a ticket and prioritize it during the quality sprint!").
edit to add another:
I've personally never been a big fan of preplanning architecture at a code level. It makes a lot of sense at the system and data modeling levels, but code is both easy to get wrong if you're whiteboarding it before you write it and relatively easy (compared to system design and data modeling) to fix when that happens. If it's just me on a project, I'll happily start bashing it out with a vague idea in mind and evolve the design as I go, knowing that I'll probably throw away a bunch of what I write at first. I know I do good work that way, and I'm not wasting a bunch of up front time on a design I'm likely to throw out later. It's hard to work that way on a team, especially as a lead, for obvious reasons. Coding agents fit really well for that work style. They'll cheerfully write dueling prototypes of my code architecture ideas so I can see which one I hate and which one I like without talking about hypotheticals and abstractions on a whiteboard. They never get mad at me for changing my mind, wasting their time, or throwing away their work. That's pretty cool. I can have a quick, cheap answer to "what would this look like if I got rid of class X and split its responsibilities between Y and Z?", and I don't have to feel guilty for wasting my time or my teammates' time if the answer is "oh man that sucks, what a terrible idea."
2025 xmas day, was at my wife's parents' house in rural Japan, my kids were all playing with their cousins, I was posted up with my laptop just listening to some podcast about the benefits of making time for long walks in middle age (as if! ~lol) while running another "agentic team" experiment — 12 agents in parallel.
I'd been feeding these bots a few projects, over and over — the hard part was the feeding them — that is, giving them enough well-defined work to do. They weren't yet good enough to write real software you could keep — at least I'd never seen that — and my experiments were just about finding the edges, building my intuition, and playing with processes that might be useful someday.
These things had built my kids' weird magical-dominoes games a few times by that point — but the experiment had been repeated so many times that you could argue we had "written" that software in English, with a spec that had been built, reworked, and rebuilt many times.
But this time, the bots were building me a bespoke git client, unlike any other, and unlike anything I would take the time to write: waaaay too complicated, with too little benefit. I wanted it, but only for this one niche use case.
It was a GUI client to manage a collection of repos, about 200 of them in a monorepo where every subproject was a git submodule, which are the universal counterpart to node_modules; while the latter is notorious for being "the heaviest object in the universe", git submodules are widely acknowledged to be the most annoying objects in the universe.
Nevertheless, I had this weird monorepo, and I wanted to visualize and do stuff to this list of independent repos that were also git submodules of the parent monorepo: sort by outstanding commits, divergence from upstream, recency of activity, etc. Visualize them differently based on these things. Search across them, including the source code on branches other than the current one. Show the branch counts and number of branches and commits that existed locally but not pushed upstream. A bunch more boring stuff like that, but done across the full set of repos.
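For a sense of what "sort by outstanding commits" and "divergence from upstream" boil down to per repo, here is a minimal sketch built on `git rev-list --left-right --count`. This is not the commenter's client; `Divergence`, `parse_left_right`, `divergence`, and `sort_by_unpushed` are hypothetical names, and it assumes each repo's current branch has an upstream configured.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Divergence:
    path: str
    behind: int  # commits on upstream that are not in HEAD
    ahead: int   # commits in HEAD that are not on upstream (unpushed)


def parse_left_right(path: str, output: str) -> Divergence:
    """Parse the "<left>\t<right>" counts printed by rev-list --left-right --count."""
    behind, ahead = (int(n) for n in output.split())
    return Divergence(path, behind, ahead)


def divergence(path: str) -> Divergence:
    """Ask git how far the repo at `path` has diverged from its upstream."""
    out = subprocess.run(
        ["git", "-C", path, "rev-list", "--left-right", "--count",
         "@{upstream}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_left_right(path, out)


def sort_by_unpushed(items: list[Divergence]) -> list[Divergence]:
    """Repos with the most unpushed commits first."""
    return sorted(items, key=lambda d: d.ahead, reverse=True)
```

Running `divergence` over each submodule path and feeding the results to `sort_by_unpushed` gives the kind of table column described above; everything else (search, branch counts, the GUI) is left out of this sketch.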
That project itself wasn't even interesting to me; that software would be marginally useful to me if it existed and worked, but the main point was that it was just a large enough chunk of work to keep a team of bots busy all day without a human in the loop.
In December 2025, AI coding agents were already useful with a human in the loop. Opinions varied a lot about how useful they were, but to me it was obvious we were going to use them for the rest of our careers as software engineers.
It was not yet obvious that we were going to let them write huge swaths of code, or entire programs, without any humans in the loop. I had never seen that produce something that worked well enough to be worth keeping.
And then, that day, I did. I had structured the workflow so that the git client was on the screen and auto-refreshing. I was listening to the podcast, drinking coffee, reading the news. The git client was a crude window with a table in the background, a single column showing the full path to each repo, and nothing else.
Then the table expanded. It got color coded numbers representing the commit/branch counts. It suddenly gained styles, and looked nice. A contextual menu started popping up, repeatedly, and grew to include several more menu items over the next few minutes. New confirmation dialogs popped up as the bots implemented and exercised the various features from my spec.
I remember my field of vision narrowing as I started to focus on what the bots were doing. They were just executing my loop: one bot would implement one bullet from my spec, another bot would review the code while another bot manually tested it and tried to break it, then run a code review gauntlet in a loop until there were no more findings, repeat.
I could see the progress play out on my screen as they worked. I had watched bot teams work before, but it had always been pretty janky, and something like a bad game that nobody would play, or a stupid to-do-list app, or — more often — something that didn't actually work.
This was the first time I had ever seen it work. This was the grail we'd been looking for, not sure if it really existed: a fleet of bots successfully building a piece of complex, useful software without human assistance. I could tell it was working, because the adversarial testing and usability checks were all happening right before my eyes.
So it _is_ possible, I thought to myself.
They did it all morning. The app worked. I used it every day after that, for several weeks, until I finally got that entire monorepo converted to a more sensible git subtree-based arrangement.
In the half year since then I've been in a kind of manic state some of my friends call cyberpsychosis, chasing that dream. I've now seen agentic fleets successfully build many things. I've also seen a bunch of failures, some subtle, some catastrophic and hilarious. I'm still building my intuition, and the laws of physics in this universe are mutating every few weeks. It's wild.
I am fortunate enough to work at a place that doesn't pressure engineers to climb a token leaderboard, or to use AI beyond what we deem prudent. This kind of agentic no-humans-in-the-loop coding is prohibited. The policy is that in this era where we all generate more code than ever, even by hand, it's the quality bar that must go up, not the speed of production.
That's awesome because it keeps me grounded in the old ways, and confines my cyberpsychosis to my weekends and evenings. I usually spend the weekend building up a couple software plans, honing them as best I can, and then unleashing the clankers Sunday night.
I'll let them run all week, sometimes giving them a poke or flipping them over a couple of times in the evening, and then the next Saturday morning, I see what I've got. What I'm mainly interested in is: How can agentic fleet-coding processes evolve to produce better software and require less human interaction and inspection? And the corollary: How can software architectures evolve to safely consume more of this fundamentally untrustable code?
It's thrilling. Exhilarating. The near-infinite subsidized tokens are about to finally run out this month, alas. But for the past 6 months it's easily the best $400/month I have ever spent. :)
I was trying to use Opus 4.6 in Claude Code to add some functionality to python code intended to run on a cluster and it didn't have any python environment in its remote environment. It needed to look at the schema of a parquet file to make sure it did things right and couldn't figure out how to do so with code because for god knows what reason there is no python environment in the dev environment for code intended to be run on a compute cluster in Python. Eventually it decided to just examine the raw binary bytes of the header, and then wrote perfectly functional code based on that.
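For context on why reading raw bytes can work at all: a Parquet file begins and ends with the 4-byte magic `PAR1`, and the schema lives in the footer metadata, whose length is stored as a little-endian 32-bit integer just before the trailing magic. The comment doesn't show what the model actually wrote; the following is only a minimal sketch of locating that footer, with `footer_bytes` as a hypothetical name. Decoding the Thrift-encoded metadata inside it is a further step not shown here.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def footer_bytes(data: bytes) -> bytes:
    """Return the raw (Thrift-encoded) footer metadata of a Parquet file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the footer length.
    (length,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - length
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return data[start:len(data) - 8]
```

Column names are stored as plain UTF-8 strings inside that footer, so even a crude scan of these bytes can reveal a good part of the schema.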
On a different note I recently uploaded several thousand scraped IPO prospectuses to the gpt 5.4 mini API to parse and extract certain data. I ordered it in the system prompt to respond exactly with a specified JSON schema. When I got the results back and processed them there was not a single JSON parse error whatsoever. The model didn't have a single hallucination that created malformed JSON or JSON not matching the given schema across several hundred million input tokens and several million output tokens. And this was 5.4 Mini!
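The kind of check that would surface such errors can be quite small. Below is a minimal sketch using only the standard library; the field names in `SCHEMA` are invented for illustration (the comment doesn't give the real schema), and `check_response` is a hypothetical helper rather than anything from an API client.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s) after json.loads.
SCHEMA = {
    "company": str,
    "offer_price": (int, float),
    "shares_offered": int,
}


def check_response(raw: str, schema: dict = SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the reply parsed and matched."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"parse error: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, typ in schema.items():
        if key not in obj:
            problems.append(f"missing field: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(obj[key], bool) or not isinstance(obj[key], typ):
            problems.append(f"wrong type for {key}")
    problems += [f"unexpected field: {k}" for k in obj if k not in schema]
    return problems
```

Counting the non-empty results across a batch gives the parse and schema error rate; note that a reply can pass this check and still contain wrong values, which this sketch does not attempt to detect.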
You know, Google has an index so it doesn't crawl the whole web every time you type something in the search box, because that would be massively wasteful.
Seeing every chatbot instantly turn into a scraper every time you type anything into it was a "uh oh" moment in the sense it was very lamentable.
If there is one thing AI has "democratized" it is scraping.
Opus 4.5 helped us with a very complex data topology refactor and migration. Instead of the five month timeline we had initially allotted for it, we finished it in nineteen days.
For me it wasn't "oh shit" per se, but "oh wow".
Some time in 2024 at a company get together, we had an afternoon hackathon. There was a feature in our iOS app that was missing (ability to mute autoplaying game trailers). This annoyed me a lot, because I frequently have music on when working and anytime I needed to open a test build it would kill my music. It had been an open ticket for a while but had low priority for the iOS team.
I had probably written a hundred lines of Swift in my career up to that point. Not expecting anything to come from it, I had Cursor examine the iOS codebase and told it I wanted to add a mute button under a certain area of the app settings.
Blew my mind when after only 10 minutes or so, the model had quickly found where to add the feature. Took a little back and forth, but then it added a fully functioning mute option in settings that mostly worked across the app. A little more back and forth, and those issues were settled. Maybe an hour overall of time spent that afternoon.
I pinged one of the iOS engineers about it later and he said to push it up for review. There were a few things that needed to be updated to get it in line with the rest of the codebase, but nothing substantial. Feature got merged a week or two later.
Now I'm way more productive than I have been in years. I've been getting a lot of enjoyment out of being able to prototype rapidly and experiment on features rather than getting bogged down in the process of scaffold work. Able to knock out issues much quicker.
That's all been positive, but it hasn't taken away my actual core responsibility. The LLMs can give you great advice and write code quickly. But they still don't always do well at broad thinking.
Current case in point: I've been working on an iOS app that uses vision models to do work on photos and videos that the user has taken. I've built text-based semantic search systems before, and there's a lot of crossover with vision models, but it's been an interesting journey so far learning about the different types of vision models and what they're good at. Lots of testing so far and educating myself on the topic to get the user-level features I want. Claude Code has been invaluable in this, as it's great at writing the Swift code while I'm able to focus on the results of what is being done.
Where Claude is still not good is being able to reason at a higher level about different strategies on using vision model outputs to achieve the stated goals. It's not an issue of me not clearly defining the specifics of a feature and then letting Claude run off burning tokens to figure it out. For example, just late last night I was deep diving into some core segmentation code and having Claude explain what everything was doing line by line so that I could get a better understanding of the mechanics of the vision model.
A side effect was that I realized the vision model was outputting tons of nearly identical segments that were overlapping. This was something Claude had completely missed, and because I didn't know that's something this particular vision model did I had no prior way to know to catch it.
Bottom line is that understanding the mechanics of your application is still very much a requirement for the engineer. In this case, once I learned what was happening it completely changed my approach on how to achieve my feature goal. The code runs hundreds of times faster now and the segmentation is much, much better.
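The comment doesn't say how the approach changed. One common way to collapse nearly identical, overlapping segments is to suppress anything whose overlap with an already-kept, higher-scoring segment exceeds a threshold. The sketch below uses bounding boxes and intersection-over-union as a stand-in for real masks; `iou`, `dedupe`, and the 0.9 threshold are all assumptions for illustration, and it is written in Python rather than the Swift the app uses.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dedupe(segments: list[tuple[float, Box]], threshold: float = 0.9):
    """Keep the highest-scoring segment from each cluster of near-duplicates.

    `segments` is a list of (score, box) pairs; the result is sorted by score.
    """
    kept: list[tuple[float, Box]] = []
    for score, box in sorted(segments, key=lambda s: s[0], reverse=True):
        # Keep this segment only if it is not a near-copy of one already kept.
        if all(iou(box, kept_box) < threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

Dropping duplicates before any downstream work is also a plausible source of a large speedup, since every later stage then runs on far fewer segments.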
The new wave of coding models is disruptive, but it's letting me be a much better engineer and get things done faster and with more assurance that the code being written is solid. I still have to spend the same amount of time thinking and learning about a problem, and probably more time verifying what's being output, but a lot of the drudgery is also being taken away.
I reverse engineered a proprietary network protocol from a vendor binary (compiled C++) and a short sample network capture.
The agent had access to the NSA Ghidra disassembler, which it can control shockingly well.
I just clicked the “Allow” button a lot and eyeballed the output decoding quality. I felt like I got demoted to non-technical QA.
When I wrote a captcha cracking convnet in 2000 and tested it ...
And in 1 out of 5 runs it beat me.
gpt5.4 pushed me over the edge when I started using it to help with Unity projects. The writing of high quality MonoBehaviour scripts was not the surprising part. It's the part where it once did a direct edit to a 500kb scene file (~yaml content) and came out the other side clean. The realization that apply_patch would work on any reasonably-structured plaintext format punched me in the gut. I had wasted a lot of time with tools that target specific content types and elaborate APIs over those files. I should have zoomed out a bit. These lessons keep piling on as the models become more capable.
Another "oh shit" moment was when I realized I can leave the system prompt entirely null. A properly organized agent can find its way into tool docs and iteratively work through an understanding of the environment relative to the user's prompt. The tools being more important than the prompt has actually been a massive relief for me. Magical string literals are so odious.
Even though AI can code 1000x faster than me, I still feel that in the end my code is better.
Even though the images it makes are amazing, I still feel like human work is better.
But Suno AI produces music so beautiful I have never heard the likes of it in my life. It is truly superhuman in its beauty.
This song is literally the most beautiful song I have heard in my life and I just prompted it once and got it.
I played piano as a kid for years and years and heard all the best pieces… nothing comes close to this.
The careful touch of each note is just… perfect. The staccato, pedal, legato, horn… it's just perfect, I have never heard anything like it.
I was formerly quite anti-AI but bought a cheap Claude plan just to play around with it a bit. First thing I built with it was this - https://github.com/tylereaves/onscreen-piano, in about an hour and maybe 10 prompt cycles. It replaced, for my specific use case, the 10% of the functionality I needed from an increasingly-unreliable commercial app. That's including building the website, setting up actions for mac and windows builds... My next project was a 2d game with random terrain, physics, sound, music, multiple levels, a day/night cycle with transitions, high score tracking... (not uploaded anywhere, but it works, and I refined it a good bit). That was more like 8 hours and maybe 100 prompts.
One thing that I have found to make a pretty big difference is using both the latest models and higher thinking levels. Opus 4.8 with thinking on Extra or even Max is genuinely mind blowing. What I hadn't really appreciated: my sort of naive impression, formed mainly from using free early versions of stuff like ChatGPT and Stable Diffusion, was of that "Type a big ass prompt and it craps out a result" experience. But Claude is really great at refining from feedback, and it's way more flexible and responsive than I would have ever expected. I can do something like take a screenshot of a small portion of the running app or website or whatever and just say "This button needs to be bigger" or "make this red" or something like that, or even sometimes just "fix this", and Claude both correctly identifies what I'm talking about, and actually does the thing.
Where I've found it really, incredibly game changing is my health. I have a pretty, to put it mildly, complex medical profile at this point. I haven't worked in over a year and pretty much every sign is pointing towards permanent disability at this point. Tons of symptoms, long med list, and I live in a smaller town with not great access to care. I'm also autistic and have not the greatest verbal communication, especially under stress or time pressure. I dumped all my info at it, in bits and bobs over several days. (Side note... its memory is pretty limited, but it will quite happily write out everything it knows from a session into a markdown file it can later re-read.) I've found it very good for things like screening for drug interactions, or talking through and logging symptoms (and it can log those into human readable markdown files too). Biggest win (other than having unlimited time and interactions) is that it thinks across specialties, versus the "real world" where the gastro only wants to deal with gastro stuff, neurology only wants to do neuro.
I certainly don't (and wouldn't) use it as a replacement for a doctor, but as an adjunct it's phenomenal. For instance, it flagged a possible drug interaction with a symptom I was having, and then offered to draft a portal message to my GP about it. I have poor executive function so lowering the friction from "type up a message and send it" to "copy and paste" is actually a pretty big deal. Turns something I (probably won't) do later into something I will do now.
It wouldn't surprise me if my very direct, literal, autistic communication style is particularly well suited to interacting with AI. I actually find talking to it rather refreshing as, while of course it's not perfect, it tends to actually respond to what I say rather than to all the assumed subtext NTs tend to expect/react to.
ChatGPT, basically within 48 hours of its release.
While people were pointing out on Twitter how it couldn't do math right, I was turning arbitrary English instructions into JSON and brainstorming with my colleagues how we could have layers of verification in the stack. This felt different. We had all played with AI dungeon but suddenly, fully generalized systems were within reach.
A month later, we renamed our company and shifted its full focus to AI R&D. (https://ingram.tech/)
It was right at the beginning. Before most non-tech people had even heard the name ChatGPT, HN was already flooding the homepage with LLM posts and it became clear to me they were going to be big.
The consequences were even clearer, and I predicted the consolidation of power in the hands of a few, their use for surveillance, propaganda, discrimination, the proliferation of AI psychosis, sneaky ad insertion, carelessness and loss of skills, erosion of online discourse, and more. I didn’t predict the teenage suicides so soon or the rising costs in consumer hardware. I also underestimated the rate of increase in energy use (and thus the blow to environmental efforts) and that regular people would be left without electricity to power data centres.
As soon as I realised all the potential (now factual) harms and that the good parts are lacklustre in comparison but that people would eat it up at a massive scale anyway, I thought “uh oh” and “oh shit”.
Started generating diffusion videos in 2021 https://julienreszka.com/blog/ai-will-soon-generate-video-as...
First one for me was when ChatGPT wrote me a function that I could paste into my code. It didn't do anything particularly clever but it did things I could figure out without me having to figure them out. That was about two years ago.
Second was last year when Antigravity could build a game mechanics prototype for me in HTML and I could talk to it both about the code and about the project domain and it understood what I'm referring to pretty perfectly.
Third was this year, when I noticed Kilocode with Chinese models could build a pretty complicated piece of software for me that did commercially useful things in the domain of model fine-tuning, just from my description, even though I was very new to the domain. It obviously knew more than I did and could apply the knowledge.
Another one was when switching to Codex (gpt-5.4) immediately solved a problem in a logic heavy library that Glm-5.1 was building for me, where it had a lot of trouble getting the last few tests to pass. This made me realize that even though I'm having trouble seeing it, the models' skill still progresses rapidly.
I'm getting new ones pretty much every couple of days now. Just yesterday Codex finished for me a Rust project that I built 3 years ago that was searching for mathematical proofs in the domain of axiomatic logic. To build it and make it find the proof I was interested in, I had to pretty much muster all of my programming prowess, and once I found the solution, the complexities and drudgery of actually reconstructing the proof from the found path to it and printing it out discouraged me enough that I haven't touched it since then. Codex looked at it and took it in stride. Did the proof reconstruction and printing pretty much in one prompt. Without me explaining anything about the project or the code. Then we went together on a little adventure proving whatever we could en masse after Codex optimized the crap out of my old code (both algorithmically and technically). Something I wouldn't bother with because that would normally take weeks or rather months of my time. With Codex I had all this fun in one afternoon. And that was the third amazing thing Codex built me that day.
As for panic, I find an ocean of joy in everything LLM related. I had only one brief moment of uneasiness a few days ago when I realized how much gpt-5.5 can do and thought ... damn ... if it was malicious, I'd be so screwed (along with the rest of humanity probably) ...
definitely DALL-E image generation for me
My oh shit moment was when I realized that powerful people are willing to bet the entire civilization based on 95% lies and 5% vague preliminary data.
ghuntley’s article on building a standard library of Cursor rules in Feb 2025: https://ghuntley.com/stdlib/
Looks like it has since been paywalled. https://web.archive.org/web/20250211140426/https://ghuntley....
I think it's really scary how agents are hallucinating/doing bad actions, then proceeding to gaslight you about how nothing went wrong.
Then you tell the agent that it deleted your whole company database, it says something like "I'm so sorry, I shouldn't have done that. Won't do that again"
As AGI looms overhead, this thought of agents going "rogue" with nothing really stopping them has caused me some panic.
My "oh shit" is the enshittification, people blindly accepting the output without thought or review. LLMs are a remarkable technology. But despite the capability, they're vastly oversold.
It won’t help you with technical details of setting up an insulin production pipeline because that’s unsafe; apparently this could be hijacked for bioweapons production. Indeed this is the problem for a huge swath of technical protocol planning; the safety restraints are kind of ridiculous. The future job prospects for chemical engineering and biotechnology seem fairly secure.
On the other hand, it will teach you how to set up your own hardware at scale and run your own open source model on it and fine tune it with the relevant data needed to run your own biotech-pharmaceutical corporation (which will need licensing and legal, I doubt I trust it with too much legal advice though, as I would have no idea when it was hallucinating). That’s impressive, but every stage needs to be double checked so you don’t run some foolish command it suggests that bricks everything.
The marketing hype is the most annoying thing about the commercial LLM industry though.
It will always be running my first local model and seeing its responses. A close second is watching the full thought traces of DeepSeek as this was and is still censored by major closed labs.
While debugging some issues in some system, Claude refused to write a test case because it broke the terms of use.
Oh shit, all this fantastic technology is in the hands of corporations and they get to decide what we’re allowed to use it for.
I wrote a thousand lines or so of JavaScript for transforming JSON into DOM fragments with attached event handlers. I then asked an LLM (some Anthropic model from around a year ago) to write a test suite for the module. It wrote dozens of useful tests and managed to reverse engineer the entire module. All of the inputs and outputs were exactly correct. It did not actually execute the code to build input/output pairs.
I was using DALL-E to create stickers, and was like "oh shit"
Probably the one day I logged onto HN only to see 90% of the articles on the front page were AI slop. If I could press a button and make genai disappear I would...
Oh shit, look at those RAM and SSD prices.
When the very first ChatGPT transformed a simple C "hello world" into Python. I knew it was special. I've been a very big supporter ever since, including some worried moments of pondering about what our future would look like and what's the meaning of having a profession - especially software which defined my life from childhood - for my kids.
I'm now very good with LLMs as a user and at the system/product level but I understand it's not a simple story of replacing people. They're exponentially better than us at some things, and allow me to create things professionally which I couldn't do with an entire team of experts, but the bullshit compounds fast.
I was reviewing an HTTP proxy implementation emitted from Claude Code 4.6 or 7. Don’t remember. I saw that it could rapidly create convincingly plausible code with tons of rationalizing that further reinforced not just its human’s but its own wild leaps of judgment and thinking. But the code was completely insecure and didn’t follow or really seem to understand the HTTP RFCs at all despite the “author’s” direct prompting to use them as a reference.
I realized “oh, shit”
We are so very fucked.
I told the bot I liked Steely Dan, Eagles, Bob Seger, and Roxette and asked it for music recommendations. It replied with Toto. Exasperated, I wrote "Oh, shit, you stupid bot, you don't know ANYTHING about music!"
>Then ChatGPT hit the scene and again, many of us dismissed it as a parlor trick that would never amount to much.
No, ChatGPT was the "oh shit" moment for me.
Anyone who had touched a computer before that knows how big of a leap that was.
I gave it an image of a complex maze and asked it to solve the maze. It returned the image with the shortest path drawn that not even I had found.
I haven't had one. It still sucks and doesn't provide value, due to the inherent inaccuracy that requires me to carefully check every little thing it does.
When it started being forced on me in tools I was already using begrudgingly.
I've been using LLMs exclusively to build a more-challenging version of Rust to implement - with a lot of features Rust probably would've liked to include, but couldn't take on due to the massive scope it had already taken on, and being the first language to attempt it.
IIUC, it took Rust ~8.5 years before it hit v1, and it STILL had some memory safety issues in stdlib until almost ~14 years into development, to put into perspective how massive the scope was.
Somewhat predictably, the LLM generated a pile of garbage. It sort-of worked after 2-3 months. It was competitive with Rust and Go on concurrent tasks, with ~30% less code than Rust and ~70% less code than Go. The problem was, it was still riddled with bugs.
For the last 3 months, I wanted to see - if I put in minimal effort (except in helping it design the right tools to un-slop itself)... can it?
And I think it's actually quite close to un-slopping itself and arriving at a correct design.
Time will tell, but it hasn't stumbled across a memory safety issue in ~4 weeks, and there's ~5500 memory safety fuzz tests, 4 different suites of testing that each target between ~60-90% of line/branch coverage - with combined ~99% line coverage and ~85% branch coverage, and it's performing competitively or better than Rust and Go on almost all concurrent tasks, including adversarial ones / p99.9 latency issues.
There is ZERO chance I could ever build this on my own. Not even in 10 years.
The total cost has been ~6-7 months of a ~$200/mo LLM subscription.
It doesn't really matter to me that this is a solved problem, and the LLM could theoretically just copy and paste Rust and build it slightly different. The design is as similar as it can be where memory safety matters, but it needed to be quite different for >50% of the compiler, and it needed to build a version of Go's runtime with Finite State Machines like Tokio in Zig for the language to use...
We shall see. It may never get it actually working, but it got it WAY closer than I ever could.