There are so many of these "meta" frameworks going around. I have yet to see one that proves in any meaningful way that it improves anything. I have a hard time believing they accomplish anything other than burning tokens and poisoning the context window with too much information. What works best IME is keeping things simple and clear, providing only the essential information for the task at hand, and iterating in manageable slices rather than trying to one-shot complex tasks. Just Plan, Code and Verify, simple as that.
In my view, Spec-Driven systems are doomed to fail. There's nothing that couples the English-language specs you've written with the actual code and behaviour of the system - unless your agent is being insanely diligent and constantly checking whether the entire system aligns with your specs.
This has already been solved - by automated testing. Tests encode the behaviour of the system into executables that actually tell you whether your system aligns or not.
Better to encode the behaviour of your system into real, executable, scalable specs (aka automated tests), otherwise your app's behaviour is going to spiral out of control after the Nth AI generated feature.
The way to ensure this actually scales with the firepower LLMs have for writing implementation is to enforce a workflow where the agent knows how to test, writes the tests first, and verifies via mutation testing that the tests actually reflect the behaviour of the system.
I've scoped this out here [1] and here [2].
[1] https://www.joegaebel.com/articles/principled-agentic-softwa... [2] https://github.com/JoeGaebel/outside-in-tdd-starter
This pile of Markdown files has the most cringe-inducing name I have seen in weeks.
I have an AI system I use. I'd like to release it so others can benefit, but at the same time it's all custom to myself and what I do and work on.
If I fork out a version for others that is public, then I have to maintain that variation as well.
Is anyone in a similar situation? I think most of the ones I see released are not particularly complex compared to my system, but at the same time I don't know how to convey how to use it, since I'm the only person who uses it.
It feels like I don't want anyone to run my system; I just want people to point their AI system at mine and ask it what might be valuable to add to their own.
I don't want to maintain one for people. I don't want to market it as some magic cure. Just show patterns that others can use.
I used this for a team hackathon and it took way too much time to build understanding of the codebase, wrote too many agent transcripts and spent way too many tokens during generation. It also failed multiple times when either generating an agent transcript or extracting things from one - once citing "The agent transcripts are too complex to extract from" - quite confounding considering it's the transcript it created. For what we were trying to build - a few small sets of features - using GSD was overkill. The idea was to get some learning on whether GSD could be useful - for our case it was a strong no. Learning for me: don't overcomplicate - write better specs, use Claude plan mode, iterate.
I've had a good experience with https://github.com/obra/superpowers. At first glance this looks similar. Has anyone tried both who can offer a comparison?
Nice, I like the UI more than mine. I built a similar tool out of minor frustrations with some design choices in Beads; mine uses SQLite exclusively instead of git or flat files. I've been using it for all my personal projects, but haven't gone back to refine what I have a little more. One thing a lot of these don't do that I added to mine is syncing to and from GitHub. I want people to see exactly what my local tasks are, and to be able to pull one down to work on if they need to.
I think the secret sauce is to talk to the model about what you want first and make the plan; then, when you feel good about the spec, regardless of tooling (you can even just use a simple markdown file!), you have it work on it. Since it always has a file to go back to, it can never 'forget'; it just needs to remember to review the file. The more detail in the file, the more powerful the output.
Tell your coding model: how you want it, what you want, and why you want it. It also helps to ask it to poke holes and raise concerns (bypass its overly agreeable nature so you don't waste time on things that are too complex).
I love using Claude to prototype ideas that have been in my brain for years, and they wind up coming out better than I ever envisioned.
I've been using GSD extensively over the past 3 months. I previously used speckit, which I found lacking. GSD consistently gets me 95% of the way there on complex tasks. That's amazing. The last 5% is mostly "manual" testing. We've used GSD to build and launch a SaaS product including an agent-first CMS (whiteboar.it).
It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.
The only tool you need is the one that saves tokens... the one that saves tokens... the one that saves tokens. Currently I don't know of any.
Claude Code itself consumes a lot of tokens when not needed. I have to steer it a lot while building large applications.
I tried it once; it was incredibly verbose, generating an insane amount of files. I stopped using it because I was worried it would not be possible to rapidly, cheaply, and robustly update things as interaction with users generated new requirements.
The best way I have today is to start with a project requirements document and then ask for a step-by-step implementation plan, and then go do the thing at each step but only after I greenlight the strategy of the current step. I also specify minimal, modular, and functional stateless code.
I've compared this to superpowers and the classic PRD -> task generator, and I came away convinced that less is more. At least at the moment. GSD performed well, but took hours instead of minutes. Having a simple explanation of how to create a PRD, followed by a slightly more technical task list, performed much better. It wasn't that GSD or superpowers couldn't find a solution; it's just that they did it much slower and with a lot more help. For me, the lesson was that the workflow has changed and we can't apply old project-dev paradigms to this new/alien technology. There's a new instruction manual and it doesn't build on the old one.
Has anything like this been built?
I want a system that enforces planning, tests, and adversarial review (preferably by a different company's model). This is more for features, less for overall planning, but a similar workflow could be built for planning.
1. Prompt
2. Research
3. Plan (including the tests that will be written to verify the feature)
4. Adversarial review of the plan
5. Implementation of tests; CI must fail on the tests
6. Adversarial review verifying that the tests match the plan
7. Implementation to make the tests pass
8. Adversarial PR review of the implementation
I want to be able to check on the status of PRs based on how far along they are, read the plans, suggest changes, read the tests, suggest changes. I want a web UI for that, I don't want to be doing all of this in multiple terminal windows.
A key feature that I want is that if a step fails, especially because of adversarial review, the whole PR branch is force-pushed back to the previous state. Say #6 fails: #5 is re-invoked with the review information. Or if I come to the system and a PR is at #8 and I don't like the plan, then I make some edits to the plan (#3), the PR is reset to the git commit after the original plan, and the LLM is re-invoked with either my new plan or, more likely, my edits to the plan; then everything flows through again.
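The reset-on-failed-review flow could be sketched roughly like this (stage names and the in-memory state dict are invented for illustration; a real system would snapshot git commits and use reset --hard plus a force push instead of copying dicts):

```python
# Hypothetical sketch of a pipeline where a failed review rolls the PR
# back to the snapshot taken before the previous stage, then re-invokes
# that stage with the reviewer's feedback attached.

STAGES = [
    "prompt", "research", "plan", "plan_review",
    "write_tests", "test_review", "implement", "pr_review",
]

def run_pipeline(run_stage):
    """run_stage(stage, state) -> (ok, feedback)."""
    state = {}
    snapshots = []                       # snapshots[k]: state before STAGES[k]
    i = 0
    while i < len(STAGES):
        if i == len(snapshots):
            snapshots.append(dict(state))
        ok, feedback = run_stage(STAGES[i], state)
        if ok:
            state[STAGES[i]] = "done"
            i += 1
        else:
            i = max(i - 1, 0)            # step back to the stage under review
            state = dict(snapshots[i])   # the "git reset --hard" moment
            state["feedback"] = feedback # re-invoke with the review notes
            snapshots = snapshots[:i]    # that stage re-snapshots on retry
    return state

# Happy path: every stage passes on the first try.
final = run_pipeline(lambda stage, state: (True, None))
assert all(final[s] == "done" for s in STAGES)
```

Keeping one snapshot per stage is what makes the "edit the plan at #3 and let everything flow through again" behaviour cheap: you discard later snapshots and replay.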
I want to be able to sit down, tend to a bunch of issues, then come back in a couple of hours and see progress.
I have a design for this of course. I haven't implemented it yet.
I use openspec and love it. I'm doing 5-7x with close to 100% of the code AI-generated, and shipping to production multiple times a day. I work on a large SaaS app with hundreds of customers. Wrote something here:
https://zarar.dev/spec-driven-development-from-vibe-coding-t...
I like openspec, it lets you tune the workflow to your liking and doesn’t get in the way.
I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.
I think the point of any spec-driven framework is that you want to eventually own the workflow yourself, so that you can constrain code generation on your own terms.
I have been using this a lot lately and ... it's good.
Sometimes annoying - you can't really fire and forget (I tend to regret skipping discussion on any complex tasks). It asks a lot of questions. But I think that's partly why the results are pretty good.
The new /gsd:list-phase-assumptions command added recently has been a big help there to avoid needing a Q&A discussion on every phase - you can review and clear up any misapprehensions in one go and then tell it to plan -> execute without intervention.
It burns quite a lot of tokens reading and re-reading its own planning files at various times, but it manages context effectively.
Been using the Claude version mostly. Tried it in OpenCode too, but it's a bit buggy there.
They are working on a standalone version built on pi.dev https://github.com/gsd-build/gsd-2 ...the rationale is good, I guess, but it's unfortunate that you can't then use your Claude Max credits with it, as it has to use the API.
I tried this for a week and gave up. Required far too much back and forth. Ate too many tokens, and required too much human in the loop.
For this reason I don’t think it’s actually a good name. It should be called planning-shit instead. Since that’s seemingly 80%+ of what I did while interacting with this tool. And when it came to getting things done, I didn’t need this at all, and the plans were just alright.
I gave it a shot, but won't be using it going forward. It requires a waterfall process, and I found it difficult, and in some cases impossible, to adjust phases/plans when bugs or changes in features arise. The execution prompts didn't do a good job of steering the code to be verified while coding, and relied on the user to manually test at the end of each phase.
I built a similar system myself, then ran evals on it and found that the planning ceremony is mostly useless: Claude can deal with simple prose, item lists, checkbox todos - anything works. The agent won't be a better coder for how you deliver your intent.
But what does make a difference is running plan-review and work-review agents; they fix issues before and after the work. Both pull their weight, but the plan-review one is the most surprising. The work-review judge reliably finds bugs to fix, but its insights are less surprising. They should run as separate subagents, not the main one, because they need a fresh perspective.
Other things that matter: 1. testing enforcement, 2. cross-task project memory. My implementation of memory combines capturing user messages with a hook, an append-only log, and a compressed memory state of the project, which gets read before work and updated after each task.
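That memory scheme could be sketched in a few lines (file locations and the shape of the "compressed state" are made up here; a real hook would be wired into the agent's message events):

```python
# Sketch: append-only message log + small compressed project state that
# is read before work and rewritten after each task.
import json
import os
import tempfile

LOG = os.path.join(tempfile.gettempdir(), "project_memory.jsonl")
STATE = os.path.join(tempfile.gettempdir(), "project_state.json")

def on_user_message(text: str) -> None:
    """Hook: append every user message to an append-only JSONL log."""
    with open(LOG, "a") as f:
        f.write(json.dumps({"msg": text}) + "\n")

def read_state() -> dict:
    """Read the compressed memory before starting a task."""
    if not os.path.exists(STATE):
        return {"facts": []}
    with open(STATE) as f:
        return json.load(f)

def update_state(new_fact: str, max_facts: int = 20) -> None:
    """After each task, fold a new fact in and keep the state bounded."""
    state = read_state()
    state["facts"] = (state["facts"] + [new_fact])[-max_facts:]
    with open(STATE, "w") as f:
        json.dump(state, f)
```

The log never loses anything, while the bounded state is what actually gets injected into context - the compression step (here just "keep the last N facts") is where an LLM summarizer would plug in.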
Built my first SaaS as a frontend dev with no backend experience using a similar approach. The key shift was treating Claude Code as a senior developer who needs clear specs, not a magic box. The more precise the context and requirements, the better the output. Vague prompts produce vague code.
If you want some context about spec-driven development and how it could be used with LLMs, I recommend [1]. Having that background helps me understand tools like this a bit more.
[1] https://www.riaanzoetmulder.com/articles/ai-assisted-program...
GSD has a reputation for being a token burner compared to something like Superpowers. Has that changed lately? Always open to revisiting things as they improve.
> If you know clearly what you want
This is the real challenge. The people I know who jump around to new tools have a tough time explaining what they want, and thus how the new tool is better than the last one.
> GSD is designed for frictionless automation. Run Claude Code with: claude --dangerously-skip-permissions
Is this supposed to run in a VM?
My experience with this library has been underwhelming sadly. I have a better experience going raw with any cli agent
Apart from GSD and superpowers, there's another system called PAUL [1]. It apparently requires fewer tokens than GSD, as it does not use subagents but keeps everything in one session. A detailed comparison with GSD is part of the repo [2].
[1] https://github.com/ChristopherKahler/paul
[2] https://github.com/ChristopherKahler/paul/blob/main/PAUL-VS-...
I’ve been using GSD for all my dev projects
Honestly a fantastic harness right out of the box. Give it a good spec and it can easily walk you through fairly complex apps
I think the research / plan / execute idea is good but feels like you would be outsourcing your thinking. Gotta review the plan and spend your own thinking tokens!
I tried this but it creates a lot of content inside the repository and I don't like that. I understand these tools need to organize their context somewhere to be efficient but I feel that it just pollutes my space.
If multiple people work with different AI tools on the same project, they will all add their own stuff in the project and it will become messy real quick.
I'll keep superpowers, claude-mem, context7 for the moment. This combination produces good results for me.
I tried it after watching the video demo from the repo creator, and it looked quite impressive at first. And I decided to rebuild my side project with this, but after a few days I realized that it was not for me. It's way too much of a black box for me as an engineer, not a prompter.
The spec-first approach is underrated. Treating the spec as a living artifact the AI can reference across sessions is something I've been experimenting with too. The main challenge is keeping specs short enough to actually stay current.
I'm still stuck on superpowers. Can't seem to get better plans out of native claude planning - superpowers ensures I have a reviewed design that actually matches my mental model. Typical claude planning doesn't confirm assumptions sufficiently for my weak brain dumps/poorly spec'd tickets.
I could not produce useful output from this. It was useful as a rubber duck because it asks good motivating questions during the plan phase, but the actual implementation was lacklustre and not worth the effort. In the end, I just have Claude Opus create plans, and then I have it write them to memory and update it as it goes along and the output is better.
There should be an "Examples" section in projects like this one to show what has actually been made using it. I scrolled to the end and was really expecting an example the way it's being advertised.
If it was game engine or new web framework for example there would be demos or example projects linked somewhere.
I haven't read everything in here but think this will be very useful going forwards. love the name btw! GSD!
The spec-driven approach resonates. I've found that the quality of the initial context you feed to AI coding tools determines everything downstream. Vague specs produce vague code that needs constant correction.
One pattern that's worked well for me: instead of writing specs manually, I extract structured architecture docs from existing systems (database schemas, API endpoints, workflow logic) and use those as the spec. The AI gets concrete field names, actual data relationships, and real business logic — not abstractions. The output quality jumps significantly compared to hand-written descriptions.
The tricky part is getting that structured context in the first place. For greenfield projects it's straightforward. For migrations or rewrites of existing systems, it's the bottleneck that determines whether AI-assisted development actually saves time or just shifts the effort from coding to prompt engineering.
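The extraction step can be surprisingly mechanical for databases. A rough sketch of dumping a schema into markdown the agent can use as concrete context (using an in-memory SQLite database as a stand-in for a real system):

```python
# Sketch: turn an existing database schema into a structured spec
# (concrete table/column names rather than hand-written abstractions).
import sqlite3

def schema_to_markdown(conn: sqlite3.Connection) -> str:
    lines = ["# Data model (extracted)"]
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        lines.append(f"\n## {table}")
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _, col, ctype, _notnull, _default, pk in conn.execute(
            f"PRAGMA table_info({table})"
        ):
            flags = " (pk)" if pk else ""
            lines.append(f"- {col}: {ctype or 'ANY'}{flags}")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
print(schema_to_markdown(conn))
```

The same idea extends to API route tables or workflow definitions; the point is that the spec is derived from the system rather than written from memory.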
if you want to charge for this, or even if you don't and you want people in old & boring companies to use it, imagine a developer or engineer having this conversation with management/bureaucrats:
"I want to use 'get shit done' as part of my project"
These days, it's not a big deal at all at most places. But there are places where it will raise an eyebrow. I'm not saying change its name, and you've probably considered this already, but I would suggest making the meaning of GSD tongue-in-cheek, perhaps? Whatever, a kick-ass project either way.
You are missing one important bit. Semantic Gravity Sieves. Important data in the metadata collapses together, allowing grouped indexing. Something like a DAG allows the logic to be addressed consistently.
I'm curious if anyone has used this (or similar) to build a production system?
I'm facing increasing pressure from senior executives who think we can avoid the $$$ B2B SaaS by using AI to vibe code a custom solution. I love the idea of experimenting with this but am horrified by the first-ever-case being a production system that is critical to the annual strategic plan. :-/
How come we have all these benchmarks for models, but none whatsoever for harnesses / whatever you'd call this? While I understand assigning "scores" is more nuanced, I'd love to see a website with a catalog of prompts and the outputs produced by different configurations of model + harness in a single attempt.
250K lines in a month — okay, but what does review actually look like at that volume?
I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.
You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.
All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
I've tried it, and I'm not convinced I got measurably better results than just prompting claude code directly.
It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.
With the coding slot machine, I prefer to move fast and start over if anything goes off track. Maybe the number of tokens spent across several iterations is similar to using a more carefully planned system like GSD.
This looks like moving context from prompts into files and workflows.
Makes sense for consistency, but also shifts the problem:
how do you keep those artifacts in sync with the actual codebase over time?
it is very hard for me to take seriously any system that is not proven for shipping production code in complex codebases that have been around for a while.
I've been down the "don't read the code" path and I can say it leads nowhere good.
I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in large-10-year-old-codebase"
I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.
I’ve tried GSD several times. I actually like the verbosity and it’s a simple chore for Claude to refresh project docs from GSD planning docs.
Like most spec driven development tools, GSD works well for greenfield or first few rounds of “compound engineering.” However, like all others, the project gets too big and GSD can’t manage to deliver working code reliably.
Agents working GSD plans will start leaving orphans all over; it won't wire them up properly, because verification stages use simple lexical tools to search the code for implementation facts. I tried giving GSD some AST-aware tools, but good luck getting Claude to use them reliably.
Ultimately I put GSD back on the shelf and developed my own "property graph" based planner that is closer to Claude plan mode, except the design source of truth is structured properties, not markdown. My system generates user docs from the graph. Agents only get tasked as my graph closes nodes and re-sorts around invariants; then they are tasked directly.
"I am a super productive person that just wants to get shit done"
Looked at profile, hasn't done or published anything interesting other than promoting products to "get stuff done"
This is like the TODO list book gurus writing about productivity
Unbelievably slow, not worth it at all.
At the risk of sounding stupid what does the author mean by: “I’m not a 50-person software company. I don’t want to play enterprise theatre.” ?
I was using this and superpowers, but eventually Plan mode became enough and I prefer to steer Claude Code myself. These frameworks are great for fire-and-forget tasks, especially when there is some research involved, but they burn 10x more tokens in my experience. I was always hitting the Max plan limits for no discernible benefit in the outcomes I was getting. But this will vary a lot depending on how people prefer to work.