This stuff smells like maybe the bitter lesson isn't fully appreciated.
You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there, it doesn't have to be English)
Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.
Folks have run comparisons. From a Hugging Face employee:
Codex + skills finetunes Qwen3-0.6B to +6 on HumanEval and beats the base score on the first run.
I reran the experiment from this week, but used Codex's new skills integration. Like Claude Code, Codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats Claude Code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...

That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
The instructions are standard documents, but that is not all. What the system adds is an index of all skills, built from their descriptions, that is passed to the LLM in each conversation. The idea is to let the LLM read a skill only when it is needed, rather than loading everything into context upfront. Humans use indexes too, though not in this way; there are some analogies with GUIs and how they enhance discoverability of features for humans.
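For concreteness, a SKILL.md typically opens with YAML frontmatter whose name and description are what get indexed; roughly like this (a hypothetical skill of my own invention, the body is free-form):

```markdown
---
name: pdf-search
description: Search text inside PDF files. Use when the user asks to find content in a PDF.
---

# PDF search

Run scripts/search.py <file> <query> to get matching page numbers.
Only read the rest of this file if the script fails.
```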
I wish they had arranged it around READMEs. I have a directory with my tasks and a README.md there; before Codex had skills, it already understood that it needed to read the README when dealing with those tasks. The skills system is less directory-dependent, so it is a bit more universal, but I am not sure that is really needed.
I have been using Claude Code to automate a bunch of my business tasks, and I set up slash commands for each of them. Each slash command starts by reading from a .md file of instructions. I asked Claude how this is different from skills and the only substantive thing it could come up with was that Claude wouldn't be able to use these on its own, without me invoking the slash command (which is fine; I wouldn't want it to go off and start checking my inventory of its own volition).
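(For reference, a Claude Code slash command is itself just a Markdown prompt file; a hypothetical .claude/commands/check-inventory.md, with names of my own invention:)

```markdown
Read docs/inventory-checks.md and follow its instructions to
reconcile the current inventory export against the supplier feed.
Report any discrepancies, but do not change anything.
```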
So yeah, I agree that it's all just documentation. I know there's been some evidence shown that skills work better, but my feeling is that in the long run it'll fall by the wayside, like prompt engineering, for a couple of reasons. First, many skills will just become unnecessary - models will be able to make slide decks or do frontend design without specific skills (Gemini's already excellent at design without anything beyond the base model, imho). Second, increased context windows and overall intelligence will obviate the need for the specific skills paradigm. You can just throw all the stuff you want Claude to know in your claude.md and call it a day.
> Is any of this standardization really needed?
This standardization, basically, makes a list of docs easier to scan.
As a human, you have permanent memory. LLMs don't; they have to load everything into the context, and doing that only as necessary can help.
E.g. if you had anterograde amnesia, you'd want everything to be optimally organized, labeled, etc., right? Perhaps an app which keeps all information handy.
We’re working with the models that are available now, not theoretical future models with infinite context.
Claude is programmed to stop reading after it gets through the skill’s description. That means we don’t consume more tokens in the context until Claude decides it will be useful. This makes a big difference in practice. Working in a large repo, it’s an obvious step change between me needing to tell Claude to go read a particular readme that I know solves the problem vs Claude just knowing it exists because it already read the description.
Sure, if your project happened to already have a perfect index file with a one-sentence description of each documentation file, that could serve a similar purpose (if Claude knew about it). It's worthwhile to spread knowledge about how effective this pattern is. Also, Claude is probably trained to handle this format specifically.
I'd argue we jumped that shark since the shift in focus to post training. Labs focus on getting good at specific formats and tasks. The generalization argument was ceded (not in the long term but in the short term) to the need to produce immediate value.
Now, if a format dominates, it will be post-trained for, and then it is in fact better.
Skills can contain scripts, making them a lot more versatile than just a document.
Of course any LLM can write any script based on a document, but that's not very deterministic.
A good example is Anthropic's PDF creator skill. It has the basic English instructions as well as actual Python code to generate PDFs.
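(I don't know which library Anthropic's skill actually bundles; as a flavor of what such a script looks like, here's a minimal sketch using reportlab:)

```python
# Minimal sketch of a PDF-generating script a skill might bundle.
# reportlab is my library choice, not necessarily Anthropic's.
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def make_pdf(path: str, title: str, lines: list[str]) -> None:
    c = canvas.Canvas(path, pagesize=letter)
    _, height = letter
    c.setFont("Helvetica-Bold", 16)
    c.drawString(72, height - 72, title)
    c.setFont("Helvetica", 11)
    y = height - 100
    for line in lines:
        c.drawString(72, y, line)
        y -= 16
    c.save()

if __name__ == "__main__":
    make_pdf("report.pdf", "Q3 Report", ["Revenue up 4%.", "Churn flat."])
```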
In addition to the points others have made, standardization also opens opportunities for training and RL that benefit from a consistent format.
It's all about managing context. The bitter lesson applies over the long haul - and yes, over the long haul, as context windows get larger or go away entirely with different architectures, this sort of thing won't be needed. But we've defined enough skills in the last month or two that if we were to put them all in CLAUDE.md, we wouldn't have any context left for coding. I can only imagine that this will be a temporary standard, but given the current state of the art, it's a helpful one.
I'm a little sad, in this case, that ongoing integration via fine-tuning hasn't taken off (not that I have enough expertise to know why). It would be nice, dammit, if I could give explicit guidance for new skills by day and have my models consolidate them by night!
Skills are not just documentation. They include computation (programs/scripts), data (assets), and the documentation (resources) to use everything effectively.
Programs and data are the basis of deterministic results that are accessible to the LLM.
For example, a skill can embed a SQLite database of interesting information (bus schedules, dietary info, or a thousand other things) along with a Python script that queries it.
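(A minimal sketch of that pairing, with a made-up schema:)

```python
# query_buses.py -- hypothetical script bundled with a skill, querying
# a SQLite database shipped alongside it. The schema is invented here.
import sqlite3
import sys

def next_departures(db_path: str, route: str, limit: int = 5):
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT stop, departs_at FROM schedule "
            "WHERE route = ? ORDER BY departs_at LIMIT ?",
            (route, limit),
        ).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for stop, departs_at in next_departures("assets/buses.db", sys.argv[1]):
        print(departs_at, stop)
```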
For Claude at least, the script runs in a VM, and the skill can be used from your phone.
Sure, skills are more a convention than a standard right now: they lack versioning, distribution, updates, unique naming, and selective network access. But they are incredibly useful and accessible.
On the one hand, I agree.
The whole point of LLM-based code execution is, well, I can just type in any old language it understands and it ought to figure out what I mean!
A "skill" for searching a pdf could be :
* "You can search PDFs. The code is in /lib/pdf.py"
or it could be:
* "Here's a pile of libraries, figure out which you want to use for stuff"
or it could be:
* "Feel free to generate code (in any executable programming language) on the fly when you want to search a PDF."
or it could be:
* "Solve this problem <x>" and the LLM sees a pile of PDFs in the problem and decides to invent a parser.
or any of the nearly infinite other ways of trying to get a non-deterministic LLM to do the thing you want it to do.
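(To make the first option concrete: /lib/pdf.py could be as little as this sketch - pypdf is my assumption, the option above only names the path:)

```python
# /lib/pdf.py -- sketch of what the first option's script might hold.
# The library (pypdf) is assumed; the skill text only gives the path.
from pypdf import PdfReader

def search_pdf(path: str, needle: str) -> list[int]:
    """Return 1-based page numbers whose text contains `needle`."""
    hits = []
    for i, page in enumerate(PdfReader(path).pages, start=1):
        if needle.lower() in (page.extract_text() or "").lower():
            hits.append(i)
    return hits
```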
At some level, this is all the same. At least, it rounds to the same in a sort of kinda "Big O" order-of-magnitude comparison.
On the other hand, I also agree, but I can definitely see present value in trying to standardize it, because humans want to see what is going on (see: JSON - it's highly desirable for programmers to be able to look at a string representation of data rather than send opaque binary over the wire, even though binary is a lot faster for the computer).
There is probably an argument, too, for optimization of context windows and tokens burned and all that kinda jazz. `O(n)` is the same as `O(10*n)` (where n is tokens burned or $$$ per second or context window size) and that doesn't matter in theory but certainly does in practice when you're the one paying the bill or you fill up the context window and get nonsense.
So if this is a _thoughtful_ standard that takes that kinda stuff into account then, well, great! It gives a benchmark we can improve and iterate upon.
With some hypothetical super LLM that has a nearly infinite context window and a cost/tok of nearly zero and throughput nearing infinity, you can just say "solve my problem" and it will (eventually) do it. But for now, I can squint and see how this might be helpful.
You may be right, but I find myself writing English differently depending on the audience: people vs AI.
I haven't done a formal study, so I can't prove it, but it seems like I get better output from agents if I tailor my English more towards the LLM way of "thinking".
The main thing here that would need standardisation is the environment in which the skill operates. The skill instructions are interpreted by the AI; any support scripts are interpreted by the environment.
You don't want to give an English description of LZMA compression and then let the AI do it token by token - although that would make a pretty good arduous, methodical benchmark task for an AI.
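(The script version is a stdlib three-liner the environment can run deterministically - a sketch with Python's lzma module:)

```python
# Let the environment do the compressing, not the model.
import lzma

data = b"an English description of LZMA is no substitute" * 100
compressed = lzma.compress(data)
assert lzma.decompress(compressed) == data
print(len(data), "->", len(compressed), "bytes")
```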
It's not about instructions, it's about discoverability and data.
Yeah, the WWW is really just text, but that doesn't mean you don't need HTTP + HTML and a browser/search engine. Skills are just that, but for agent capabilities.
Long term you're right though, agents will fetch this all themselves. And at some point they will not be our agents at all.
This is pushed by Anthropic; OpenAI doesn't seem to care much about "skills". Maybe Anthropic is doing some extra training to better follow sections of text marked as skills, who knows? Or maybe it's just that you can store what worked as a skill and share it with others, so they don't need to write their own prompts for common tasks.
I agree with this, and it's a conversation I've struggled to have with coworkers about using these.
IMO it's great if a plugin wants to have its own conventions for how to name these files, where to put them, and their general structure. I get the sense it doesn't matter much to agents (talking mostly Claude here), and the way I use it, I essentially give it its own "skills" based on my own convention. It's very flexible and seems to work. I don't use the slash commands; I just script with prompts into the claude CLI mostly, so if that's the only thing I gain from it, meh. I do see other comments speculating that these skills work more efficiently, but I'm not sure I've seen any evidence for that. Like a sibling comment noted, I can just re-feed the skill knowledge back into the prompt.
You're right that it's just natural language, but standardization is very important, because it's never just about the model itself: the so-called harness is a big factor in LLM performance, and standardization allows every harness to index all skills.
yeah, the boon of LLMs is how they give a masked incentive for every Jane and Joe to be an intentional communicator.
Post training can make known formats more reliable.
I’ve been scratching my head on this one too. You’re probably right about the bitter lesson... at the end of the day, plain English instructions in the context window are what do the heavy lifting.
That said, I reckon that’s actually what this project is trying to lean into. It looks like it's just standardising where those instructions live (the SKILL.md format) so tools can find them, rather than trying to force a new schema.
Fair play to them for trying to herd the cats. I think there's an xkcd comic for this one somewhere.
Skills are for the most part already generated by LLMs. And, if you're implementing them in your own workflow, they're tailored to real-world problems you've encountered.
Having a super repo of everyone else's slop is backwards thinking; we are now in an era where creating written content and verifying its effectiveness is easier than ever.
what a great comment
I share your skepticism and think it's the classic pattern playing out, where people map practices of the previous paradigm to the new one and expect it to work.
Aspects of it will be similar, but it trends toward disruption as it becomes clear the new paradigm just works differently (for both better and worse) and practices need to be rethought accordingly.
I actually suspect the same is true of the entire 'agent' concept, in truth. It seems like a regression in mental model about what is really going on.
We started out with what I think is a more correct one which is simply 'feed tasks to the singular amorphous engine'.
I believe the thrust of agents is anthropomorphism: trying to map the way we think about AI doing tasks to existing structures we comprehend like 'manager' and 'team' and 'specialisation' etc.
Not that it's not effective in cases, but just probably not the right way to think about what is going on, and probably overall counterproductive. Just a limiting abstraction.
When I see, for example, large consultancies talking about things they are doing in terms of X thousands of agents, I really question what meaning that has in reality, and whether it's rather just a mechanism to make the idea fundamentally digestible and attractive to consulting-service buyers. Billable hours tied to concrete entities, etc.