On March 3, 2026, Anthropic updated a tool called skill-creator without much fanfare. But if you use Claude for real work, especially if you've built any custom skills, this update changes how you know whether your setup is working.
The short version: before this update, there was no good way to test whether a skill was doing its job beyond trial and error. You'd build one, try it a couple of times, decide it seemed fine, and move on. If something broke later, say, after a model update, you often wouldn't find out until a deliverable came out wrong.
The update fixes that. It includes real tests, scored results, blind comparisons, and an automated tool that improves the part of the skill that controls whether it loads at all. None of it requires coding.
Note: This guide is primarily geared toward users building skills in Claude.ai or Claude Cowork. Claude Code skills work largely the same, but unfortunately, Claude Code can’t read the skills you have in Claude.ai or Cowork, and vice-versa.
Skill
A folder you upload to Claude that contains a text file (called SKILL.md) with instructions for a specific task. It teaches Claude how to do something your way. A skill might tell Claude to always write LinkedIn posts in a certain voice, or to follow your company's specific proposal structure.
Skill-creator
A built-in tool that helps you build, test, and improve skills. You access it by asking Claude to use it. No installation needed on Claude.ai or Cowork. Add it to Claude Code by pointing it at the skills GitHub repo (github.com/anthropics/skills).
Eval (evaluation)
A test for your skill. You write a few realistic prompts — the kind of thing you'd actually ask Claude to do — and Claude runs them twice: once with your skill loaded and once without. Both outputs get scored against assertions you define (or that skill-creator drafts for you), then displayed in a browser-based viewer where you review the actual outputs and leave feedback. This built-in comparison tells you whether the skill is improving Claude's output or just adding noise. Best practice: use real prompts from your day-to-day work rather than invented examples.
Benchmark
Running your eval suite and saving the scored results — pass rate, time, and token usage — as a snapshot. A benchmark is just your evals with numbers attached. You rerun it after model updates or skill edits to catch regressions.
Blind A/B comparison
An automated, unbiased comparison between two outputs. A separate Claude instance reviews both without knowing which is which, picks a winner, and explains why. You can compare skill vs. no skill, two versions of the same skill, two different skills, or the same skill on two different models. Requires subagents, so it's available in Cowork and Claude Code only.
Trigger / triggering
When Claude decides to load and use a skill. Claude reads the short description at the top of each skill file and decides whether to activate it based on what you're asking. A skill that triggers at the wrong time, or never triggers, is a broken skill regardless of how good the instructions inside it are.
Description optimization
An automated process that rewrites and tests different versions of your skill's description until it finds one that triggers reliably.
Pass rate
The percentage of your evals that a skill passes. If you wrote 5 tests and the skill passes 4, your pass rate is 80%.
SKILL.md
The actual text file that lives inside a skill folder. It contains two parts: a short block of metadata at the top (called frontmatter) that tells Claude when to load the skill, and the full instructions below it. Every skill requires exactly one SKILL.md file, named exactly that, capitalization included.
Frontmatter
The block of metadata at the very top of a SKILL.md file, wrapped between two sets of --- dashes. It includes the skill's name and description at minimum. Claude reads only the frontmatter of every available skill on each message to decide which ones to load, and the rest of the file stays unloaded until it's needed.
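As a sketch (the skill name and description here are invented for illustration), the frontmatter block sits at the very top of SKILL.md like this:

```yaml
---
name: linkedin-post
description: Writes LinkedIn posts in our brand voice. Use when the user asks for a LinkedIn post or other short-form social copy.
---
```

Everything below the closing dashes is the instruction body, which stays unloaded until the skill actually triggers.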
scripts/ folder
An optional folder inside a skill that holds executable code (Python, Bash, etc.). Scripts are useful for validation steps that need to be deterministic. A prose instruction leaves room for interpretation; a script just runs the check and returns pass or fail. Claude executes scripts when the skill instructs it to. Requires code execution to be enabled in Claude.ai and Cowork; runs natively in Claude Code.
references/ folder
An optional folder inside a skill for supporting documentation, API guides, example libraries, style references, anything Claude only needs for specific subtasks. Claude loads these files on demand rather than upfront, which keeps the skill from consuming unnecessary context on every activation.
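Putting the pieces together, a typical skill folder (all names here are illustrative) might look like:

```
proposal-writer/
├── SKILL.md            # required: frontmatter + instructions
├── scripts/
│   └── validate.py     # optional: deterministic checks
└── references/
    └── style-guide.md  # optional: loaded on demand
```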
Subagent
A separate, independent Claude instance that runs in its own clean context. Subagents are used to parallelize work, for example, running multiple evals simultaneously without any context bleed between them. Subagents are available in Cowork and Claude Code, but not in standard Claude.ai.
Skill chaining
Instructing one skill to invoke another as a step in its workflow. For example, a LinkedIn post skill that tells Claude to run a brand voice skill on the draft before returning it to you. This lets you build modular workflows where each skill handles one job and passes the result to the next.
MCP (Model Context Protocol)
An open standard that connects Claude to external tools and services: Notion, Asana, Linear, or your own internal systems. Skills teach Claude how to accomplish a workflow. MCP gives Claude the tool access needed to execute it.
CI system (Continuous Integration)
An automated pipeline that runs tests whenever code or configuration changes. In the context of skills, this means automatically running your benchmark suite after each model update or skill edit, so you catch regressions without having to remember to run them manually. Supported via the Claude API and Claude Code; not available in Claude.ai or Cowork.
Eval viewer
A browser-based HTML interface that displays your eval results after a test run. Shows each test case one at a time: the input given to the skill and the output it produced. You type feedback directly into the viewer for each case, then click "Submit All Reviews."
The update adds three things:
1. Evals with built-in comparison — Automated tests with scored results. Each test runs the same prompt with and without your skill, then displays both outputs in a browser-based viewer where you review the actual results and leave feedback. The scores tell you where the skill passed and where it fell short; your review tells Claude what to fix next. If Claude performs just as well without the skill loaded, the skill may be outdated or unnecessary.
2. Blind A/B testing — An automated, unbiased comparison where a separate Claude instance reviews two outputs without knowing which came from which version, then picks a winner. Works for comparing any two versions: skill vs. no skill, old vs. new, or even different skills. Available in Cowork and Claude Code only.
3. Description optimization — An automated process that rewrites and tests the part of the skill that controls whether it fires at all. Anthropic ran this across their own six built-in document skills and saw improved triggering on five of the six.
The good news: The updated skill-creator tool is available in your Claude instance without you needing to do a thing. Anthropic manages that for you.
The bad news: The update isn’t automated in Claude Code, so you’ll need to point it to the skill-creator GitHub repo.
Skills look and sound very fancy. But good skills are shockingly easy to build, especially if you already have documented processes or example outputs.
(Great skills require patience for refinement, but very little technical savvy.)
To start, either ask Claude directly: "Use the skill-creator to build a new skill" — or go to Customize and create a new skill from there.
Don't build a skill without it. The skill-creator handles the structure, generates test cases, and manages the iteration loop for you. Without it, you're writing a SKILL.md file by hand without Claude’s built-in best practices.
Before you describe your skill, think through what it actually does. For each step the skill should handle, ask yourself: What's the action? What does "good" look like? What inputs does it need? If you can answer those, Claude will build a much better skill on the first try. If you can't, that's a sign you need to map out the process first. The skill-creator will help you think through it, but you'll get further faster if you've done some pre-thinking (which you can do with Claude before you invoke the skill-creator).
Once you’ve done that, tell Claude what the skill should do. Explain it in plain language, paste in an example of work you'd like Claude to replicate, or upload a document showing the kind of output you're after. Claude asks clarifying questions, then drafts the skill for you.
After Claude drafts your skill, run evals immediately. Write three to five test prompts and run them. Each test runs the prompt twice — once with your skill and once without — so you can see whether the skill is actually improving things. Open the eval viewer to read through the actual outputs side by side. The scores tell you whether something failed; the viewer tells you why. You want a clear gap. The skill version should be noticeably better at the specific thing it's built for.
Once you know what to fix, revise and rerun. Two or three rounds before a skill becomes reliable is normal.
Once the skill content is solid, optimize the description so it loads reliably. Get the instructions right first, then tune the trigger. Ideally do this in Cowork or Claude Code to take advantage of the automated description optimization capability (Claude.ai does not have access). More on this below.
• Claude.ai and Cowork: Already available. Ask Claude to use the skill-creator.
• Claude Code: Grab it from github.com/anthropics/skills
• Official announcement: claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills
You have skills uploaded and working, or so you think. The new tools let you verify that properly.
Say: "Use the skill-creator to edit [your skill name]."
You can also edit by clicking Customize > [Relevant Skill], then selecting the three dots at the upper right-hand corner of your screen. By default, this will open up a new chat with Claude Sonnet 5.4. If you want to use Opus 5.4, you need to ask to edit the skill via chat.
Say: "Use the skill-creator to run evals on [your skill name]." Skill-creator generates test cases, runs each prompt twice — once with your skill and once without — grades the outputs, and opens a viewer. Each test case shows the skill's output alongside vanilla Claude's output, so you can see whether the skill is actually helping.
For each case, type your feedback directly in the "Your Feedback" box on the page. If the output looks right, leave it blank. If something went wrong, describe what you'd want corrected.
When you've reviewed all cases, click "Submit All Reviews." If you're using Claude.ai, this downloads a feedback.json file to your computer. Bring that file back to Claude. Paste the contents or upload it, and skill-creator reads it to inform the next round of edits. Repeat until the skill passes.
If you're using Cowork, you can review the evals directly in the app, and then just let Cowork know you've finished reviewing.
If Claude without the skill produces equally good or better output, your skill has likely been made redundant by a model update. That happens regularly, and it's not a failure — it means you can simplify your setup.
Evals show you both outputs side by side and let you judge. A blind A/B comparison automates that judgment: a separate, independent Claude instance reviews both outputs without knowing which came from your skill and which didn't, then picks a winner and explains why.
This is useful beyond just skill-vs-no-skill. You can compare two versions of the same skill, two different skills that do the same thing, or the same skill running on two different models. Tell Claude: "Run a blind A/B comparison on my [skill name]."
Claude will ask what you want to compare and suggest test cases. If you have real prompts, share those to help Claude refine its tests.
Anthropic ships new Claude models regularly. When one drops, rerun your evals. Skills that worked cleanly on the previous model sometimes behave differently on a newer one. So as a best practice: new model update = retest all your skills.
Every skill has a short description at the top ("frontmatter") that Claude reads on every message to decide whether to load it. Think of it like a book jacket — Claude skims all the jackets and picks the ones that seem relevant. If the jacket is vague, Claude skips it. If it's too broad, Claude loads it for things it shouldn't.
Description optimization is an automated trial-and-error process that rewrites that jacket until Claude makes the right call reliably. Behind the scenes, skill-creator drafts a candidate description, tests whether Claude triggers the skill on the right prompts (and leaves it alone on the wrong ones), scores the result, and repeats for several rounds.
The whole process runs automatically in the background. You get a live report in your browser that updates as each round finishes, showing scores and the description tried at each step. At the end, you get a before/after comparison and the winning description gets applied to your skill.
Why it matters: Anthropic ran this on their own six built-in document skills and saw improved triggering on five of the six. A skill with great instructions but a bad description is a skill that never loads when you need it — which is the same as not having the skill at all.
Where it works: Cowork and Claude Code only. It requires Claude's command-line tool running in the background, which isn't available in standard Claude.ai.
Skills don’t automatically update.* When Claude is done making changes, it will generate a packaged copy of the skill for you to review.
Before you click “Copy to your skills”, double check that the skill contains all the right files by clicking on it. Frequently, you will have reference files as part of your skill. Occasionally, Claude repackages the skill incorrectly and leaves those files out.
Consider downloading and saving the previous version of your skill first; once you replace it, the older version can’t be restored.
If your skill has reference files or a script, don’t use the built-in download button; that only downloads the SKILL.md file. Ask Claude to package all the files into a zip file, then double check that the zip contains everything.
*Unless you’re building skills in Claude Code.
Anthropic offers three main ways to use Claude, and they're not the same under the hood.
Claude.ai is the standard web and mobile chat interface, and what most people use day to day. Skills work here, evals work here, and the skill-creator is built in. The one thing to know: code execution (running actual scripts) requires you to turn it on under Settings > Features. If it's off, Claude can read a script but can't run it.
Cowork is Claude's desktop app built for non-developers who want automation without writing code. It has everything Claude.ai has, plus subagents — meaning it can spin up multiple independent Claude instances in parallel to run tasks simultaneously. That's relevant for evals specifically.
Claude Code is the command line tool, built for developers. Scripts run natively here with no settings toggle required. It also has subagents, full filesystem access, and supports wiring skill benchmarks into CI systems.
The tips below call out when something works differently depending on which surface you're on.
When you write instructions in a SKILL.md file, Claude interprets them. That means there's always some flex in how they get applied. For the bulk of instructions, that's fine. For validation steps that have to be exactly right every time, it's a problem.
The fix is to put those checks in an actual script instead. A Python or Bash file has no interpretive flexibility. It checks the thing, returns a pass or fail, and Claude stops or continues accordingly. If a required field is missing, the script catches it. If a file format is wrong, the script catches it. Claude doesn't have to make a judgment call.
You bundle the script in a scripts/ folder inside your skill, and reference it from your SKILL.md instructions. When the skill runs, Claude executes the script and acts on the result.
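Here's a minimal sketch of what such a script could look like. The file name, skill, and section names are all invented for illustration; the point is the shape: a deterministic check with a pass/fail exit code that Claude can act on.

```python
#!/usr/bin/env python3
"""Hypothetical scripts/validate.py: deterministic check that a drafted
proposal contains every required section heading."""
import sys
from pathlib import Path

# Invented section names for illustration
REQUIRED_SECTIONS = ["## Summary", "## Scope", "## Timeline", "## Pricing"]

def missing_sections(text: str) -> list[str]:
    """Return the required headings absent from the draft."""
    return [s for s in REQUIRED_SECTIONS if s not in text]

def main(path: str) -> int:
    missing = missing_sections(Path(path).read_text(encoding="utf-8"))
    if missing:
        print("FAIL: missing sections:", ", ".join(missing))
        return 1
    print("PASS: all required sections present")
    return 0

# Guarded so importing (or running without an argument) is harmless
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

The matching SKILL.md instruction would be something like "after drafting, run scripts/validate.py on the draft and fix anything it flags before returning the output."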
Anthropic's own PDF skill uses this pattern. The skill previously struggled with non-fillable forms because Claude was trying to position text at exact coordinates based on prose instructions alone. They moved that logic into a script that extracts the coordinates programmatically, and the problem went away.
One surface difference worth knowing: In Claude Code, scripts run natively. In Claude.ai and Cowork, code execution is off by default — you'll need to turn it on under Settings > Features before scripts in your skills will run.
You do not need to know how to code to write a script. You don’t even need to know when a script would be valuable. Just ask Claude, “Would this skill benefit from a script?” and it will tell you yay or nay. If yay, it will build and bundle the script for you. Just make sure that when you go to upload the skill, the package contains the script, not just the SKILL.md file.
Claude loads the full SKILL.md body every time the skill triggers, so file length has a direct cost. A 2,000-word SKILL.md adds 2,000 words of context overhead on every activation. Use the references/ folder for detailed documentation, example libraries, API guides, and anything Claude only needs for specific subtasks — point to those files from SKILL.md with a note on when to read them.
The practical limit is 5,000 words in SKILL.md before performance starts to degrade. Under 500 lines is a reasonable target.
Claude loads multiple skills at the same time, which means two things worth knowing.
The first is a design constraint: don't write instructions that try to override global behavior or step on what other skills might be doing. Keep your skill's instructions scoped to the specific task it covers. A skill that tells Claude to "always respond in bullet points" or "never use formal language" will conflict with every other skill in the library.
The second is a feature you can use intentionally. You can instruct Claude inside a SKILL.md file to invoke another skill as a step in the workflow. For example, an instruction like "once the post is drafted, use the ai-writing-guard skill to check it before returning the output" will chain the two skills together. Claude loads the second skill mid-workflow and applies it.
This makes it possible to build modular skill stacks where each skill does one thing well and hands off to the next. A content workflow might chain a voice skill, then a formatting skill, without any of that logic living in a single bloated SKILL.md.
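As a sketch (the workflow wording is invented; the skill names come from the example above), a chained step inside a SKILL.md body might read:

```markdown
## Workflow
1. Draft the LinkedIn post using the structure below.
2. Once the post is drafted, use the ai-writing-guard skill to check the
   draft before returning the output.
```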
In Claude Code, you can also do this more explicitly through the agent field in a skill's frontmatter, which spawns a subagent running a specific skill or agent configuration. That approach gives you more control over context isolation between steps.
If a skill is triggering for the wrong things, the fix is usually adding explicit exclusions to the description. Something like: "Do NOT use for simple data lookups or general questions. Use only for the full report generation workflow." Claude uses this to stay out of situations that don't warrant the skill.
A description that's too vague produces false triggers; one that's too narrow never fires. Negative triggers let you stay specific without overcorrecting.
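For instance (the skill name and wording are illustrative), a description with negative triggers might read:

```yaml
---
name: report-generator
description: Runs the full report generation workflow from raw data to formatted deliverable. Do NOT use for simple data lookups or general questions.
---
```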
Add a version field under metadata in your frontmatter. When you run benchmarks after a model update or after editing the skill, you'll want to know which version produced which results. Without versioning, benchmark data becomes hard to read.
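A sketch of versioned frontmatter (the field placement follows the "version under metadata" convention described above; the name and values are invented):

```yaml
---
name: report-generator
description: Runs the full report generation workflow.
metadata:
  version: 1.2.0
---
```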
If you're running skills via the API or Claude Code in a production pipeline, you can wire benchmark results into a CI system and tie them to version numbers. That's not available in Claude.ai or Cowork, where results stay local.
Running more than 20 to 50 skills simultaneously degrades performance. Claude loads all skill descriptions into context on every message to decide which skills to activate. At scale, that overhead adds up. Be selective about which skills stay on by default and which ones get enabled only when you're doing relevant work.
When a skill coordinates across more than one MCP integration, the order of calls and how data passes between steps has to be spelled out. Don't leave Claude to infer the sequence. Name each phase clearly, specify what data from phase one goes into phase two, and include a validation step before moving forward. Ambiguity in multi-MCP workflows compounds across steps.
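As a sketch (the tool pairing and phase names are invented for illustration), a multi-MCP workflow section might spell out the sequence like this:

```markdown
## Workflow
### Phase 1: Pull open tasks (Asana)
Fetch all open tasks in the project. Output: a list of task IDs and titles.

### Phase 2: Create pages (Notion)
For each task from Phase 1, create a page using the task title.
Input: the Phase 1 list.

### Validation
Before reporting done, confirm every Phase 1 task ID has a matching page.
If any are missing, stop and report which ones.
```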
Skills aren't set-and-forget. Models change, your processes evolve, and what worked three months ago might stop working without warning. The tools covered here (evals, benchmarks, blind comparisons, description optimization) exist so you don't have to guess whether your setup is still doing its job. Build the skill. Test it. Retest it when something changes. That's all there is to it.