The March of Nines: Why Agent Skills Alone Won't Reach Production Reliability

Andrej Karpathy put a name to a pattern that every team running agents in production eventually meets: the march of nines. The original context was self-driving cars, but the math applies just as cleanly to agentic AI pipelines. Both are multi-step sequential systems where each step depends on the state the previous step left behind. Failure compounds. The gap from “works in a demo” to “works at scale” is not one more engineering effort. It is several more, each as large as the first, and each less visible than the last.

Figure 1 - Compounding failure rate table showing 10-step workflow success rates at 90%, 99%, 99.9%, and 99.99% per-step reliability, with failures per day at ten runs per day.

Figure 1 - The Compounding Failure Table: Each row represents one nine of reliability. Moving from the first nine (90%) to the second nine (99%) takes roughly the same engineering effort as getting from 0% to 90% in the first place. The 10-step count is a stand-in for any real pipeline; the principle is the same whether your workflow has six steps or sixteen.

The march of nines, properly attributed

Karpathy’s quote, as recorded in Zvi Mowshowitz’s episode notes from the Dwarkesh podcast:

“Every single nine is the same amount of work. When you get a demo and something works 90% of the time, that’s just the first nine. Then you need the second nine, a third nine, a fourth nine, a fifth nine.”

He was talking about Waymo and Tesla, not Claude. The original passage: “It’s a march of nines (of reliability). Waymo isn’t economical yet, Tesla’s approach is more scalable, and to be truly done would mean people wouldn’t need a driver’s license anymore” [1].

The point: full self-driving reliability looks deceptively close once you have a working demo. The gap from a 90%-reliable demo to a 99.99%-reliable product is not 10% of the work. It is three more equally hard laps around the engineering track.

The same compounding shows up in agentic workflows. A 10-step pipeline at 90% per-step success completes the full chain about 35% of the time (0.9^10 ≈ 0.349). At 99% per step, that jumps to about 90%. The full table, including the next two nines:

Per-Step Success	10-Step Success	Expected Failures (at 10 runs per day)
90%	34.9%	~6.5 per day
99%	90.4%	~1 per day
99.9%	99.0%	~1 every 10 days
99.99%	99.9%	~1 every 100 days
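The table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every step fails independently with the same probability, which real pipelines only approximate; the function names are illustrative and not taken from any library.

```python
def chain_success(per_step: float, steps: int = 10) -> float:
    """Probability that every step in a sequential chain succeeds.

    Assumes steps fail independently with the same per-step reliability.
    """
    return per_step ** steps


def expected_failures_per_day(
    per_step: float, steps: int = 10, runs_per_day: int = 10
) -> float:
    """Expected number of failed runs per day at a given run volume."""
    return runs_per_day * (1.0 - chain_success(per_step, steps))


if __name__ == "__main__":
    for p in (0.90, 0.99, 0.999, 0.9999):
        print(
            f"{p:.2%} per step: {chain_success(p):.1%} end to end, "
            f"{expected_failures_per_day(p):.2f} expected failures per day"
        )
```

Changing `steps` shows how the same per-step reliability plays out for a six-step or sixteen-step workflow.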

Real workflows may have fewer steps, or human checkpoints that partially reset the compounding. The principle holds either way. Sequential multi-step systems have compounding failure, and that compounding is most punishing in the range where demos tend to live (90-95% per step): even at 95% per step, a 10-step chain completes only about 60% of the time. This is why your agent looks great on a Thursday afternoon and falls over in production the following Monday.

Why agent skills are the obvious and incomplete answer

Skills are the natural first response. You encode your standard operating procedure as a Markdown file. Step one: extract the relevant data. Step two: classify it. Step three: load the relevant playbook. Step four through ten: proceed accordingly. The agent reads the skill, follows the steps, and your 90%-reliable demo becomes noticeably more consistent.

This is exactly what Anthropic productized on January 30, 2026, with the launch of Claude Cowork plugins [3]. Eleven plugins were open-sourced at launch, covering productivity, enterprise search, sales, finance, data, legal, marketing, customer support, project management, and biology research. Each plugin bundles skills, connectors, slash commands, and sub-agents into a domain-specific package.

The market read this as a structural threat to vertical SaaS. In the days that followed, roughly $285 billion in SaaS market cap evaporated: Thomson Reuters fell 18%, LegalZoom dropped 20%, Pearson lost 7%, and the S&P 500 software and services index was off roughly 8% [7], [8]. The market’s logic was: if an agent skill can encode a contract-review workflow in Markdown and execute it reliably, what is the value proposition of the dedicated SaaS tool?

The answer is in the reliability math.

SkillsBench tested 84 tasks across 11 domains against 7 frontier model configurations including Claude, GPT, and Gemini [2]. The result: a baseline average pass rate without skills of 24.3%, rising to 40.6% with curated skills [2]. A +16.2 percentage point improvement is real and worth having.

But an average pass rate of 40.6% with curated skills means that more than half of agent runs still fail, even with the best available skills. And 16 of the 84 tasks showed a negative delta when skills were applied: skills made performance on those tasks worse, not better. Self-generated skills, on average, provided no benefit at all [2].

Figure 2 - Bar chart comparing SkillsBench pass rates: 24.3% baseline, 40.6% with curated skills, with annotation showing 16 of 84 tasks had negative delta with skills applied.

Figure 2 - SkillsBench Results: The +16.2 percentage point improvement from curated skills is real. But a 40.6% average pass rate still means most runs fail, and 16 tasks got worse with skills applied. Skills raise the floor without raising the ceiling. Source: SkillsBench [2].

This ceiling exists because skills are prompts. When you write a skill, you are asking the model to comply with your procedure, not enforcing it. There is a fundamental difference between a document that says “run the validation check at step seven” and a system that runs it regardless of what the model thinks. The model may decide step seven is unnecessary, compress later steps as context fills, or follow the skill precisely and still hallucinate step seven’s output in a way that propagates cleanly into steps eight, nine, and ten.

Figure 3 - Conceptual diagram showing skills as influence arrows pointing at an LLM node, versus a harness as code boxes surrounding the LLM node with deterministic enforcement arrows.

Figure 3 - Skills Are Influence, Harnesses Are Enforcement: A skill tells the model what to do. A harness makes sure it happens. The difference is not stylistic. It is architectural. Skills sit inside the model’s context window, subject to model judgment. Harness gates sit in the orchestration code, subject to a simple conditional check that the model cannot override.

What the upper nines actually require

The move from influence to enforcement is the architectural shift that gets you past the second nine.

Stripe’s Minions system is the most concrete public example of this shift at production scale. Over 1,300 AI-only pull requests are merged at Stripe every week [5]. Every Minion-produced PR is human-reviewed before merging, but the coding and testing steps run fully unattended. Test validation runs against a subset of Stripe’s battery of more than 3 million tests, and the agents work through nearly 500 MCP tools exposed by an internal server called Toolshed [6].

Here is the critical design choice: Stripe did not rely on a prompt telling the agent to run tests. They baked test validation into the pipeline as a non-negotiable step. The test suite runs against every Minion-generated change, not because the model remembered to do it, but because the harness enforces it.

A harness gate cannot be argued with. It cannot be forgotten. It cannot be skipped because the model has decided the change looks clean without testing. It runs. Every time.
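To make the difference concrete, here is a minimal sketch of an enforced test gate. It is not Stripe's implementation: `Change`, `generate_change`, and `run_tests` are hypothetical names, and the test runner is passed in as a plain callable so the example stays self-contained. The point is only that the gate runs in orchestration code, whatever the model reports about its own work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Change:
    diff: str
    model_says_tested: bool  # the model's own claim; the harness never reads it


def run_with_test_gate(
    generate_change: Callable[[], Change],
    run_tests: Callable[[str], bool],
) -> Change:
    """Generate a change, then run the test gate unconditionally."""
    change = generate_change()      # LLM-driven step: influence
    if not run_tests(change.diff):  # deterministic step: enforcement
        raise RuntimeError("test gate failed: change rejected before review")
    return change
```

Because `run_tests` is called on every path and `model_says_tested` is never consulted, no model output can cause the check to be skipped.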

Figure 4 - Stripe Minions architecture: LLM nodes for code generation connected to a deterministic test-validation gate and a final human review gate.

Figure 4 - Stripe Minions: LLM Nodes Plus Deterministic Gates: Each Minion runs as a one-shot coding agent triggered by a Slack emoji reaction. The code generation and exploration steps are LLM-driven and benefit from skills and context. The test validation step is a deterministic gate that runs against a subset of Stripe’s more than 3 million tests regardless of model confidence. Human review is the final gate before merge. This is not a coincidence of Stripe’s particular situation. It is the pattern that makes the system work.

The infrastructure behind Minions is not a recipe to follow literally: it rests on a decade of investment in test suites, standardized devbox environments, and nearly 500 internal tools. Treat it instead as a proof of concept for a principle: identify which steps cannot fail, then enforce those steps in code rather than in a prompt.

Figure 5 - Comparison table of skills versus harness gates across six dimensions: step enforcement, validation, error recovery, auditability, context isolation, cost control.

Figure 5 - Skills Versus Harness Gates: The structural difference across six operational dimensions. Skills excel at encoding domain knowledge and guiding model judgment within a phase. Harness gates excel at guaranteeing that phases happen in the right order with the right checks between them. The choice is not which one to use. It is which steps get which treatment.

Prompting is influence and code is enforcement, and you cannot prompt your way to the upper nines any more than you can ask a checklist to run itself.

Choosing: skills versus harness gates, step by step

The practical question is not “should I use skills or build a harness.” It is “which steps in this specific workflow get which treatment.”

Here is the heuristic we use. For each step in a workflow, ask: “If the model skipped this step or got it wrong, would the downstream workflow still be acceptable?”

If yes, the step is a skill candidate. These are judgment steps: expanding a dataset, drafting an explanation, classifying ambiguous input, generating a summary. The model’s judgment is the point. A skill guides that judgment. Occasional errors are correctable or tolerable.

If no, the step is a harness gate. These are enforcement steps: confirming a required input exists before proceeding, validating output schema before passing to the next phase, running a test suite before a PR is raised, refusing to advance until a required approval is logged. The model’s judgment is not the point. The step must happen.

In our experience, production workflows split roughly 70 to 30 or 80 to 20 between skill-appropriate steps and gate-appropriate steps. The gates are few. They are load-bearing. Missing one gate is where the upper nines collapse.
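The heuristic above can be written down as a small triage pass over a workflow. This is an illustrative sketch only: the `Step` type and `triage` function are hypothetical, and the yes-or-no answer for each step still has to come from someone who knows the workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    # Answer to: "If the model skipped this step or got it wrong,
    # would the downstream workflow still be acceptable?"
    acceptable_if_skipped_or_wrong: bool


def triage(steps: list[Step]) -> dict[str, list[str]]:
    """Route each step to 'skill' (judgment) or 'gate' (enforcement)."""
    plan: dict[str, list[str]] = {"skill": [], "gate": []}
    for step in steps:
        kind = "skill" if step.acceptable_if_skipped_or_wrong else "gate"
        plan[kind].append(step.name)
    return plan


workflow = [
    Step("classify ambiguous input", True),
    Step("draft summary", True),
    Step("validate output schema", False),
    Step("run test suite before PR", False),
]
plan = triage(workflow)
```

Here `plan["gate"]` holds the two steps that belong in orchestration code, and `plan["skill"]` holds the two that a Markdown skill can guide.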

Figure 6 - Decision tree: for each workflow step, branch on whether the result is still acceptable if the model skips or errs there, routing to skill candidate or harness gate.

Figure 6 - The Skill-or-Gate Decision Tree: Apply this to every step in your workflow before building. The judgment steps get skills. The cannot-fail steps get code. In a 10-step workflow, you might find 2 or 3 gate-worthy steps. Those 2 or 3 determine your production reliability ceiling more than anything else in the system.

To make this concrete: the pipeline that produced this post has several skill-appropriate steps (querying the KB, drafting prose, generating figure prompts) and a small set of gates (research brief must exist before drafting, image prompts must exist before reviewer approval, reviewer verdict must be APPROVED before publication). The writer is a skill. The orchestrator enforces the gates.
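As a rough sketch of what those three gates could look like in an orchestrator (this is not the actual pipeline code, and the file names `research_brief.md`, `image_prompts.md`, and `review_verdict.txt` are assumptions made for illustration):

```python
from pathlib import Path


class GateError(RuntimeError):
    """Raised when a required precondition is not met."""


def require_file(path: Path, label: str) -> None:
    """Fail unless the artifact exists and is non-empty."""
    if not path.is_file() or path.stat().st_size == 0:
        raise GateError(f"{label} missing or empty: {path}")


def gate_before_drafting(workdir: Path) -> None:
    # The research brief must exist before the writer runs.
    require_file(workdir / "research_brief.md", "research brief")


def gate_before_review(workdir: Path) -> None:
    # Image prompts must exist before the reviewer can approve.
    require_file(workdir / "image_prompts.md", "image prompts")


def gate_before_publication(workdir: Path) -> None:
    # The reviewer verdict must be exactly APPROVED.
    verdict_file = workdir / "review_verdict.txt"
    require_file(verdict_file, "reviewer verdict")
    if verdict_file.read_text().strip() != "APPROVED":
        raise GateError("reviewer verdict is not APPROVED")
```

The orchestrator would call each gate before starting the corresponding phase, so a missing artifact stops the run instead of letting a later phase work from nothing.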

A practitioner story

A year ago we were on a Servoy and Postgres modernization engagement where the agent demo was genuinely impressive. The proof-of-concept ran cleanly against every test we wrote. Every edge case we could think of passed. The client demo went well. Then we ran the agent against ten months of production data.

It failed two-thirds of the way through, every time at the same point in the pipeline. The agent was hitting a class of records it had not been tested against, generating an intermediate output that passed its own internal checks but failed silently when passed to the next phase. Because there was no gate between those two phases, the failure propagated, accumulated, and eventually crashed the run.

The fix was not a better prompt. We added a harness gate between the two phases: a schema validation check that compared the intermediate output against an expected structure before advancing. If the check failed, the harness restarted that phase from the last checkpoint and logged the failure. The model did not get smarter. The system got stricter, and the failure rate dropped to near zero.
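A gate of this kind can be sketched in a few lines. The code below is a simplified illustration, not the code from the engagement: `REQUIRED_FIELDS` is a hypothetical expected structure, the phase is any callable that takes a checkpoint, and the `max_attempts` cap is an added assumption so that the sketch cannot loop forever.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("harness")

# Hypothetical expected structure for the intermediate output.
REQUIRED_FIELDS = {"record_id": int, "status": str}


def matches_schema(output: Any) -> bool:
    """Return True if output is a dict with the required fields and types."""
    return isinstance(output, dict) and all(
        isinstance(output.get(name), kind)
        for name, kind in REQUIRED_FIELDS.items()
    )


def run_phase_with_gate(
    phase: Callable[[dict], Any],
    checkpoint: dict,
    max_attempts: int = 3,
) -> dict:
    """Run a phase from its checkpoint until the output passes validation."""
    for attempt in range(1, max_attempts + 1):
        # Pass a copy so every restart begins from the same checkpoint state.
        output = phase(dict(checkpoint))
        if matches_schema(output):
            return output
        logger.warning(
            "schema gate failed (attempt %d of %d); restarting phase",
            attempt,
            max_attempts,
        )
    raise RuntimeError("phase output never passed the schema gate")
```

The next phase only ever receives the return value of `run_phase_with_gate`, so malformed intermediate output cannot propagate silently.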

Figure 7 - Phase restart on schema-validation failure: harness logs the checkpoint, restarts the failed phase, and re-validates before advancing.

Figure 7 - Phase Restart on Validation Failure: A harness gate catches a schema mismatch between Phase N and Phase N+1. Rather than propagating the bad output or failing the entire run, the harness logs the failure, restarts Phase N from the last checkpoint, and re-validates before allowing the pipeline to advance. The model’s role is the same as before. The harness added the guarantees the model could not provide itself.

The demo passed because we only tested the happy path. The production system failed because real data is not the happy path. The harness gate was the only thing that could make the system resilient to that gap.

What to take away

Three things, plainly.

First: skills and harnesses are not competing choices. They are complementary tools for different steps. Skills encode judgment. Harnesses enforce structure. You need both in any production workflow of meaningful complexity.

Second: the upper nines come from deterministic gates, not prompt refinement. If your agent is failing in production at an unacceptable rate, the answer is almost never a better system prompt. It is a gate at the step where the failure is occurring.

Third: the productive question is not “how do I prompt better?” It is “which steps in my workflow cannot be allowed to fail?” That list is your harness design. Encode those steps in code, not in a Markdown file the model may choose to interpret.

Figure 8 - Summary card: three takeaways with a skills section, a harness gates section, and the central design question "which steps cannot fail?"

Figure 8 - Three Things to Take Away: Skills for judgment, gates for enforcement, and the reliability question as the design input. Most agent reliability work lives at the intersection of these three. The teams that have crossed the production threshold built gates before they stopped blaming the model.

For teams that have a working demo but are watching it fail in production, this is the conversation worth having. We have built this infrastructure for Servoy shops and standalone AI pipelines alike. The pattern transfers cleanly across both.

The Series

This is Part 1 of the three-part Production Reliability sub-series:

1. The March of Nines: Why Agent Skills Alone Won’t Reach Production Reliability (this article). Karpathy’s reliability math, why skills alone plateau, and where deterministic harness gates earn their place.
2. The Adversarial Evaluator: GAN-Inspired Harness Architecture from Anthropic. A generator-evaluator pattern that pushes per-step success rates toward the upper nines.
3. Stripe Minions and the Hybrid Secret: Deterministic Rails Around AI. The production-scale case study, the test-gate pattern, and the architecture that makes 1,300 AI-only PRs per week sustainable.

References

[1] A. Karpathy and D. Patel, “AGI is still a decade away,” Dwarkesh Podcast, Oct 2025. https://www.dwarkesh.com/p/andrej-karpathy

[2] R. Bao et al., “SkillsBench: Evaluating Agent Skills Across Domains,” arXiv preprint arXiv:2602.12670, Feb 2026. https://arxiv.org/abs/2602.12670

[3] Anthropic, “Introducing Claude Cowork plugins,” Anthropic News, Jan 2026. https://www.anthropic.com/news/cowork-plugins

[4] Anthropic, “Skill authoring best practices,” Claude Platform Documentation, 2025. https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices

[5] A. Gray, “Minions: Stripe’s AI coding agent,” Stripe Engineering Blog, Feb 2026. https://stripe.com/blog/minions

[6] ByteByteGo, “How Stripe builds and ships AI coding agents at scale,” ByteByteGo Newsletter, 2026. https://blog.bytebytego.com/p/stripe-minions

[7] CNBC, “Vertical SaaS sells off as Anthropic ships agent plugins,” CNBC Business, Feb 2026. https://www.cnbc.com/

[8] Fortune, “AI agents and the $285B SaaS reckoning,” Fortune, Feb 2026. https://fortune.com/

[9] G. Dotzlaw, K. Dotzlaw, and R. Dotzlaw, “What Is an Agent Harness, Really?”, 2026. /insights/claude-code-01-agent-harness/