Ten engineers, the output of thirty: the honest enterprise agentic playbook

Q: What is agentic AI for software teams?

AI agents that execute units of work end to end (read an issue, plan, write code, run tests, fix in a loop) rather than only suggesting code, moving the bottleneck from writing code to reviewing logic.

Q: What is the Agency Router?

A routing layer that assigns each task to the cheapest tier of model autonomy that can complete it reliably, so you spend the agency each task deserves. The router, not the model, is the product.

Q: How much productivity gain is realistic from AI coding agents?

A blended 1.7 to 2.1 times within six months, not 3x. Boilerplate compresses ~2.8x, refactoring ~2.4x, novel architecture ~1.1x; on familiar code a careless rollout can land below 1.0.

Q: Should AI coding agents run on local, self-hosted, frontier, or managed-cloud models?

All four, with a router choosing per task: local for routine/sensitive, self-hosted for 75-85% of volume, frontier for the hard 15-25%, managed cloud only for compliance. Self-hosted is ~8-15x cheaper all-in at volume.

Q: Build or buy: what agentic platform should an enterprise own?

Buy the models; build the platform. The router, the evaluation harness, and the guardrails are the product. Subscription seats are cheapest to start and most expensive at scale with the least control.

Q: What governance, evals, and guardrails do AI coding agents need before production?

An eval set of 50-100 real tickets run on every model/router change, budget caps and sensitive-repo pins at the gateway, secret redaction, and a human production gate. Without measured evals you govern on vibes.

Q: How do you keep proprietary code and data safe with AI agents?

Enforce the privacy order in the architecture: local, then self-hosted, then private cloud endpoint, then API with a no-train agreement; never put proprietary code in a consumer chatbot. Pin sensitive repos and redact at the gateway.

Q: What is agent slop and how do you prevent it?

Plausible-but-wrong code produced at volume. Prevent it with human review that scales with output, tests gating the agent loop, an eval set that catches regressions, and a critic pass before merge.

Q: How should a CIO brief the CEO on cost and ROI?

Frame it as unit cost: a 35-45% lower cost per feature on most work. For 20 developers the tiered platform is ~$50-60k/month all-in and delivers the output of roughly thirty to forty engineers, versus $150k+/month to hire the extra engineers instead.

Q: What is a realistic six-month rollout for agentic AI in engineering?

Weeks 1-4 pilot one repo and build the eval set; months 2-4 stand up the router and self-host a model; months 5-6 put agents in the pipeline with evals gating every change. At 90 days, if nothing moved, you bought tools, not a multiplier.

In short

Every vendor is selling a 3x engineering team. The number that survives contact with a real codebase is smaller and more useful: roughly 1.7 to 2.1 times blended throughput in six months, which reads as ten people delivering the work of fifteen to twenty, more on narrow work. You earn it from a router that runs four tiers and spends the agency each task deserves, not from buying model access. This post is the on-ramp. The full field guide, with the numbers and the roadmap, is a free ebook at the bottom.

A vendor will tell you that handing your developers an AI assistant turns ten of them into thirty. A skeptic will tell you it turns them into ten developers who write worse code faster. Both are wrong, and the gap between them is where the actual decision lives.

I have spent twenty-five years watching tools get sold as multipliers. Most were not. This one is, on a specific slice of the work, if you build the system around it. The trouble is that the system, not the model, is the hard part, and the system is the part nobody puts on the invoice.

Where should AI coding agents run? Local, self-hosted, frontier, or managed cloud

The first mistake is asking "which AI do we buy." The right question is "where does each task run." A serious organization runs four tiers at once and routes each request to the cheapest one that can do it well.

Local for the routine, a self-hosted open model for the bulk, a frontier model for the hard part, managed cloud only where compliance demands it. The router chooses per request.

The economics are the reason this matters. A 32B-class open model such as Qwen, served on one well-utilized high-memory card, costs roughly twenty to forty cents per million tokens. A frontier model costs around fifteen dollars for the same million. On token price that is roughly forty to seventy-five times cheaper; once you count the operations team, power, and the frontier vendors' prompt caching, the honest all-in gap is closer to eight to fifteen times. Still large, and your code never leaves your network. (Size the model to the hardware: a 600B-class model like DeepSeek needs a multi-GPU node, not a single card.) So the goal is not to run the cheap model everywhere, which caps your quality, nor the expensive one everywhere, which empties your budget. The goal is a router that knows the difference.
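
The token-price arithmetic above, and what a routing split does to the blended price, can be checked in a few lines. This is a sketch under stated assumptions: the prices and the 15 to 25 percent frontier share come from the text, while the function names are illustrative, the local tier is left out, and the share of tokens is assumed to equal the share of tasks. It covers token price only, not the operations, power, and caching effects that narrow the all-in gap.

```python
# Per-million-token prices quoted in the text (USD); names are illustrative.
SELF_HOSTED_LOW, SELF_HOSTED_HIGH = 0.20, 0.40
FRONTIER = 15.00


def price_ratio(frontier: float, self_hosted: float) -> float:
    """How many times cheaper a self-hosted token is than a frontier token."""
    return frontier / self_hosted


def blended_price(frontier_share: float, self_hosted: float,
                  frontier: float = FRONTIER) -> float:
    """Average price per million tokens when `frontier_share` of tokens go to
    the frontier model and the rest stay self-hosted.
    Assumption (not from the text): token share equals task share."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return frontier_share * frontier + (1.0 - frontier_share) * self_hosted


if __name__ == "__main__":
    print(f"token-price gap: {price_ratio(FRONTIER, SELF_HOSTED_HIGH):.1f}x "
          f"to {price_ratio(FRONTIER, SELF_HOSTED_LOW):.1f}x")
    for share in (0.15, 0.25):
        print(f"{share:.0%} frontier -> "
              f"${blended_price(share, SELF_HOSTED_HIGH):.2f} per million tokens")
```

Even at the expensive end of the self-hosted range, sending a quarter of the tokens to the frontier model lands near four dollars per million rather than fifteen, which is the case for a router in one number.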

The Agency Router

The Agency Router is the routing layer that assigns each task to the cheapest tier of model autonomy that can complete it reliably, so you spend the agency each task deserves and no more. It is the control plane of an agentic software team: it scores every request for difficulty and sensitivity, sends most work to a local or self-hosted model, escalates the genuinely hard 15 to 25 percent to a frontier model, keeps sensitive code in-house, and enforces the budget. The router, not the model, is the product.

One honesty note on the title. The pitch says thirty; the number you can defend to a board is closer to twenty. A well-run team of ten reads like fifteen to twenty across a normal portfolio, and only approaches twenty-five to thirty on narrow, mechanical work. Thirty is the ceiling on the easy slice, not the average. The rest of this piece is about earning the honest 1.7 to 2.1 times, because that is the number that survives the first skeptical CTO.

Which task goes to which tier is the routing decision in one table. This is the matrix the Agency Router applies on every request.

Task	Tier the router picks	Why
Autocomplete, docs, small edits	① Local	routine, private, zero token cost
Private-repo questions, sensitive code	① Local	nothing leaves the building
Boilerplate, scaffolding, tests, most refactors	② Self-hosted	the 75 to 85 percent the open model handles well
Cross-repo migration, hard multi-file debugging	③ Frontier	the hard 15 to 25 percent where quality pays for itself
Novel architecture, subtle correctness	③ Frontier + human	judgment the model does not have
Regulated, residency-bound workloads	④ Managed cloud	only where compliance forces it
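
The matrix above can be sketched as a small lookup. This is a minimal illustration, not the router itself: the task labels, the `Task` fields, and the precedence (sensitivity checked before regulation, unknown work defaulting to the self-hosted tier because cheap is the default) are all assumptions, and a real router would score difficulty and sensitivity rather than trust a label.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    SELF_HOSTED = "self-hosted"
    FRONTIER = "frontier"
    FRONTIER_WITH_HUMAN = "frontier + human"
    MANAGED_CLOUD = "managed cloud"


@dataclass(frozen=True)
class Task:
    kind: str                # hypothetical label, e.g. "boilerplate"
    sensitive: bool = False  # touches a private or pinned repository
    regulated: bool = False  # regulated or residency-bound workload


# Hypothetical task labels mapped to the tiers in the table.
ROUTES = {
    "autocomplete": Tier.LOCAL,
    "docs": Tier.LOCAL,
    "small_edit": Tier.LOCAL,
    "boilerplate": Tier.SELF_HOSTED,
    "scaffolding": Tier.SELF_HOSTED,
    "tests": Tier.SELF_HOSTED,
    "refactor": Tier.SELF_HOSTED,
    "cross_repo_migration": Tier.FRONTIER,
    "multi_file_debugging": Tier.FRONTIER,
    "novel_architecture": Tier.FRONTIER_WITH_HUMAN,
    "subtle_correctness": Tier.FRONTIER_WITH_HUMAN,
}


def route(task: Task) -> Tier:
    """Pick the cheapest tier the table allows for this task."""
    if task.sensitive:
        return Tier.LOCAL           # nothing leaves the building
    if task.regulated:
        return Tier.MANAGED_CLOUD   # only where compliance forces it
    # Assumed default: unknown work starts on the cheap workhorse tier.
    return ROUTES.get(task.kind, Tier.SELF_HOSTED)
```

For example, `route(Task("cross_repo_migration"))` picks the frontier tier, while the same task with `sensitive=True` stays local under the precedence assumed here.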

What governance and security do AI coding agents need?

This is the question that decides whether the rollout survives the security review, and it is the reason the self-hosted tier exists. Agents that read your repositories and call tools are a new attack surface and a new path for intellectual property to leak. Before a single agent touches production, the CISO needs four answers, enforced in the architecture rather than in a policy memo.

Data residency and IP. Pin sensitive repositories to the local or self-hosted tier; redact secrets at the gateway; verify no-train terms in every vendor contract; never allow proprietary code into a consumer chatbot. The privacy order, most protective first: local, self-hosted internal, private cloud endpoint, API with a no-train agreement.
Quality as a measured number. An evaluation harness built from 50 to 100 real tickets in your history, run on every model or router change. Without it you are governing on vibes, and vibes do not survive an incident.
Cost containment. The router enforces budget caps; cheap is the default; the runaway ceiling is alarmed; the routing split is a live dashboard.
Accountability. If an agent ships a vulnerability, a named human owns the routing decision and the merge, not just the code. A human gate stays on production.
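
Three of those answers (repo pins, secret redaction, and the budget cap) can be enforced at a single choke point. The sketch below shows one way that might look; the repository name, the secret patterns, and the `Gateway` class are hypothetical, and a production gateway would rely on a maintained secret scanner rather than two regular expressions.

```python
import re

# Illustrative patterns only; not a complete secret scanner.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]

PINNED_REPOS = {"payments-core"}           # hypothetical sensitive repository
IN_HOUSE_TIERS = {"local", "self-hosted"}  # tiers where code never leaves


class BudgetExceeded(Exception):
    """Raised when a request would push spend past the monthly cap."""


def redact(text: str) -> str:
    """Replace anything that looks like a secret before it leaves the gateway."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class Gateway:
    def __init__(self, monthly_cap_usd: float) -> None:
        self.monthly_cap_usd = monthly_cap_usd
        self.spent_usd = 0.0

    def admit(self, repo: str, tier: str, prompt: str, est_cost_usd: float) -> str:
        """Return the redacted prompt, or refuse the request outright."""
        if repo in PINNED_REPOS and tier not in IN_HOUSE_TIERS:
            raise PermissionError(f"{repo} is pinned to in-house tiers")
        if self.spent_usd + est_cost_usd > self.monthly_cap_usd:
            raise BudgetExceeded("monthly cap reached")
        self.spent_usd += est_cost_usd
        return redact(prompt)
```

Putting the checks in the request path, rather than in a policy document, is the point: a pinned repository cannot reach a frontier API even when a developer asks for it.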

When agents lose money, and what happens to your engineers

An honest guide names the cases where this fails. Agents lose money on familiar code worked by senior engineers, where the review-and-correct loop costs more than the help; on tangled legacy with no fast tests, where the agent gets lost and confidently ships defects; and any time leadership budgets for a flat 3x and meets a 1.6x reality on maintenance work. The fix is always the same: fix the prerequisites before you scale, and route those tasks to a human.

The workforce question is the one a CEO asks next. This does not empty the org chart; it changes its shape. Your seniors stop spending afternoons on boilerplate and spend them on architecture, review, and judgment, which is where the agents cannot follow. Juniors need deliberate protection: rotate them through agent-off work and make review a teaching moment, or you raise a generation that never learns to debug. The new roles are the platform engineers who own the router and the evaluation harness, and they are the hire that actually pays for itself.

Do AI coding agents really make teams faster? The honest multiplier

The gains are not uniform, and the flat "3x across everything" is marketing. Boilerplate, scaffolding, test generation, and mechanical refactoring compress dramatically. Novel architecture and tangled legacy barely move. On familiar code worked by senior engineers, a careless rollout can even land below 1.0, slower than no agent, when the review loop costs more than the help. Your blended figure depends on your portfolio; for a real mixed codebase it lands between 1.7 and 2.1 times after five or six months of disciplined work. The slice figures below are what we measured on real pilots, not a vendor's chart.

Work	Realistic effect	Why
Boilerplate, scaffolding	about 2.8×	well specified, low judgment
Refactoring with retrieval	about 2.4×	the agent works from real code
Test generation	about 2.2×	mechanical, verifiable
Novel architecture, business logic	about 1.1×	this is judgment; the model does not have it
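
To see how slice figures turn into a blended number, one reasonable model (an assumption here, not necessarily how the pilots were scored) is to weight each slice by the share of engineering time it takes today and divide that time by the slice's speedup. The portfolio mix in the usage lines is invented for illustration; only the four speedups come from the table.

```python
# Slice speedups from the table above.
SPEEDUP = {
    "boilerplate": 2.8,
    "refactoring": 2.4,
    "test_generation": 2.2,
    "novel": 1.1,
}


def blended_multiplier(time_share: dict[str, float]) -> float:
    """Blended throughput gain for a portfolio.

    `time_share` maps each slice to the fraction of engineering time it
    consumes before agents; the fractions must sum to 1. Each slice's time
    shrinks by its speedup, and the blended gain is old time over new time.
    """
    if abs(sum(time_share.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    new_time = sum(share / SPEEDUP[kind] for kind, share in time_share.items())
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical mix: 30 percent of time is novel work that barely moves.
    mix = {"boilerplate": 0.25, "refactoring": 0.25,
           "test_generation": 0.20, "novel": 0.30}
    print(f"blended: {blended_multiplier(mix):.2f}x")
```

Under this model the slow slice dominates: however fast boilerplate gets, a portfolio that is mostly novel architecture stays close to 1.1 times.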

And the multiplier is gated by the codebase, not the model. An agent loses its way in a tangled repo the same way a new hire does, only faster and more confidently. Before you count a single hour saved, you pay for codebase hygiene, fast and trustworthy tests, disciplined review, real guardrails, and hardware that does not throttle your developers. None of that is on the vendor's invoice. All of it is on yours.

The part the pitch gets backwards

The win is not a bigger team. It is a lower cost per feature.

Framed as headcount, this sounds like a Gartner slide. Framed as unit cost, it sounds like an operator. Done right, it drops the cost of shipping a feature on most of your portfolio by 35 to 45 percent: ten engineers stop spending their afternoons on boilerplate, migrations, and the third near-identical CRUD endpoint, and spend them on the design and judgment only a human brings. Take 20 developers. The tiered mix here runs $6,000 to $12,000 a month in tokens and GPU, plus one or two platform engineers at $40,000 to $50,000 fully loaded, so call it $50,000 to $60,000 all-in, against $25,000 to $50,000 of undisciplined frontier spend that still needs the same staffing, and against the $150,000-plus a month the extra engineers it replaces would cost. Fully loaded on both sides, the platform wins by more than half.
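
The arithmetic in that paragraph is worth laying out, since it is the part a CFO will check. The figures below are the monthly ranges quoted above for 20 developers; the variable and function names are illustrative, and the per-developer line is simply the all-in range divided by headcount.

```python
# Monthly figures from the text (USD), for a team of 20 developers.
DEVELOPERS = 20
TOKENS_AND_GPU = (6_000, 12_000)
PLATFORM_ENGINEERS = (40_000, 50_000)  # one or two, fully loaded
EXTRA_ENGINEERS = 150_000              # lower bound of the "$150,000-plus"


def all_in_range() -> tuple[int, int]:
    """Low and high monthly cost of the tiered platform, staff included."""
    return (TOKENS_AND_GPU[0] + PLATFORM_ENGINEERS[0],
            TOKENS_AND_GPU[1] + PLATFORM_ENGINEERS[1])


def saving_vs_hiring(platform_cost: float,
                     hiring_cost: float = EXTRA_ENGINEERS) -> float:
    """Fraction saved by running the platform instead of hiring."""
    return 1.0 - platform_cost / hiring_cost


if __name__ == "__main__":
    low, high = all_in_range()
    print(f"all-in: ${low:,} to ${high:,} per month")
    print(f"per developer: ${low // DEVELOPERS:,} to ${high // DEVELOPERS:,}")
    print(f"saving vs hiring, worst case: {saving_vs_hiring(high):.0%}")
```

The raw sum is $46,000 to $62,000, which the text rounds to $50,000 to $60,000; even at the top of the range the saving against $150,000 stays above half.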

The expensive, exciting half of this, the swarm of agents, is not where the discipline lives. The discipline lives in the quiet router that looks at most tasks and decides the cheap model is enough. A committee of language models is not a strategy. It is a bill. The strategy is knowing which questions are worth paying for, and which a twenty-cent model already answers as well as a fifteen-dollar one.

That is why most organizations get the same thing wrong. They buy the model and skip the platform. The routing, the evaluations, and the guardrails are the thing you are actually building, and they are exactly what no one budgets for. Build them and the multiplier shows up, and the platform itself becomes an asset a competitor cannot buy off a shelf. Skip them and you have paid frontier prices for autocomplete and shipped a faster mess.

What a CEO should ask before signing

What is our blended cost per developer per month, and where does it land if usage triples?
Which repositories can never leave the building, and is that wired into the architecture rather than written in a policy nobody reads?
What does quality look like as a number we can watch, and who owns it before it ever reaches a developer?
On our own portfolio, not the brochure's, what multiplier do we actually expect, and what is the ninety-day test that proves it?

If the plan in front of you cannot answer those four, it is a vendor's plan, not yours.

The full field guide

This post is the on-ramp. The complete strategy, with the four tiers and the decision tree, the real cost models, the privacy ranking, the six-month roadmap, the seven failure modes, and a one-page cheat-sheet a CIO can hand straight to a CEO, is a free field guide. Open it in the reader, or take the PDF.


Ten Engineers, the Output of Thirty

A CIO's field guide to agentic software teams · 7 chapters + cheat-sheet · 18 pages


Frequently asked questions

What is agentic AI for software teams?

Agentic AI for software teams is the use of AI agents that do not just suggest code but execute units of work end to end: they read an issue, plan a change across files, write the code, run the tests, read the failures, and fix them in a loop until the work is done or they ask for help. It moves the bottleneck from writing code to reviewing logic.

What is the Agency Router?

The Agency Router is a routing layer that assigns each task to the cheapest tier of model autonomy that can complete it reliably, so you spend the agency each task deserves. It scores every request for difficulty and sensitivity, sends most work to a local or self-hosted model, escalates the hard 15 to 25 percent to a frontier model, pins sensitive code in-house, and enforces the budget. The router, not the model, is the product.

How much productivity gain is realistic from AI coding agents?

The honest, board-defensible figure is a blended 1.7 to 2.1 times within six months, not the 3x vendors claim. Boilerplate and scaffolding compress around 2.8x, mechanical refactoring around 2.4x, while novel architecture barely moves at about 1.1x. Your real number depends on your portfolio, and on familiar code worked by senior engineers a careless rollout can even land below 1.0.

Should AI coding agents run on local, self-hosted, frontier, or managed-cloud models?

All four, with a router choosing per task. Local for routine and sensitive work at zero token cost; a self-hosted open model for 75 to 85 percent of the volume; a frontier API for the genuinely hard 15 to 25 percent; managed cloud only where compliance forces it. Self-hosted is roughly 8 to 15 times cheaper all-in than frontier at volume.

Build or buy: what agentic platform should an enterprise own?

Buy the models; build the platform. The router that decides how much to spend, the evaluation harness that governs quality, and the guardrails that contain risk are the product, and they are the part nobody budgets for. Subscription seats feel cheapest to start and become the most expensive at scale with the least control over where your code goes.

What governance, evals, and guardrails do AI coding agents need before production?

An evaluation set built from 50 to 100 real tickets out of your own history, run on every model or router change; budget caps and sensitive-repo pins enforced at the gateway; redaction of secrets before anything leaves the building; and a human gate on production. Without measured evals you are governing on vibes, and vibes do not survive an incident.
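
A gate of that kind can be very small. The sketch below assumes each golden ticket is a prompt plus a check function that says whether the agent's output resolves it; the `Ticket` shape, the function names, and the tolerance parameter are hypothetical, and a real harness would run the repository's tests rather than compare strings.

```python
from typing import Callable, Sequence

# Hypothetical shape: (prompt, check), where check(output) reports whether
# the agent's output resolves the ticket.
Ticket = tuple[str, Callable[[str], bool]]


def pass_rate(solve: Callable[[str], str], tickets: Sequence[Ticket]) -> float:
    """Fraction of golden tickets the candidate model or router resolves."""
    if not tickets:
        raise ValueError("the evaluation set is empty")
    passed = sum(1 for prompt, check in tickets if check(solve(prompt)))
    return passed / len(tickets)


def change_may_ship(candidate: float, baseline: float,
                    tolerance: float = 0.0) -> bool:
    """Allow a model or router change only if quality has not regressed
    by more than `tolerance` against the recorded baseline."""
    return candidate >= baseline - tolerance
```

Run on every model or router change, the pair turns quality into a number: compute `pass_rate` for the candidate, compare it with the stored baseline, and block the change when `change_may_ship` returns `False`.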

How do you keep proprietary code and data safe with AI agents?

Enforce the privacy order in the architecture, not a memo: local only, then self-hosted internal, then a private cloud endpoint, then a standard API with a no-train agreement, and never proprietary code in a consumer chatbot. Pin sensitive repositories to local or self-hosted tiers and redact at the gateway. This is the main reason the self-hosted tier exists.

What is agent slop and how do you prevent it?

Agent slop is plausible-but-wrong code produced at volume; three agents can agree and be wrong together. You prevent it by scaling human review with output, gating the agent loop on fast trustworthy tests, running an evaluation set that catches regressions, and adding a critic pass before anything merges.

How should a CIO brief the CEO on cost and ROI?

Frame it as unit cost, not headcount: this drops the cost of shipping a feature on most of the portfolio by 35 to 45 percent. For 20 developers the tiered platform runs about 50 to 60 thousand dollars a month all-in and delivers the output of roughly thirty to forty engineers, against 150 thousand-plus a month to hire the extra engineers that output replaces.

What is a realistic six-month rollout for agentic AI in engineering?

Weeks 1 to 4: pilot one clean repo with subscription seats and build the evaluation set. Months 2 to 4: stand up the router, self-host an open model, pin sensitive repos. Months 5 to 6: put agents in the pipeline for test generation, review, and bug-fix pull requests, with evals gating every change. At ninety days, if cycle time has not moved and everything still goes to the frontier model, you bought tools, not a multiplier.

Glossary

Agentic AI: AI that executes whole units of work autonomously (plan, write, test, fix in a loop), rather than only suggesting the next line.
AI coding agent: An agent that takes an engineering task off a developer's plate and returns it as a reviewed pull request.
The Agency Router: The routing layer that sends each task to the cheapest tier of model autonomy that can do it reliably, spending the agency each task deserves.
The Four Tiers: Local (no tokens), self-hosted (the workhorse), frontier API (the hard 15 to 25 percent), and managed cloud (compliance only).
The Honest Multiplier: The blended, board-defensible productivity gain from agentic engineering: 1.7 to 2.1 times, not the 3x ceiling.
Agent slop: Plausible-but-wrong code generated at volume; the dominant quality failure mode of agentic teams.
Evals (evaluation harness): A golden set of real tickets the agent must solve correctly before a model or router change ships; tests for the AI.
Guardrails: Budget caps, sensitive-repo pins, gateway redaction, and a human production gate that keep the multiplier from multiplying mistakes.
Self-hosted tier: An open model served on rented or owned GPU behind an internal interface; roughly 8 to 15 times cheaper all-in than frontier at volume.