Case Study

Production AI in months, not quarters — without the technical debt

We shipped 22 production features in four months — under 1,000 hours, roughly 40% less than a traditional delivery would have needed — with AI-generated code that matched the architecture instead of drifting from it, and documentation that came free with the build. The method behind that result matters only because it makes the result repeatable.

AI coding assistants make most teams faster at producing technical debt

The default pattern with AI coding assistants looks productive and often isn't. A developer describes what they want, the assistant generates code, and the next several hours go to debugging what it misunderstood and reworking what conflicts with code elsewhere in the system. What you get is plausible output. Code that compiles. Code that looks right. Code that doesn't quite fit the system around it. Across a 22-feature build, that compounds into technical debt you pay down before you can ship anything new.

That debt is what kills most AI builds: the problem is reliability, not speed. So the result that's actually hard to buy isn't "fast." It's fast and architecturally consistent. That's what this project proves you can have.

  • You point AI assistants at vague prompts. Usually that means plausible code that compiles, looks right, and quietly drifts from the architecture. Here, every feature is fully specified before any production code is written.
  • Features pile up over a long build. Usually drift compounds into technical debt you pay down before shipping anything new. Here, feature #22 follows the same patterns as feature #1.
  • Tests get written after the code. Usually coverage is retrofitted, incomplete, and trusted less. Here, test cases are authored from requirements first, so coverage is designed in.
  • Documentation is a separate effort. Usually it's deferred indefinitely and the system becomes a black box. Here, the specs are the documentation — your team can own the system.

What the build delivered

Stripped to outcomes, the project produced five things — and they're the reasons the result holds up rather than just looking good on a timeline.

  • Speed. Production delivery in months, not the multi-quarter timeline a traditional engagement would have run.
  • Cost. Total effort landed under 1,000 hours — roughly 40% less than a traditional delivery of the same scope would have taken.
  • Reliability. No debt tax — the regression suite that protected the last feature was the same one that protected the first.
  • Ownership. The specs doubled as the documentation, so the system could be handed to an internal team to run.
  • Compounding. The spec, skill, and agent libraries carried forward — the first project absorbed the setup overhead, and every feature after inherited it.

The build matters less for what it was than for what it proves: AI-accelerated delivery can be fast and reliable at the same time. The rest of this page is why.


Why the result is repeatable, not luck

The fix for plausible-but-wrong AI code isn't to use less AI. It's to give the AI more structure to work against. We inverted the usual pattern: instead of using AI mainly to write code, we used it to think through the problem first, producing a structured specification that made the coding phase fast and architecturally consistent.

The approach is called Spec-Driven Development — a four-phase, mandatory process where every feature is fully specified before a single line of production code is written. We didn't invent the name or the idea; we adopted the discipline and built the tooling to enforce it.

Requirements  →  Design  →  Tasks  →  Test Cases  →  Implementation
 (Phase 1)      (Phase 2)  (Phase 3)   (Phase 4)        (Code)

Requirements are written in EARS format (Easy Approach to Requirements Syntax), a constrained pattern language that produces unambiguous, testable statements. A requirement written as "WHEN <trigger>, the <system> SHALL <response>" maps directly to a test assertion, which targets the most expensive class of defect: building the wrong thing. Design is written against the system's established architecture docs, so feature #22 follows the same patterns as feature #1. Tasks decompose the design into 2–4 hour units that trace back to specific requirements. Test cases are authored from the requirements before implementation (shift-left testing in the truest sense), so coverage is designed into the feature, not retrofitted. Only then does implementation start.
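
The mapping from an EARS statement to a test can be sketched roughly as follows. This is a minimal illustration, assuming the event-driven EARS form ("WHEN <trigger>, the <system> SHALL <response>"); the type names, the `parseEars` and `toTestSkeleton` helpers, and the sample requirement are hypothetical and are not the project's tooling.

```typescript
// Hypothetical sketch: names and shapes are illustrative assumptions.

interface EarsRequirement {
  id: string;
  trigger: string;  // the WHEN clause
  system: string;   // the subject of SHALL
  response: string; // the required behavior
}

interface TestSkeleton {
  requirementId: string;
  title: string;
  arrange: string;
  assertion: string;
}

// Matches "WHEN <trigger>, the <system> SHALL <response>." (case-insensitive).
const EARS_EVENT_PATTERN =
  /^when\s+(.+?),\s*(?:then\s+)?the\s+(.+?)\s+shall\s+(.+?)\.?$/i;

function parseEars(id: string, text: string): EarsRequirement | null {
  const match = EARS_EVENT_PATTERN.exec(text.trim());
  if (match === null) {
    // Not in the expected form: return null instead of guessing at intent.
    return null;
  }
  return { id: id, trigger: match[1], system: match[2], response: match[3] };
}

// Each parsed requirement yields one test skeleton that traces back to its id.
function toTestSkeleton(req: EarsRequirement): TestSkeleton {
  return {
    requirementId: req.id,
    title: req.id + ": " + req.system + " shall " + req.response,
    arrange: "trigger: " + req.trigger,
    assertion: "expect: " + req.response,
  };
}
```

A statement such as "WHEN a recruiter publishes a position, the system SHALL record a publish timestamp." parses into a trigger and a response, while a vague sentence such as "Publishing should be fast" returns `null` and goes back for rewording before any test is written.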

When the assistant has a real design specification — TypeScript interfaces, data models, API contracts — the generated code matches the architecture. Without it, you get plausibility.
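
To make "a real design specification" concrete, here is a small sketch of what a contract for one flow might look like. Everything in it is an assumption for illustration: the `JobPosition` model, the request and response shapes, and the `publishPosition` function are hypothetical and do not come from the project.

```typescript
// Hypothetical contract for a "position publish" operation.
// Field names and status values are illustrative assumptions.

type PositionStatus = "draft" | "published";

interface JobPosition {
  id: string;
  title: string;
  status: PositionStatus;
  publishedAt: string | null; // ISO 8601 timestamp once published
}

interface PublishPositionRequest {
  positionId: string;
  publishedBy: string;
}

type PublishPositionResponse =
  | { outcome: "published"; position: JobPosition }
  | { outcome: "rejected"; reason: "NOT_FOUND" | "ALREADY_PUBLISHED" };

// One implementation that satisfies the contract. Code written against these
// types cannot return a response shape that the design never described.
function publishPosition(
  positions: JobPosition[],
  request: PublishPositionRequest,
  now: string
): PublishPositionResponse {
  for (const position of positions) {
    if (position.id !== request.positionId) {
      continue;
    }
    if (position.status === "published") {
      return { outcome: "rejected", reason: "ALREADY_PUBLISHED" };
    }
    position.status = "published";
    position.publishedAt = now;
    return { outcome: "published", position: position };
  }
  return { outcome: "rejected", reason: "NOT_FOUND" };
}
```

The point of the sketch is the constraint, not the logic: when the types exist before the implementation, a generated function that invents a new status or an undocumented error fails the type check instead of surfacing in review.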


Under the hood

For anyone evaluating whether this is a disciplined system or a clever story: the discipline is enforced mechanically, not on the honor system.

Specialized agents with ownership boundaries. Rather than one general-purpose assistant doing everything, the project defined ten core specialized agents (Business Analyst, Architect, UI Designer, Developer, Tester, Reviewer, Code Analyzer, DevOps, Triage Analyst, and Lessons Analyst) plus two on-demand agents: a Researcher and an independent reviewer running on a second model. Each has explicit file ownership and explicit restrictions. A pre-tool hook rejects any write an agent attempts outside its ownership matrix: the Business Analyst can't quietly expand scope by editing the design; the Tester can't water down requirements to make tests pass; the Reviewer can't hide a flagged issue by editing the test it would fail.
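
An ownership check of this kind can be sketched in a few lines. The agent names below come from the list above, but the directory layout, the prefix-matching rule, and the `checkWrite` function are assumptions made for illustration; the project's actual hooks are shell scripts whose details are not shown here.

```typescript
// Hypothetical ownership matrix. Paths and matching rules are assumptions.

type AgentName =
  | "business-analyst"
  | "architect"
  | "developer"
  | "tester"
  | "reviewer";

const OWNERSHIP: Record<AgentName, string[]> = {
  "business-analyst": ["specs/requirements/"],
  "architect": ["specs/design/"],
  "developer": ["src/"],
  "tester": ["specs/test-cases/", "tests/"],
  "reviewer": ["reviews/"],
};

interface HookDecision {
  allowed: boolean;
  reason: string;
}

// Intended to run before a write tool call; `allowed: false` blocks the call.
function checkWrite(agent: AgentName, path: string): HookDecision {
  const normalized = path.replace(/\\/g, "/");
  // Refuse ".." segments so "src/../specs/" cannot slip past the prefix check.
  if (normalized.split("/").indexOf("..") !== -1) {
    return { allowed: false, reason: "path traversal is not permitted" };
  }
  for (const prefix of OWNERSHIP[agent]) {
    if (normalized.indexOf(prefix) === 0) {
      return { allowed: true, reason: agent + " owns " + prefix };
    }
  }
  return { allowed: false, reason: agent + " does not own " + normalized };
}
```

Under these assumed paths, `checkWrite("business-analyst", "specs/design/feature-01.md")` is refused, which is the "can't quietly expand scope by editing the design" case from the paragraph above.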

Independent second-opinion review. At every approval gate, the orchestrator dispatches an independent review of the artifact to a second model — a different one from the model that produced it (Codex, among others). A cheap way to catch what same-model self-review misses.

Hooks enforce the process. Twenty-plus shell hooks fire before and after every tool call. Blocking hooks refuse the action on a violation — spec edits during a dev phase, feature-spec creation before architecture exists, a "done" transition without a complete coverage matrix, phase transitions when the build fails, and destructive operations without explicit confirmation. The methodology is enforced mechanically, not relied on culturally: skip a step and you get a hard failure, not a polite reminder.
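
One of the blocking checks, the "done" transition that requires a complete coverage matrix and a passing build, might look like the sketch below. It assumes that "complete" means every requirement is covered by at least one test case; the `CoverageRow` shape and the `checkDoneTransition` function are hypothetical.

```typescript
// Hypothetical gate for the "done" transition. Shapes are assumptions.

interface CoverageRow {
  requirementId: string;
  testCaseIds: string[];
}

interface GateResult {
  allowed: boolean;
  violations: string[];
}

function checkDoneTransition(
  requirementIds: string[],
  coverage: CoverageRow[],
  buildPassed: boolean
): GateResult {
  const violations: string[] = [];
  if (!buildPassed) {
    violations.push("build is failing");
  }
  for (const requirementId of requirementIds) {
    // A requirement counts as covered only if some row lists a test for it.
    const covered = coverage.some(function (row) {
      return row.requirementId === requirementId && row.testCaseIds.length > 0;
    });
    if (!covered) {
      violations.push("no test case covers " + requirementId);
    }
  }
  // Any violation is a hard stop, not a warning.
  return { allowed: violations.length === 0, violations: violations };
}
```

Returning the full list of violations, instead of stopping at the first one, lets the agent or the human fix everything in one pass before retrying the transition.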

Reusable skills as institutional knowledge. Each phase runs through a dedicated skill — a 150–415 line template defining the output format, a quality checklist, and phase-specific patterns. Skills make the AI's output consistent regardless of which developer initiates it, and they're version-controlled artifacts, reviewed and improved like any other code. A new contributor inherits the team's conventions automatically instead of learning them by trial and error.

Quality gates in CI. Four layers — unit, integration, API, and UI end-to-end (Playwright) — run on every pull request, and failures block the merge. The E2E layer is deliberately kept to critical user flows (job import, position publish, sourcing-list creation, AI research launch, export) rather than exhaustive UI coverage, which tends to become flaky and ignored. Every shipped change had verified-passing tests at all four layers.

The system learns. Every test failure, review issue, or spec inconsistency surfaces as a pending lesson. After each feature's QA stage, a Lessons Analyst agent reviews them and proposes rules of two kinds: permanent ones (foundational, capped at five) and active ones (recent, capped at fifteen, retired after going unused across three features). Approved rules auto-load into every subsequent session, so the next feature inherits what the previous ones learned. By feature #22, the agents were working against a body of version-controlled judgment that didn't exist at feature #1.
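
The caps and the retirement rule can be sketched as a small rule store. Two points are assumptions and not stated in the text: that a new rule is refused once its cap is reached, and that "three unused features" means three consecutive features in which the rule was not applied. All names are hypothetical.

```typescript
// Hypothetical rule store. Cap handling and streak counting are assumptions.

type RuleKind = "permanent" | "active";

interface LessonRule {
  id: string;
  kind: RuleKind;
  text: string;
  unusedFeatureStreak: number; // consecutive features without use
}

const PERMANENT_CAP = 5;
const ACTIVE_CAP = 15;
const RETIRE_AFTER_UNUSED = 3;

// Adds an approved rule, refusing it when the cap for its kind is reached.
function approveRule(rules: LessonRule[], candidate: LessonRule): LessonRule[] {
  const sameKind = rules.filter(function (rule) {
    return rule.kind === candidate.kind;
  });
  const cap = candidate.kind === "permanent" ? PERMANENT_CAP : ACTIVE_CAP;
  if (sameKind.length >= cap) {
    throw new Error(candidate.kind + " rule cap of " + cap + " reached");
  }
  return rules.concat([candidate]);
}

// Runs after each feature: resets or extends streaks, retires stale active rules.
function closeFeature(rules: LessonRule[], usedRuleIds: string[]): LessonRule[] {
  const kept: LessonRule[] = [];
  for (const rule of rules) {
    const used = usedRuleIds.indexOf(rule.id) !== -1;
    const streak = used ? 0 : rule.unusedFeatureStreak + 1;
    const retire = rule.kind === "active" && streak >= RETIRE_AFTER_UNUSED;
    if (!retire) {
      kept.push({ id: rule.id, kind: rule.kind, text: rule.text, unusedFeatureStreak: streak });
    }
  }
  return kept;
}
```

Permanent rules are never retired by `closeFeature` in this sketch, which is why their cap is the tighter of the two: the only pressure on that list is the limit itself.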


Where humans stay in control

The AI accelerates production of artifacts. Humans control decisions.

  • Phase gates need a human signature. The AI can't advance to design without approved requirements, or to tasks without an approved design.
  • Scope changes go through specs first. A new requirement mid-implementation forces a hard stop: update requirements, design, tasks, and test cases, then resume. No undocumented functionality.
  • Architecture stays with humans. Three system-level documents are architect-controlled; agents read them as immutable context.
  • Code-review judgment is human. Structured checklists guide it, but accept-or-reject belongs to people.
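
The first two controls above can be read as a small state machine. The sketch below is one way to model it, under the assumptions that an approval is recorded per phase with a named human approver and that a scope change returns the feature to the requirements phase with earlier approvals voided. The types and function names are hypothetical.

```typescript
// Hypothetical phase gate. Approval shape and reset behavior are assumptions.

type SpecPhase = "requirements" | "design" | "tasks" | "test-cases" | "implementation";

const PHASE_ORDER: SpecPhase[] = [
  "requirements",
  "design",
  "tasks",
  "test-cases",
  "implementation",
];

interface Approval {
  phase: SpecPhase;
  approvedBy: string;       // a named person
  approverIsHuman: boolean; // agent sign-offs do not count
}

// Returns the next phase, or throws if the current one lacks human sign-off.
function nextPhase(current: SpecPhase, approvals: Approval[]): SpecPhase {
  const index = PHASE_ORDER.indexOf(current);
  if (index === PHASE_ORDER.length - 1) {
    throw new Error("implementation is the final phase");
  }
  const signedOff = approvals.some(function (approval) {
    return (
      approval.phase === current &&
      approval.approverIsHuman &&
      approval.approvedBy.trim().length > 0
    );
  });
  if (!signedOff) {
    throw new Error("cannot leave " + current + " without a human approval");
  }
  return PHASE_ORDER[index + 1];
}

// A mid-build scope change: back to requirements, earlier approvals voided.
function applyScopeChange(): { phase: SpecPhase; approvals: Approval[] } {
  return { phase: "requirements", approvals: [] };
}
```

Throwing, as opposed to returning a warning, mirrors the hard-failure behavior described for the hooks: the caller cannot proceed by ignoring the result.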

This isn't a caveat bolted on at the end — it's the premise the whole system is built on. AI agents left to operate without boundaries drift, contradict each other, and quietly expand scope. The harness is what makes them useful: constrained ownership, mechanical enforcement, human sign-off at every gate. Without it, the speed gains come at the cost of coherence. The honest version of this story is that the methodology works because of the control layer, not in spite of it.

The failure mode we designed against: overtrust. There's a predictable trap with AI output. When roughly 90% of what's generated is correct, reviewers stop scrutinizing the other 10%; the high hit rate quietly trains people to skim. Asking humans to "review harder" doesn't fix it, because it fights their own pattern recognition. Our answer is structural: the independent second-model review described above runs on every artifact before it reaches the human gate, so the human reviews a flagged diff, not a clean-looking wall of plausible output. The control layer exists because human attention degrades exactly where AI output looks most trustworthy.


How we built it

  • Twenty-two features fully specified through all applicable phases
  • ~750 commits across the engagement
  • Five application modules integrated: API, frontend, end-to-end tests, ETL workflows, database
  • Six fractional team members at uneven allocation: two developers (the largest share of code-effort), one QA engineer, one business analyst, one project manager, one solution architect
  • Under 1,000 hours of direct product-delivery effort over ~4 calendar months
  • Shift-left testing throughout: test cases authored from EARS requirements before implementation, so the regression suite grew as a designed artifact rather than a retrofit
  • Documentation produced as a byproduct of development, not a separate effort layered on top

We tracked phase-level effort against a traditional-delivery baseline throughout the build. The Spec-Driven + AI figures are measured from the project's own time tracking; the traditional-development figures are the reference baseline for the same work:

  • Requirements. 1–2 days of analysis and documentation in traditional development; 4–8 hours per feature here, across review iterations.
  • Technical design. 1–2 days of design and documentation traditionally; 2–4 hours here.
  • Task breakdown. Half a day in sprint planning traditionally; 30–60 minutes here.
  • Test cases. 2–3 days for thorough coverage traditionally; 2–4 hours here for comprehensive coverage.
  • Implementation. Traditionally, 20–40% of effort is absorbed by rework and misunderstandings; here, that rework drops by 30–40%.
  • Documentation. A separate effort after the build traditionally; zero additional effort here — the specs are the documentation.
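
Two of the headline figures can be sanity-checked with simple arithmetic. The sketch below rests on two readings that the text does not state explicitly: that the 40% saving is measured against the traditional total, and that the 30–40% drop in rework is a relative reduction of the rework share, not a drop in percentage points. Treat the outputs as illustrative, not as reported results.

```typescript
// Illustrative arithmetic only. Both interpretations are assumptions.

// If actual effort is (1 - saving) of the baseline, the baseline is actual / (1 - saving).
function impliedBaselineHours(actualHours: number, savingFraction: number): number {
  return actualHours / (1 - savingFraction);
}

// Rework share of total effort after a relative reduction.
function residualReworkShare(reworkShare: number, relativeReduction: number): number {
  return reworkShare * (1 - relativeReduction);
}

// Upper bound on the implied baseline, since actual effort was under 1,000 hours.
const baselineUpperBound = impliedBaselineHours(1000, 0.4); // about 1,667 hours

// Best case: 20% rework cut by 40%. Worst case: 40% rework cut by 30%.
const reworkBestCase = residualReworkShare(0.2, 0.4);  // about 0.12
const reworkWorstCase = residualReworkShare(0.4, 0.3); // about 0.28
```

Under these readings, the traditional baseline for the same scope comes out at no more than roughly 1,667 hours, and rework falls from 20–40% of effort to somewhere between about 12% and 28%.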

What this means for your build

This is the delivery shape we bring to client engagements — specs first, AI-accelerated, bounded agents, human-gated phase reviews. For you, it means a production system shipped in 2–4 months instead of a multi-quarter consulting engagement, at mid-market pricing, with code and documentation your team owns.


The result we'd aim for in your operation

If you need a production system shipped in months rather than quarters — one your team can own, without the technical-debt tax that sinks most AI builds — that's the result this pattern delivers.

A Pulse Check is where we'd start — free, 30 minutes, no slide deck. We listen to where the workflow breaks today, sketch what would need to exist before any code starts, and tell you honestly whether this pattern fits your build or whether something else does.