Methodology

AI generates code. We ship software.

Modern AI tooling can compress certain kinds of software work by 30–50%. But only inside a senior-led methodology. Without it, AI throughput correlates with stability regression — DORA's data is unambiguous. This is the loop we use, the tools we use it with, and the failure modes we explicitly avoid.


The thesis

The discipline isn't in the tool. It's in the loop around it.

Most engagements that fail under the "AI development" banner share the same root cause: AI is asked to drive the project. Models open the planning document, models pick the architecture, models scope the work, models write the code, and a human shows up at the end to push the button. The result is fast code that may or may not be the right software.

We don't do that. We treat AI as the fastest tool inside a senior-led process, not a substitute for the process. Senior engineers own the spec, own the tests, own the architecture, and own the merge button. AI generates implementation against verifiable targets that humans authored. Every change goes through a review the AI cannot bypass.

That sounds slower than "let the agent run." In practice it's materially faster — because the methodology removes the rework that comes from generating code against an undefined target, and removes the production incidents that come from merging code no senior engineer ever read.

The loop

Four steps. The AI runs inside them, never around them.

Step 1

Spec-driven development

What we do

A written spec — Markdown, version-controlled, owned by senior engineers — is the source of truth. We iterate the spec with you, then AI generates code from it. The spec persists across sessions, across model versions, and across team changes. The code is the by-product; the spec is the artifact.

Why it works

Modern coding agents are dramatically more reliable when given a stable target. Without a spec, an agent invents context that doesn't match your intent — every session starts from a slightly different mental model. With a spec read at session start, the model is anchored, and you get convergent generation across sessions and across engineers.

Concrete pattern

Every engagement starts with a SPEC.md (and ADRs for non-trivial architecture choices). Coding agents are configured to read those files at session boot. When requirements change, the spec changes first; only then does the code follow. The spec is the place arguments get resolved, not the code.
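
As a rough illustration of "read at session boot", the sketch below gathers SPEC.md and any ADRs into one context string that could be handed to an agent at the start of a session. The function name, the `docs/adr` location, and the `# FILE:` header format are assumptions made for this example, not a description of any particular agent's configuration.

```python
from pathlib import Path


def load_spec_context(repo_root: str, adr_dir: str = "docs/adr") -> str:
    """Concatenate SPEC.md and any ADR files into one context string.

    Hypothetical helper: the ADR directory and header format are assumptions.
    """
    root = Path(repo_root)
    spec = root / "SPEC.md"
    if not spec.is_file():
        # Refuse to start a session without a spec rather than let the agent guess.
        raise FileNotFoundError(f"No SPEC.md found in {root}")

    parts = [f"# FILE: SPEC.md\n{spec.read_text(encoding='utf-8')}"]

    adrs = root / adr_dir
    if adrs.is_dir():
        # Sorted so the context is identical from one session to the next.
        for adr in sorted(adrs.glob("*.md")):
            name = adr.relative_to(root).as_posix()
            parts.append(f"# FILE: {name}\n{adr.read_text(encoding='utf-8')}")

    return "\n\n".join(parts)
```

Failing loudly when the spec is missing is a deliberate choice in this sketch: a session with no spec is exactly the situation the step is meant to rule out.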

Failure mode it prevents

"The agent generated something that contradicts what we discussed three weeks ago." Spec-anchored development means the conversation is captured, dated, and verifiable — and the model has the same access to it as the engineer does.

Step 2

Acceptance-test-first (ATDD)

What we do

Before any production code is written, we author the acceptance tests: Given/When/Then scenarios that are human-readable and machine-runnable. The AI then generates the implementation against those tests as the verifiable target. The test is the contract; satisfying it is the definition of done.

Why it works

Tests give the model an unambiguous, executable definition of correctness. Generation against a defined target is convergent and verifiable. Generation against a prompt alone is divergent — different sessions produce different interpretations of the same English. The test removes the ambiguity.

Concrete pattern

For every user-facing slice, we write acceptance tests first as executable scenarios. The agent's job is to make those tests pass without modifying them. Tests are reviewed by humans before generation begins; once approved, they become the spec the AI implements against.
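
To make the pattern concrete, here is a minimal sketch of two acceptance scenarios written in Given/When/Then form as plain Python test functions. The chat handoff rule, the `Reply` type, and the 0.7 threshold are hypothetical, chosen only for illustration. In practice the test functions would be authored and approved first, and only `should_hand_off` would be generated.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    text: str
    confidence: float  # hypothetical model confidence in [0, 1]


def should_hand_off(reply: Reply, threshold: float = 0.7) -> bool:
    """Implementation under test: the only part the agent may write or change."""
    return reply.confidence < threshold


def test_low_confidence_reply_is_handed_to_operator():
    # Given a drafted reply whose confidence is below the threshold
    reply = Reply(text="I think your refund is approved?", confidence=0.4)
    # When the handoff rule is evaluated
    decision = should_hand_off(reply)
    # Then the conversation goes to a human operator
    assert decision is True


def test_confident_reply_stays_with_the_assistant():
    # Given a drafted reply whose confidence is above the threshold
    reply = Reply(text="Your order shipped on Monday.", confidence=0.95)
    # When the handoff rule is evaluated
    decision = should_hand_off(reply)
    # Then the assistant keeps the conversation
    assert decision is False
```

A test runner such as pytest would collect the two `test_` functions automatically; the scenarios stay fixed while the implementation is regenerated until they pass.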

Failure mode it prevents

"The AI hallucinated a different feature than we asked for." When the test is the contract, the model literally cannot satisfy a different feature — it has to make the actual scenarios pass.

Step 3

Senior-engineer-in-the-loop

What we do

Every AI-generated change goes through a senior engineer before merge. No exceptions. AI agents can read code, write code, run tests, even open pull requests — they cannot merge. The reviewer is the human authority on architecture, security, and the constraints that live in the team's head.

Why it works

DORA's 2024–25 research is explicit: with AI adoption, pull request throughput goes up by ~20% while incidents per pull request go up by ~24% in teams without disciplined review. The cost reduction is real. The quality regression is also real, and it shows up in production weeks after the velocity number lands in a slide. Senior review is the step that protects the gain.

Concrete pattern

Pull requests authored by AI are tagged as such and reviewed under the same standard as human-authored code, often more strictly — agents are excellent at producing code that looks reasonable and quietly violates a constraint nobody told them about. Senior reviewers explicitly check for those failure modes.
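
The merge rule can be expressed as a small policy check. The sketch below is illustrative only: the account names, the reviewer list, and the `PullRequest` shape are assumptions, and a real setup would more likely enforce this through the code host's branch protection settings than through custom code.

```python
from dataclasses import dataclass, field

# Hypothetical account lists, for illustration only.
SENIOR_REVIEWERS = {"alice", "bob"}
AGENT_ACCOUNTS = {"coding-agent"}


@dataclass
class PullRequest:
    author: str
    ai_authored: bool  # the tag reviewers see; the merge rule is the same either way
    approvals: set[str] = field(default_factory=set)


def may_merge(pr: PullRequest, actor: str) -> bool:
    """Return True only if a human merges a change a senior engineer approved."""
    if actor in AGENT_ACCOUNTS:
        # Agents can open pull requests but never merge them.
        return False
    # Approval must come from a senior reviewer other than the author.
    senior_approvals = (pr.approvals & SENIOR_REVIEWERS) - {pr.author}
    return bool(senior_approvals)
```

For example, `may_merge(PullRequest("coding-agent", True, {"alice"}), actor="alice")` is allowed, while the same pull request with `actor="coding-agent"` is not.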

Failure mode it prevents

"We shipped a critical bug because the AI didn't know about a constraint that lives in someone's head." Senior review is the place tacit knowledge gets applied to generated code.

Step 4

Eval & safety gates

What we do

Automated quality gates run before any merge: type checks, lints, unit and acceptance tests, and — for AI features — eval suites that test for hallucination, drift, and regression against a curated dataset. Cost ceilings are checked per LLM call. Grounding is verified against source-of-truth data.

Why it works

Production AI degrades silently. Models drift. Prompts that worked last month start producing different outputs as upstream models update. Without continuous evaluation, the first time you find out is when a customer files a ticket. Eval gates make degradation visible inside the development loop, not after deployment.

Concrete pattern

CI runs the full quality + eval pipeline on every pull request. AI-feature changes trigger expanded eval suites measured against a held-out dataset. Hallucination grounding is checked by re-running the prompt against pinned context and asserting outputs cite real sources. Cost regressions are flagged automatically.
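
One possible shape for a single eval check is sketched below: it asserts that every source cited in an answer exists in the pinned context and that the call stayed under a cost ceiling. The `[doc1]`-style citation syntax, the `EvalCase` fields, and the ceiling value are assumptions made for this example, not the format of any specific eval harness.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalCase:
    answer: str            # model output produced against the pinned context
    context_ids: set[str]  # ids of the documents in the pinned context
    cost_usd: float        # measured cost of the LLM call


# Assumed citation syntax: bracketed ids such as [doc1].
CITATION = re.compile(r"\[(\w+)\]")


def check_case(case: EvalCase, cost_ceiling_usd: float) -> list[str]:
    """Return a list of failure messages; an empty list means the case passes."""
    failures = []
    cited = set(CITATION.findall(case.answer))
    if not cited:
        failures.append("answer cites no sources")
    unknown = cited - case.context_ids
    if unknown:
        failures.append(f"cites sources not in pinned context: {sorted(unknown)}")
    if case.cost_usd > cost_ceiling_usd:
        failures.append(
            f"cost {case.cost_usd:.4f} exceeds ceiling {cost_ceiling_usd:.4f}"
        )
    return failures
```

A CI job could run `check_case` over the whole dataset and fail the pull request if any case returns a non-empty list.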

Failure mode it prevents

"The model started hallucinating in production three weeks after we shipped it." Continuous eval running in CI catches drift before it reaches users.

The tools we actually use

Named. With roles, not endorsements.

Tools are interchangeable; the methodology is what matters. But buyers reasonably ask what's in the box, so here's the actual stack we run on as of now.

Primary coding agent

Claude Code (Opus 4.7)

Lives in the IDE and the terminal. Used by senior engineers for reading codebases, generating implementations against specs and tests, refactoring, debugging, and large multi-file changes. The bulk of our agent-driven work runs here.

Secondary agent

OpenAI Codex

Used in parallel with Claude for exploration on tricky problems and for benchmarking generation quality where the two models disagree. Useful when the task profile favors a different reasoning style.

In-editor flow

Cursor

Used when an engineer wants the agent embedded directly in the file-edit flow rather than driven from a chat or terminal session. Composer is heavily used for cross-file refactors.

Autocomplete tier

GitHub Copilot

Suggestions for routine code where a full agent loop is overkill — typing acceleration rather than generation. Lives alongside the heavier agents.

Eval & grounding

Custom eval runners

Internal tooling for AI feature validation, hallucination grounding, drift detection, and cost-bounded testing. The same eval harness that protects XGAIMS, our own production platform, runs in client engagements.

CI / CD

GitHub Actions + safety gates

The full quality pipeline — types, lints, tests, evals — runs on every pull request. AI-feature changes trigger expanded eval suites. Cost regressions and hallucination drift are flagged automatically before merge.

We re-evaluate this stack every quarter. Models change fast; the methodology changes slowly. The right setup six months from now will look different — the loop won't.

Worked example

How XGAIMS got built.

XGAIMS is a multi-tenant production platform — 50+ admin surfaces, 20+ background services orchestrating LLM, ML, and ETL workloads, bidirectional CRM sync, real-time chat with operator handoff, hallucination-grounded LLM outputs, predictive segmentation, and a closed-loop feedback system. Built by a small senior team using the exact methodology on this page. Four concrete moments where the loop earned its keep:

Step 1 · Spec

The finance enforcement layer

XGAIMS has a finance enforcement layer that gates every spend action — campaigns, ad placements, vendor calls — against budget rules. The spec for this layer was written before any code: explicit allow/deny rules, escalation paths, audit-trail requirements, failure-mode handling. The spec went through three rounds of review with the team before the first commit. Every implementation pass since — across multiple coding agents and team rotations — has worked against that spec. The model never invents the budget rules; it reads them.

Step 2 · Tests

The chat inbox handoff

XGAIMS has a real-time chat inbox where AI handles low-stakes conversations and hands off to a human operator at the autonomy ceiling. The handoff trigger logic was written as Given/When/Then acceptance tests first — over forty scenarios covering unclear intent, escalation phrases, response confidence below threshold, regulated content, and edge cases. The implementation was generated against those tests. Six months in production, the handoff logic has needed three small adjustments — all made by adding scenarios, then regenerating.

Step 3 · Review

A grounding bug caught at review

Early in the hallucination grounding implementation, an agent generated code that constructed a citation string from the prompt rather than from the retrieved context. Tests passed because the test fixtures used identical text in both. A senior reviewer caught it on read — the model had silently changed where the truth came from. The fix took twenty minutes; the bug, had it shipped, would have been the kind of incident that costs a customer.
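
The sketch below is a simplified reconstruction of that class of bug, not the actual XGAIMS code; all function names and strings are invented for illustration. It shows why a fixture that reuses the same text for the prompt and the retrieved context cannot tell the two implementations apart, while a fixture with distinct text can.

```python
def cite_from_prompt(prompt: str, retrieved: str) -> str:
    # Buggy variant: quotes the prompt instead of the retrieved source.
    return f'Source: "{prompt}"'


def cite_from_context(prompt: str, retrieved: str) -> str:
    # Intended variant: quotes the retrieved context.
    return f'Source: "{retrieved}"'


def is_grounded(citation: str, retrieved: str) -> bool:
    # A citation counts as grounded only if it quotes the retrieved text.
    return retrieved in citation


# Fixture with identical text in both fields: the bug is invisible.
same = "Refunds are processed within 5 days."
assert is_grounded(cite_from_prompt(same, same), same)

# Fixture with distinct text: the buggy variant fails, the intended one passes.
prompt = "How long do refunds take?"
retrieved = "Refunds are processed within 5 days."
assert not is_grounded(cite_from_prompt(prompt, retrieved), retrieved)
assert is_grounded(cite_from_context(prompt, retrieved), retrieved)
```

Keeping prompt and context text distinct in fixtures is a cheap way to turn this kind of review finding into a permanent test.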

Step 4 · Evals

Continuous drift detection

Every LLM-driven feature in XGAIMS has an eval suite running on every pull request, plus a nightly run against the production model. When an upstream model release shifted output style on a content-generation feature, the eval suite caught the drift the same night. We pinned the prior model version, ran the new one through review, and rolled forward when the diff was understood. The user-facing surface never changed.
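
A drift check of this kind can be as simple as comparing eval scores for the pinned model against scores for the new one. The sketch below assumes each eval case yields a numeric score between 0 and 1 and that a mean drop above a fixed tolerance counts as drift; both the scoring scheme and the 0.05 tolerance are assumptions for illustration.

```python
from statistics import mean


def detect_drift(
    baseline_scores: list[float],
    candidate_scores: list[float],
    max_drop: float = 0.05,
) -> bool:
    """Return True if the candidate's mean eval score fell by more than max_drop.

    Hypothetical check: scores are assumed to be per-case values in [0, 1].
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    drop = mean(baseline_scores) - mean(candidate_scores)
    return drop > max_drop
```

A nightly job could call `detect_drift` with the pinned model's scores as the baseline and keep the pin in place whenever it returns True.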

The point. The methodology isn't theoretical. We run a real production AI system on it. When we apply it to your codebase, we're applying something we already use to keep our own product alive.

The research

Why this loop is the loop.

The four steps weren't invented in a vacuum. They map to the recurring pattern across the major AI-engineering productivity studies of the last two years: the gap between teams that capture AI productivity and teams that just generate more pull requests is process discipline.

McKinsey 2025

Trained engineering teams using AI coding tools save 30–60% of time on coding, test generation, and documentation.

The headline gain is real. The qualifier — "trained" teams — does most of the work. Untrained teams see no aggregate productivity gain.

MIT / GitHub RCT

Developers using GitHub Copilot completed coding tasks 55.8% faster than the control group on a randomized task set.

Tool-level acceleration is well-established at the individual task level. The harder question is whether team-level throughput follows — see DORA below.

Google DORA 2024–25

AI adoption correlates with ~20% higher pull request throughput AND ~24% higher incident rate per pull request — in teams without disciplined review.

AI throughput converts to value only with senior review in the loop. Strip the review and the speed gain becomes a stability regression.

Faros AI 2025

~21% more tasks completed per developer with AI assistance; review time up 91% as pull request volume grew 98%.

Without process changes, downstream bottlenecks (review, CI wait time, deployment) absorb the upstream gain. Capturing AI productivity requires reshaping the loop, not just adding the tool.

METR 2024–25

Test-anchored AI loops produce materially more verifiable code than prompt-driven approaches; the gap widens on longer task horizons.

The longer the work horizon, the more the methodology matters. Test-first generation is the practical mechanism that keeps long-running agent work convergent.

Read across these results, the pattern is consistent: AI generates code faster, but the gain only converts to business value when senior review, tests, and eval gates are present. Strip those out and the gain becomes a quality regression. Our methodology is the version of these findings encoded into a process.

Failure modes

Where the methodology breaks (and what we won't do).

The honest version. Some of these are choices we make; some are constraints buyers bring. We'd rather flag them up front than discover them in production.

AI-driven planning

We don't let AI scope the project. Architecture, sequencing, and trade-offs are decided by senior engineers. AI is excellent at generating and analyzing — and consistently weak at deciding what should be built.

"Vibe coding" — no spec, no test

If a buyer wants us to skip the spec or tests to "go faster," we say no. The methodology is what makes the speed real; remove it and you get fast code, not fast software. The fastest path to a production incident is a merged pull request that no senior engineer ever read.

Skipping senior review

We won't, even on small changes. The 30–50% cost reduction relies on the loop catching things; remove the loop and the cost reduction becomes a quality regression.

Legacy systems with no APIs

Home-grown systems with no APIs and no internal expertise on the source code slow us down materially. The methodology still pays off — AI is unusually good at reading old code — but the speed claims on the rest of this site assume the connected systems are reachable.

Months-long IT or security review cycles

Political bottlenecks erode the speed claim more than any technical issue. We can ship inside compliance regimes, but we'll quote the calendar honestly — the gating cost dominates the build cost in those engagements.

Scope changing weekly

The methodology compresses time against a stable target. If the spec changes faster than the build can converge against it, the loop's compounding gains evaporate. We'll work with you to get the scope stable enough to commit to.

In the harder cases, we can still ship — the methodology still pays off — but the acceleration curve is flatter and the timeline is longer. We quote those engagements honestly, including where the speed and cost claims on the rest of this site don't apply.

Want to put the methodology to work?

Tell us what you're trying to ship. Or start with a $20K First Build: one deliverable, four weeks, the full loop, no commitment beyond that.


Or take the methodology with you — get the PDF →