AI writes most of the code in my projects now. That does not mean I review less. It means I review differently. The files I still open line-by-line are the ones whose failure mode is measured in hours of downtime, six-figure cloud bills, or a data breach. Everything else can be skimmed, spot-checked, or trusted. This is the blast-radius framework I use.
## 1. The Time My Deploy Pushed to the Wrong Repo
A few weeks ago I was running a long refactor session in Windsurf Cascade on the SWE-1.5 model. Mostly clean work. At some point the agent did what agents do: it switched branches, checked out something else, ran a few operations, and left the environment in a different shape than where I started.
I did not notice. I then ran a deploy script.
The script happily pushed changes to the wrong Git remote. Not the one I intended. It could have happened inside any agentic coding tool - Cursor, Claude Code, Aider, Copilot Workspace. They all manipulate repository state. This one just happened to be Windsurf.
I got lucky. The branch I was on was not production. A git reset to the correct commit cleaned it up. No data loss, no angry Slack messages, no postmortem.
But the lesson stuck. The code the agent wrote was fine. The thing that almost hurt me was not the code. It was the environment state around the code - branches, remotes, HEAD, deploy scripts, the `.git/config`. Files that are not even "code" in the traditional sense.
That incident forced me to rewrite how I review AI-generated work. The old "review every PR line-by-line" rule does not scale when an agent produces 400 lines in 30 seconds. The "trust it, move on" rule is worse. The answer is a tiered system.
## 2. Why "AI Codes Everything" Is a Bad Default
Before the framework, some numbers that should make every engineering leader uncomfortable.
Stanford's CCS '23 study "Do Users Write More Insecure Code with AI Assistants?" (Perry, Srikumar, Boneh et al.) found two things in the same breath. Developers with AI assistants wrote significantly less secure code than those without. And the same developers were more confident their code was secure. The confidence paradox is the dangerous part. If you feel safer, you review less. If you review less, you catch less.
Snyk's 2024 AI Code Security Report sharpens it further. Up to 40% of AI-assisted code contains security flaws. Nearly 80% of developers admit bypassing security policies when using AI tools. Only 10% scan most of the AI-generated code they ship.
Meanwhile, GitGuardian's State of Secrets Sprawl 2024 found 23.77 million new hardcoded secrets added to public GitHub in a single year - a 25% increase year-on-year. 70% of secrets detected back in 2022 were still valid in 2024.
Put those three reports next to each other and a pattern falls out. We are generating more code, faster, with more confidence, while reviewing less of it. The risk does not disappear because the developer feels good. It just moves from "caught in review" to "caught in production."
The fix is not to review everything again. We tried that - it is why PRs rotted for three days before anyone approved them. The fix is to decide what you review line-by-line and what you genuinely let the model own, based on what happens when it is wrong.
## 3. The Blast Radius Principle
Every file in a codebase has three properties that matter for review triage:
- Blast radius - how many users, systems, or dollars are affected when this file is wrong?
- Reversibility - if it breaks, can I fix it in 5 minutes, 5 hours, or never?
- Detection lag - how quickly will I notice the mistake? Immediately, next deploy, next quarter, or after an auditor calls?
A UI button with the wrong colour has small blast radius, high reversibility, instant detection. A Terraform file that deletes a production database on apply has enormous blast radius, near-zero reversibility, and often silent detection until it is too late.
These two files deserve radically different review effort. Treating them the same is where teams burn out or get breached.
The five tiers below rank files by the product of those three properties. Tier 1 carries the highest compound risk and demands the most human eyeballs; Tier 5 the least.
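To make the triage mechanical, here is a toy sketch of the idea. The multiplicative score and the thresholds are mine and purely illustrative - calibrate them against your own incidents, not against this post:

```javascript
// Toy triage: score each property 1 (benign) to 5 (scary). Reversibility and
// detection are inverted, so "hard to reverse" and "slow to notice" score high.
// Thresholds below are illustrative, not a calibrated model.
function reviewTier({ blastRadius, irreversibility, detectionLag }) {
  const risk = blastRadius * irreversibility * detectionLag; // 1..125
  if (risk >= 60) return 1; // read every line
  if (risk >= 24) return 2; // block on human approval
  if (risk >= 8) return 3;  // read the diff
  if (risk >= 3) return 4;  // 30-second skim
  return 5;                 // trust, catch in staging
}

// A Terraform file vs a CSS tweak, scored the way this section describes.
console.log(reviewTier({ blastRadius: 5, irreversibility: 5, detectionLag: 4 })); // 1
console.log(reviewTier({ blastRadius: 1, irreversibility: 1, detectionLag: 2 })); // 5
```

The exact numbers matter less than forcing yourself to answer the three questions per file instead of per PR.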
## 4. Tier 1 - Always Review Line-by-Line
Review posture: read every line, every time, even if the diff is one character. No shortcuts. No "LGTM" from skimming.
Files in this tier:
- Dockerfiles and base image selections - a wrong `FROM` silently ships vulnerabilities or breaks the production runtime
- Infrastructure as Code - Terraform, Pulumi, CloudFormation, Kubernetes manifests, Helm charts
- CI/CD configuration - GitHub Actions, GitLab CI, Jenkins, deploy scripts, release automation
- IAM and secrets handling - policy files, RBAC configs, `.env.example`, anything that touches credentials
- Database migrations - schema changes, backfills, destructive statements, anything that runs once and cannot be reverted cleanly
- Environment state - `.git/config`, remote URLs, branch protection rules, pre-commit hooks
Why: these files have the highest blast radius in the system. A bad Dockerfile can 10x your image size and break every deploy until fixed. A bad IAM policy can make an S3 bucket world-readable. A bad migration can corrupt a 50M-row table. An AI-generated Terraform plan applied with `-auto-approve` and no plan-review step can delete resources you did not know existed.
AI is very good at generating plausible-looking infrastructure. "Plausible-looking" and "correct" are not the same thing. Read every line.
Example - Dockerfile:

Wrong:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

Correct:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
USER node
EXPOSE 3000
CMD ["node", "server.js"]
```

That single missing `USER node` line is why you read every line. Without it, the container runs as root by default. An attacker who breaks out of the container has root on your infrastructure.
Example - Deploy script, with the Windsurf incident in mind:

Wrong:

```bash
#!/bin/bash
set -e
npm run build
git push origin main:production
```

Correct:

```bash
#!/bin/bash
set -e

echo "Current branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Current remote: $(git remote get-url origin)"
echo "HEAD commit:    $(git rev-parse HEAD)"

read -p "Continue deploy? (y/n) " -n 1 -r
echo
[[ $REPLY =~ ^[Yy]$ ]] || exit 1

if [[ "$(git rev-parse --abbrev-ref HEAD)" != "main" ]]; then
  echo "ERROR: Not on main branch. Aborting."
  exit 1
fi

npm run build
git push origin main:production
```

Without that environment check, my Windsurf incident would have pushed to the wrong remote silently. The AI generated the script correctly; the problem was environment state the agent had changed and I did not verify.
## 5. Tier 2 - Review Before Merge
Review posture: block on human approval. Run it locally or in staging. Do not trust the AI's "I tested this" claim - it usually has not.
Files in this tier:
- API contracts - OpenAPI specs, GraphQL schemas, protobuf definitions, public SDK interfaces
- Authentication and authorisation logic - login flows, session handling, token validation, middleware
- Payment and billing code - Stripe webhooks, subscription logic, invoice generation, refund flows
- Rate limiting, quotas, throttling - anything that protects your system from abuse or your bill from surprises
- Cross-service contracts - message queue schemas, event payloads, webhook formats
Why: Tier 2 is where the confidence paradox from the Stanford study hits hardest. Auth bugs are the most cited class of "AI wrote it, looked right, shipped" failures. An AI-generated JWT validator that skips signature checks on expired tokens looks identical to one that does not. A Stripe webhook handler that does not verify the signing secret looks identical to one that does. The difference is a two-line check, and the diff does not scream at you.
These files are also where breaking changes ripple outward. A silently modified API contract can break three downstream clients you do not own.
Example - JWT validation (the dangerous diff):

Wrong:

```javascript
function validateToken(token) {
  try {
    const decoded = jwt.decode(token); // decode != verify
    if (decoded.exp < Date.now() / 1000) {
      return { valid: false };
    }
    return { valid: true, user: decoded };
  } catch (e) {
    return { valid: false };
  }
}
```

Correct:

```javascript
function validateToken(token) {
  try {
    const decoded = jwt.verify(token, process.env.JWT_SECRET);
    return { valid: true, user: decoded };
  } catch (e) {
    return { valid: false };
  }
}
```

The difference is `jwt.decode()` vs `jwt.verify()`. One line. The first checks the expiry claim but never checks the signature, so it accepts any well-formed token - signed by anyone, or by no one. The second rejects anything not signed with your secret. A human skimming this diff might miss the semantic difference.
Example - Stripe webhook verification:

Wrong:

```javascript
app.post('/webhook', (req, res) => {
  const event = JSON.parse(req.body.toString());
  if (event.type === 'charge.succeeded') {
    markPaymentProcessed(event.data.object.customer_id);
  }
  res.json({ received: true });
});
```

Correct:

```javascript
app.post('/webhook', (req, res) => {
  const signature = req.headers['stripe-signature'];
  let event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body.toString(),
      signature,
      process.env.STRIPE_WEBHOOK_SECRET
    );
  } catch (err) {
    return res.status(400).send(`Webhook Error: ${err.message}`);
  }
  if (event.type === 'charge.succeeded') {
    markPaymentProcessed(event.data.object.customer_id);
  }
  res.json({ received: true });
});
```

Without the signature check, an attacker can send a fake webhook saying "charge succeeded" without ever paying. The second version only accepts events signed with the exact webhook secret Stripe issued you. Not optional.
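Rate limiting from the same tier deserves the same scrutiny. Here is a minimal sketch of a fixed-window limiter - names and limits are mine, purely illustrative. The thing to review is not the arithmetic but what `key` is derived from: keying on a spoofable header like `X-Forwarded-For` quietly defeats the whole mechanism:

```javascript
// Illustrative in-memory fixed-window rate limiter. Production systems want a
// shared store (e.g. Redis) and usually a sliding window, but the review
// questions are the same: what is the key, and what happens at window edges?
function createRateLimiter({ windowMs, max }) {
  const windows = new Map(); // key -> { count, startedAt }
  return function allow(key, now = Date.now()) {
    const w = windows.get(key);
    if (!w || now - w.startedAt >= windowMs) {
      windows.set(key, { count: 1, startedAt: now });
      return true;
    }
    w.count += 1;
    return w.count <= max;
  };
}

const allow = createRateLimiter({ windowMs: 60_000, max: 3 });
const results = [1, 2, 3, 4].map(() => allow('203.0.113.7'));
console.log(results); // [ true, true, true, false ]
```

An AI-generated limiter will almost always get the counting right and the keying wrong - that is the two-line check a Tier 2 review exists to catch.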
## 6. Tier 3 - Review the Diff
Review posture: read the diff, scan for obvious smells, run tests, trust on the second look.
Files in this tier:
- Core business logic - domain models, service classes, orchestration layers
- Data pipelines - ETL scripts, transformation logic, aggregations that feed dashboards
- Background jobs and workers - queue consumers, cron handlers, scheduled tasks
- Caching layers - Redis wrappers, in-memory caches, cache invalidation logic
Why: these files shape behaviour, not infrastructure. A bug here costs you a wrong number in a report, a delayed job, or a stale cache - annoying, but reversible within a sprint. Read the diff, flag anything that feels off, lean on your test suite as the second reviewer. Do not read every line; you will burn out.
The failure mode to watch for is what I called out in The Hidden Cost of AI-Generated Code: locally correct code that erodes global coherence. The diff looks fine. The third time you look at the file, you realise the AI invented a new pattern that conflicts with the three that already existed. Catch those in the diff review, not six sprints later.
Example - Business logic with pattern drift:

Wrong (pattern inconsistent with the codebase):

```javascript
class UserService {
  async createUser(data) {
    // AI-introduced pattern: signal failure by throwing
    if (!data.email || !data.name) {
      throw new Error('Missing fields');
    }
    // ...create the user
  }
}
```

Correct (matches the existing codebase pattern):

```javascript
class UserService {
  async createUser(data) {
    // Existing pattern: validate, then return an error object
    const validation = validateUser(data);
    if (!validation.ok) {
      return { error: validation.errors };
    }
    // ...create the user
  }
}
```

Both work. But one throws exceptions, the other returns error objects. Three sprint cycles later, another engineer will write code expecting one and get the other. You just added inconsistency debt. Read the diff to catch this early; it will not fail in CI.
## 7. Tier 4 - Skim for Smells
Review posture: a 30-second look. Make sure it is not obviously wrong. Let tests catch the rest.
Files in this tier:
- Internal utilities - helpers, formatters, date manipulation, string parsing
- Test files - unit tests, fixtures, mocks, test helpers
- Scripts - one-off data fixes, local dev tooling, developer ergonomics
- Config for non-critical tools - linters, formatters, editor configs
Why: low blast radius, high reversibility. A broken util breaks the one function that calls it - you will see it in CI within minutes. Tests are self-verifying by design. Scripts are run once and thrown away. Obsessing over these files is where review effort goes to die.
The one exception: if the test is the only guardrail protecting a Tier 1 or Tier 2 file, promote it. A test covering your Stripe webhook verifier is a Tier 2 file in disguise.
## 8. Tier 5 - Trust But Verify Later
Review posture: trust the model, ship it, catch it in staging or production with your eyes.
Files in this tier:
- UI components - React/Vue/Svelte components, styling, layouts
- CSS and Tailwind - spacing, colours, responsive breakpoints
- Copy and content - marketing pages, microcopy, tooltip text
- Images, icons, static assets
- Documentation - READMEs, code comments, changelogs
Why: these are the highest-surface-area, lowest-blast-radius files in your codebase. A misaligned button costs nothing. A typo in a tooltip is a same-day fix. Visual regression tests, Storybook, and a browser window catch the rest faster than you can read a diff.
This is the tier where AI legitimately earns its speed gains. Let the model work. Spot check in the preview deploy.
## 9. The Review Checklist You Can Steal
Put this at the top of your engineering handbook. Or paste it into your `CLAUDE.md` or `.cursorrules` so the agent itself respects the hierarchy.
| Tier | Review Effort | Example Files | Failure Cost |
|---|---|---|---|
| 1 | Every line, every time | Dockerfile, Terraform, IAM, migrations, CI/CD, git config | Outage, breach, weekend lost |
| 2 | Block on human approval | API contracts, auth, payments, rate limits | Silent data corruption, security hole |
| 3 | Read the diff, trust tests | Business logic, pipelines, workers | Wrong number in a report |
| 4 | 30-second skim | Utilities, tests, dev scripts | CI failure, fix in minutes |
| 5 | Trust, catch in staging | UI, CSS, copy, docs | Visual bug, same-day fix |
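For the agent-facing version, here is one way the table could be condensed into a rules file. The wording and file globs are mine and illustrative - adapt the paths to your repo:

```text
## AI review tiers (see engineering handbook)
- Tier 1 (Dockerfile, *.tf, .github/workflows/*, migrations/*, IAM policies):
  never auto-apply; always present a diff and wait for explicit human approval.
- Tier 2 (API contracts, auth/*, payments/*, rate limiting):
  require human approval before merge; never claim "tested" without showing the run.
- Tier 3 (services/*, pipelines/*, workers/*): follow the error-handling and
  validation patterns already in the file; do not introduce new ones.
- Tier 4/5 (utils, tests, UI, copy, docs): proceed without asking.
```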
Three rules that make this stick on a real team:
- Encode the tiers in your `CODEOWNERS` file. Tier 1 files should require a senior engineer on the PR regardless of who authored it. GitHub will enforce what discipline will not.
- Run a pre-deploy check for environment drift. Before any deploy script runs, it should print the current branch, current remote, and current HEAD, and ask for explicit confirmation. This would have caught my Windsurf incident in three seconds.
- Scan Tier 1 and Tier 2 files with automated tooling - Snyk, GitGuardian, `tflint`, `hadolint`, `kubeval`. Humans review judgement. Tools review syntax. Do not make humans do what tools do faster.
## 10. Closing: Review Narrower, Review Harder
The common reading of AI-assisted engineering is that review becomes optional. It does not. It becomes surgical.
Every hour of review time you used to spend reading a React component is now free. Spend it on the Terraform file instead. Spend it on the auth middleware. Spend it on the migration that runs on 50 million rows tomorrow morning.
The Stanford researchers summarised their finding in a line I have not stopped thinking about. Developers who trusted the AI less and engaged more with their prompts produced code with fewer security vulnerabilities. Less trust, more engagement. That is the whole discipline.
AI is a force multiplier on everything - including mistakes. The files where a mistake costs you the most are the files where you have to slow down, zoom in, and read every line like a human who still takes responsibility for production.
The other 80%? Let the model cook.
Related reading: The Hidden Cost of AI-Generated Code (and How to Fix It) and The Ideal Claude Code Project Structure That Actually Scales.
Working through the challenges in this post? I help engineering leaders and CTOs navigate complex technical decisions and scale high-performing teams. Schedule a consultation →