AI writes most of the code in my projects now. That does not mean I review less. It means I review differently. The files I still open line-by-line are the ones whose failure mode is measured in hours of downtime, six-figure cloud bills, or a data breach. Everything else can be skimmed, spot-checked, or trusted. This is the blast-radius framework I use.
## 1. The Time My Deploy Pushed to the Wrong Repo
A few weeks ago I was running a long refactor session in Windsurf Cascade on the SWE-1.5 model. Mostly clean work. At some point the agent did what agents do: it switched branches, checked out something else, ran a few operations, and left the environment in a different shape than where I started.
I did not notice. I then ran a deploy script.
The script happily pushed changes to the wrong Git remote. Not the one I intended. It could have happened inside any agentic coding tool - Cursor, Claude Code, Aider, Copilot Workspace. They all manipulate repository state. This one just happened to be Windsurf.
I got lucky. The branch I was on was not production. A git reset to the correct commit cleaned it up. No data loss, no angry Slack messages, no postmortem.
But the lesson stuck. The code the agent wrote was fine. The thing that almost hurt me was not the code. It was the environment state around the code - branches, remotes, HEAD, deploy scripts, the `.git/config`. Files that are not even "code" in the traditional sense.
That incident forced me to rewrite how I review AI-generated work. The old "review every PR line-by-line" rule does not scale when an agent produces 400 lines in 30 seconds. The "trust it, move on" rule is worse. The answer is a tiered system.
## 2. Why "AI Codes Everything" Is a Bad Default
Before the framework, some numbers that should make every engineering leader uncomfortable.
Stanford's CCS '23 study "Do Users Write More Insecure Code with AI Assistants?" (Perry, Srikumar, Boneh et al.) found two things in the same breath. Developers with AI assistants wrote significantly less secure code than those without. And the same developers were more confident their code was secure. The confidence paradox is the dangerous part. If you feel safer, you review less. If you review less, you catch less.
Snyk's 2024 AI Code Security Report sharpens it further. Up to 40% of AI-assisted code contains security flaws. Nearly 80% of developers admit bypassing security policies when using AI tools. Only 10% scan most of the AI-generated code they ship.
Meanwhile, GitGuardian's State of Secrets Sprawl 2024 found 23.77 million new hardcoded secrets added to public GitHub in a single year - a 25% increase year-on-year. 70% of secrets detected back in 2022 were still valid in 2024.
Put those three reports next to each other and a pattern falls out. We are generating more code, faster, with more confidence, while reviewing less of it. The risk does not disappear because the developer feels good. It just moves from "caught in review" to "caught in production."
The fix is not to review everything again. We tried that - it is why PRs rotted for three days before anyone approved them. The fix is to decide what you review line-by-line and what you genuinely let the model own, based on what happens when it is wrong.
## 3. The Blast Radius Principle
Every file in a codebase has three properties that matter for review triage:
- Blast radius - how many users, systems, or dollars are affected when this file is wrong?
- Reversibility - if it breaks, can I fix it in 5 minutes, 5 hours, or never?
- Detection lag - how quickly will I notice the mistake? Immediately, next deploy, next quarter, or after an auditor calls?
A UI button with the wrong colour has small blast radius, high reversibility, instant detection. A Terraform file that deletes a production database on apply has enormous blast radius, near-zero reversibility, and often silent detection until it is too late.
These two files deserve radically different review effort. Treating them the same is where teams burn out or get breached.
The five tiers below rank files by the product of those three properties. Tier 1 carries the highest compound risk and demands the most human eyeballs; Tier 5 the least.
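To make the triage mechanical, here is a toy sketch of the idea. The multiplicative score and the thresholds are mine and purely illustrative - calibrate them against your own incidents, not against this post:

```javascript
// Toy triage: score each property 1 (benign) to 5 (scary). Reversibility and
// detection are inverted, so "hard to reverse" and "slow to notice" score high.
// Thresholds below are illustrative, not a calibrated model.
function reviewTier({ blastRadius, irreversibility, detectionLag }) {
  const risk = blastRadius * irreversibility * detectionLag; // 1..125
  if (risk >= 60) return 1; // read every line
  if (risk >= 24) return 2; // block on human approval
  if (risk >= 8) return 3;  // read the diff
  if (risk >= 3) return 4;  // 30-second skim
  return 5;                 // trust, catch in staging
}

// A Terraform file vs a CSS tweak, scored the way this section describes.
console.log(reviewTier({ blastRadius: 5, irreversibility: 5, detectionLag: 4 })); // 1
console.log(reviewTier({ blastRadius: 1, irreversibility: 1, detectionLag: 2 })); // 5
```

The exact numbers matter less than forcing yourself to answer the three questions per file instead of per PR.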
## 4. Tier 1 - Always Review Line-by-Line
Review posture: read every line, every time, even if the diff is one character. No shortcuts. No "LGTM" from skimming.
Files in this tier:
- Dockerfiles and base image selections - a wrong `FROM` silently ships vulnerabilities or breaks the production runtime
- Infrastructure as Code - Terraform, Pulumi, CloudFormation, Kubernetes manifests, Helm charts
- CI/CD configuration - GitHub Actions, GitLab CI, Jenkins, deploy scripts, release automation
- IAM and secrets handling - policy files, RBAC configs, `.env.example`, anything that touches credentials
- Database migrations - schema changes, backfills, destructive statements, anything that runs once and cannot be reverted cleanly
- Environment state - `.git/config`, remote URLs, branch protection rules, pre-commit hooks
Why: these files have the highest blast radius in the system. A bad Dockerfile can 10x your image size and break every deploy until fixed. A bad IAM policy can make an S3 bucket world-readable. A bad migration can corrupt a 50M-row table. An AI-generated Terraform plan applied with `-auto-approve` and no plan-review step can delete resources you did not know existed.
AI is very good at generating plausible-looking infrastructure. "Plausible-looking" and "correct" are not the same thing. Read every line.
Example - Dockerfile:

Wrong:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

Correct:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
USER node
EXPOSE 3000
CMD ["node", "server.js"]
```

That single missing `USER node` line is why you read every line. Without it, the container runs as root by default. An attacker who breaks out of the container has root on your infrastructure.
Example - Deploy script, with the Windsurf incident in mind:

Wrong:

```bash
#!/bin/bash
set -e
npm run build
git push origin main:production
```

Correct:

```bash
#!/bin/bash
set -e

echo "Current branch: $(git rev-parse --abbrev-ref HEAD)"
echo "Current remote: $(git remote get-url origin)"
echo "HEAD commit:    $(git rev-parse HEAD)"

read -p "Continue deploy? (y/n) " -n 1 -r
echo
[[ $REPLY =~ ^[Yy]$ ]] || exit 1

if [[ "$(git rev-parse --abbrev-ref HEAD)" != "main" ]]; then
  echo "ERROR: Not on main branch. Aborting."
  exit 1
fi

npm run build
git push origin main:production
```

Without that environment check, my Windsurf incident would have pushed to the wrong remote silently. The AI generated the script correctly; the problem was environment state the agent had changed and I did not verify.
## 5. Tier 2 - Review Before Merge
Review posture: block on human approval. Run it locally or in staging. Do not trust the AI's "I tested this" claim - it usually has not.
Files in this tier:
- API contracts - OpenAPI specs, GraphQL schemas, protobuf definitions, public SDK interfaces
- Authentication and authorisation logic - login flows, session handling, token validation, middleware
- Payment and billing code - Stripe webhooks, subscription logic, invoice generation, refund flows
- Rate limiting, quotas, throttling - anything that protects your system from abuse or your bill from surprises
- Cross-service contracts - message queue schemas, event payloads, webhook formats
Why: Tier 2 is where the confidence paradox from the Stanford study hits hardest. Auth bugs are the most cited class of "AI wrote it, looked right, shipped" failures. An AI-generated JWT validator that skips signature checks on expired tokens looks identical to one that does not. A Stripe webhook handler that does not verify the signing secret looks identical to one that does. The difference is a two-line check, and the diff does not scream at you.
These files are also where breaking changes ripple outward. A silently modified API contract can break three downstream clients you do not own.
Example - JWT validation (the dangerous diff):

Wrong:

```javascript
function validateToken(token) {
  try {
    const decoded = jwt.decode(token); // decode != verify
    if (decoded.exp < Date.now() / 1000) {
      return { valid: false };
    }
    return { valid: true, user: decoded };
  } catch (e) {
    return { valid: false };
  }
}
```

Correct:

```javascript
function validateToken(token) {
  try {
    const decoded = jwt.verify(token, process.env.JWT_SECRET);
    return { valid: true, user: decoded };
  } catch (e) {
    return { valid: false };
  }
}
```

The difference is `jwt.decode()` vs `jwt.verify()`. One line. The first checks the expiry claim but never checks the signature, so it accepts any well-formed token - signed by anyone, or by no one. The second rejects anything not signed with your secret. A human skimming this diff might miss the semantic difference.
Example - Stripe webhook verification:

Wrong:

```javascript
app.post('/webhook', (req, res) => {
  const event = JSON.parse(req.body.toString());
  if (event.type === 'charge.succeeded') {
    markPaymentProcessed(event.data.object.customer_id);
  }
  res.json({ received: true });
});
```

Correct:

```javascript
app.post('/webhook', (req, res) => {
  const signature = req.headers['stripe-signature'];
  let event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body.toString(),
      signature,
      process.env.STRIPE_WEBHOOK_SECRET
    );
  } catch (err) {
    return res.status(400).send(`Webhook Error: ${err.message}`);
  }
  if (event.type === 'charge.succeeded') {
    markPaymentProcessed(event.data.object.customer_id);
  }
  res.json({ received: true });
});
```

Without the signature check, an attacker can send a fake webhook saying "charge succeeded" without ever paying. The second version only accepts events signed with the exact webhook secret Stripe issued you. Not optional.
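Rate limiting from the same tier deserves the same scrutiny. Here is a minimal sketch of a fixed-window limiter - names and limits are mine, purely illustrative. The thing to review is not the arithmetic but what `key` is derived from: keying on a spoofable header like `X-Forwarded-For` quietly defeats the whole mechanism:

```javascript
// Illustrative in-memory fixed-window rate limiter. Production systems want a
// shared store (e.g. Redis) and usually a sliding window, but the review
// questions are the same: what is the key, and what happens at window edges?
function createRateLimiter({ windowMs, max }) {
  const windows = new Map(); // key -> { count, startedAt }
  return function allow(key, now = Date.now()) {
    const w = windows.get(key);
    if (!w || now - w.startedAt >= windowMs) {
      windows.set(key, { count: 1, startedAt: now });
      return true;
    }
    w.count += 1;
    return w.count <= max;
  };
}

const allow = createRateLimiter({ windowMs: 60_000, max: 3 });
const results = [1, 2, 3, 4].map(() => allow('203.0.113.7'));
console.log(results); // [ true, true, true, false ]
```

An AI-generated limiter will almost always get the counting right and the keying wrong - that is the two-line check a Tier 2 review exists to catch.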
## 6. Tier 3 - Review the Diff
Review posture: read the diff, scan for obvious smells, run tests, trust on the second look.
Files in this tier:
- Core business logic - domain models, service classes, orchestration layers
- Data pipelines - ETL scripts, transformation logic, aggregations that feed dashboards
- Background jobs and workers - queue consumers, cron handlers, scheduled tasks
- Caching layers - Redis wrappers, in-memory caches, cache invalidation logic
Why: these files shape behaviour, not infrastructure. A bug here costs you a wrong number in a report, a delayed job, or a stale cache - annoying, but reversible within a sprint. Read the diff, flag anything that feels off, lean on your test suite as the second reviewer. Do not read every line; you will burn out.
The failure mode to watch for is what I called out in The Hidden Cost of AI-Generated Code: locally correct code that erodes global coherence. The diff looks fine. The third time you look at the file, you realise the AI invented a new pattern that conflicts with the three that already existed. Catch those in the diff review, not six sprints later.
Example - Business logic with pattern drift:

Wrong (pattern inconsistent with the codebase):

```javascript
class UserService {
  async createUser(data) {
    // AI-introduced pattern: signal failure by throwing
    if (!data.email || !data.name) {
      throw new Error('Missing fields');
    }
    // ...create the user
  }
}
```

Correct (matches the existing codebase pattern):

```javascript
class UserService {
  async createUser(data) {
    // Existing pattern: validate, then return an error object
    const validation = validateUser(data);
    if (!validation.ok) {
      return { error: validation.errors };
    }
    // ...create the user
  }
}
```

Both work. But one throws exceptions, the other returns error objects. Three sprint cycles later, another engineer will write code expecting one and get the other. You just added inconsistency debt. Read the diff to catch this early; it will not fail in CI.
## 7. Tier 4 - Skim for Smells
Review posture: a 30-second look. Make sure it is not obviously wrong. Let tests catch the rest.
Files in this tier:
- Internal utilities - helpers, formatters, date manipulation, string parsing
- Test files - unit tests, fixtures, mocks, test helpers
- Scripts - one-off data fixes, local dev tooling, developer ergonomics
- Config for non-critical tools - linters, formatters, editor configs
Why: low blast radius, high reversibility. A broken util breaks the one function that calls it - you will see it in CI within minutes. Tests are self-verifying by design. Scripts are run once and thrown away. Obsessing over these files is where review effort goes to die.
The one exception: if the test is the only guardrail protecting a Tier 1 or Tier 2 file, promote it. A test covering your Stripe webhook verifier is a Tier 2 file in disguise.
## 8. Tier 5 - Trust But Verify Later
Review posture: trust the model, ship it, catch it in staging or production with your eyes.
Files in this tier:
- UI components - React/Vue/Svelte components, styling, layouts
- CSS and Tailwind - spacing, colours, responsive breakpoints
- Copy and content - marketing pages, microcopy, tooltip text
- Images, icons, static assets
- Documentation - READMEs, code comments, changelogs
Why: these are the highest-surface-area, lowest-blast-radius files in your codebase. A misaligned button costs nothing. A typo in a tooltip is a same-day fix. Visual regression tests, Storybook, and a browser window catch the rest faster than you can read a diff.
This is the tier where AI legitimately earns its speed gains. Let the model work. Spot check in the preview deploy.
## 9. The Review Checklist You Can Steal
Put this at the top of your engineering handbook. Or paste it into your `CLAUDE.md` or `.cursorrules` so the agent itself respects the hierarchy.
| Tier | Review Effort | Example Files | Failure Cost |
|---|---|---|---|
| 1 | Every line, every time | Dockerfile, Terraform, IAM, migrations, CI/CD, git config | Outage, breach, weekend lost |
| 2 | Block on human approval | API contracts, auth, payments, rate limits | Silent data corruption, security hole |
| 3 | Read the diff, trust tests | Business logic, pipelines, workers | Wrong number in a report |
| 4 | 30-second skim | Utilities, tests, dev scripts | CI failure, fix in minutes |
| 5 | Trust, catch in staging | UI, CSS, copy, docs | Visual bug, same-day fix |
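For the agent-facing version, here is one way the table could be condensed into a rules file. The wording and file globs are mine and illustrative - adapt the paths to your repo:

```text
## AI review tiers (see engineering handbook)
- Tier 1 (Dockerfile, *.tf, .github/workflows/*, migrations/*, IAM policies):
  never auto-apply; always present a diff and wait for explicit human approval.
- Tier 2 (API contracts, auth/*, payments/*, rate limiting):
  require human approval before merge; never claim "tested" without showing the run.
- Tier 3 (services/*, pipelines/*, workers/*): follow the error-handling and
  validation patterns already in the file; do not introduce new ones.
- Tier 4/5 (utils, tests, UI, copy, docs): proceed without asking.
```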
Three rules that make this stick on a real team:
- Encode the tiers in your `CODEOWNERS` file. Tier 1 files should require a senior engineer on the PR regardless of who authored it. GitHub will enforce what discipline will not.
- Run a pre-deploy check for environment drift. Before any deploy script runs, it should print the current branch, current remote, and current HEAD, and ask for explicit confirmation. This would have caught my Windsurf incident in three seconds.
- Scan Tier 1 and Tier 2 files with automated tooling - Snyk, GitGuardian, `tflint`, `hadolint`, `kubeval`. Humans review judgement. Tools review syntax. Do not make humans do what tools do faster.
## 10. Closing: Review Narrower, Review Harder
The common reading of AI-assisted engineering is that review becomes optional. It does not. It becomes surgical.
Every hour of review time you used to spend reading a React component is now free. Spend it on the Terraform file instead. Spend it on the auth middleware. Spend it on the migration that runs on 50 million rows tomorrow morning.
The Stanford researchers summarised their finding in a line I have not stopped thinking about. Developers who trusted the AI less and engaged more with their prompts produced code with fewer security vulnerabilities. Less trust, more engagement. That is the whole discipline.
AI is a force multiplier on everything - including mistakes. The files where a mistake costs you the most are the files where you have to slow down, zoom in, and read every line like a human who still takes responsibility for production.
The other 80%? Let the model cook.
Related reading: The Hidden Cost of AI-Generated Code (and How to Fix It) and The Ideal Claude Code Project Structure That Actually Scales.
Working through the challenges in this post? I help engineering leaders and CTOs navigate complex technical decisions and scale high-performing teams. Schedule a consultation →