Context Engineering: The Operating Discipline for Reliable AI Systems

In Q2 2025, a logistics company retired an AI-assisted dispatch tool after six months of operation. The model was GPT-4. The failure post-mortem named one root cause: the context window was filled with historical route data - distances, average delivery times, past driver assignments. Real-time traffic constraints, updated vehicle availability, and current road closures were absent. The model was capable. It was simply working from the wrong brief.

That is not a model failure. It is a context engineering failure - and it is the most common class of production AI failure in 2026. In July 2024, Gartner predicted that 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, naming poor data quality and unclear business value among the causes. Both symptoms trace to the same root: teams invest in the model and neglect what they feed it.

Context engineering is the discipline that addresses this. It is distinct from prompt engineering - a narrower craft focused on phrasing individual instructions. Context engineering asks a harder question: what is the optimal configuration of information the model should see at each step of a workflow, and how do you build the systems that assemble it reliably?

What context engineering actually is

Think of briefing a consultant. You do not hand a McKinsey partner your entire file server and ask them to find the relevant documents. You prepare a structured brief: the decision to be made, the constraints that are fixed, the three reports most relevant to this engagement, and the format you expect back. Context engineering is that discipline, applied to AI systems at scale.

Anthropic's engineering team defines context engineering precisely: "Context refers to the set of tokens included when sampling from a large-language model. The engineering problem at hand is optimizing the utility of those tokens against the inherent constraints of LLMs." Phil Schmid sharpened this in June 2025: "Context Engineering is the discipline of designing and building dynamic systems that provide the right information and tools, in the right format, at the right time, to give a LLM everything it needs to accomplish a task."

Context, in plain terms, is everything the AI can see at the moment it answers. Your question. The documents attached. The instructions set in the system prompt. The history of the conversation. A tool list. Any retrieved knowledge. It is finite - a model cannot attend to more than its context window allows - and quality degrades when that window is poorly filled. Anthropic describes LLMs as having an "attention budget" with diminishing returns as context grows: "a performance gradient rather than a hard cliff."

Prompt engineering is a component of context engineering - specifically the craft of writing the instruction portion of that context. But in production agent systems, prompts represent a small fraction of total context. Andrej Karpathy's analogy, widely referenced after the LangChain team formalized it in July 2025, is apt: the LLM is the CPU, and the context window is RAM. What you load into RAM determines what computation is possible.

Context engineering is therefore a systems design problem, not a writing problem. This article treats it as one. For teams that have already built out context primitives and knowledge graph architectures, the knowledge graphs deep-dive covers the implementation layer. The article you are reading is the operating framework - why the discipline exists, how to measure whether it is working, and how to structure it for production use.

The research problem: longer is not better

The dominant failure mode in enterprise AI deployments is not model weakness. It is context noise - filling the window with the wrong information, in the wrong order, at too great a length. A paper from Stanford - "Lost in the Middle: How Language Models Use Long Contexts" by Nelson Liu and colleagues, posted to arXiv in 2023 and published in the Transactions of the Association for Computational Linguistics in 2024 - documented this with precision.

The finding: language model performance follows a U-shaped curve based on where relevant information appears in the context. When the key data appears at the beginning or end of the input, accuracy is highest. When it is buried in the middle - surrounded by other documents - accuracy degrades significantly. This effect persists across model families and is not resolved by simply increasing context window size.

Illustrative - based on Liu et al., 2023 (arXiv:2307.03172) multi-document question-answering experiments. The U-shaped degradation pattern holds across model families tested in that study. Performance indexed to 100 at the best-case position.
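One mitigation that follows from the U-shaped curve is to put the strongest retrieved passages at the edges of the prompt and let the weakest fall in the middle. The sketch below assumes the passages arrive already ranked best-first; `edge_first_order` is a hypothetical helper name used for illustration, not something taken from the paper.

```python
def edge_first_order(ranked: list[str]) -> list[str]:
    """Reorder best-first passages so the strongest sit at the start and end.

    Passages at even positions (rank 1, 3, 5, ...) fill the front in order.
    Passages at odd positions (rank 2, 4, ...) fill the back in reverse,
    so the lowest-ranked passages end up in the middle of the prompt.
    """
    front = ranked[0::2]
    back = ranked[1::2]
    return front + back[::-1]


if __name__ == "__main__":
    passages = ["rank1", "rank2", "rank3", "rank4", "rank5"]
    print(edge_first_order(passages))
```

Whether this reordering helps on a given model and task is an empirical question; treat it as something to test against your own evaluation set rather than a guaranteed gain.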

The 100-LongBench paper (ACL 2025, arXiv:2505.19293) extended this work with a length-controllable benchmark that disentangles baseline model knowledge from true long-context capability. Its finding reinforces the same pattern: as context length grows beyond the practical threshold for a given task, accuracy does not plateau - it declines.

Illustrative - based on accuracy-versus-length patterns documented in 100-LongBench (ACL 2025, arXiv:2505.19293). Exact decay rates vary by model family, task type, and retrieval strategy. The pattern of diminishing returns beyond optimal context length is consistent across benchmarks.

The practical implication is counterintuitive: the optimal context window is narrower than the maximum context window. Filling available capacity is not a strategy. It is a symptom of context engineering debt.

The counterargument: what about larger context windows?

The obvious objection is that context windows are growing. Gemini 1.5 supports one million tokens. Future models may support ten million. Does that not make context engineering obsolete - allowing teams to simply include everything?

It does not, for three reasons. First, the position-sensitivity effect documented in "Lost in the Middle" persists even at large context sizes - models still struggle when the relevant signal is diffuse and surrounded by noise, regardless of total window capacity. Second, latency and cost scale with context size: a 128K-token context costs more and responds more slowly than a 4K-token one, at any price point. Third, retrieval quality - the precision of what is included - determines output quality independently of window size. A million-token window filled with the wrong documents produces the same failure mode as the logistics dispatch system described at the start of this article.
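The cost half of the second point is simple arithmetic. The sketch below assumes per-token input pricing; the price and the daily call volume are hypothetical placeholders, not quotes from any provider.

```python
def input_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one call in dollars, assuming linear per-token pricing."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    PRICE = 3.00            # hypothetical $ per 1M input tokens
    CALLS_PER_DAY = 10_000  # hypothetical production volume
    for tokens in (4_000, 128_000):
        daily = input_cost(tokens, PRICE) * CALLS_PER_DAY
        print(f"{tokens:>7} tokens/call -> ${daily:,.2f}/day")
```

Under linear pricing the ratio between the two contexts is 32 to 1 whatever the actual price is, which is why the point holds "at any price point".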

Larger windows raise the ceiling. They do not change the discipline required to operate reliably within that ceiling.

A measurement framework

Most teams that deploy AI agents cannot measure whether their context setup is working. They observe output quality subjectively - "it seems better now" - and iterate without a baseline. This is the evaluation gap that Anthropic's engineering team identified in their "Demystifying Evals for AI Agents" guide: teams with structured evaluations can upgrade models in days while competitors without them face weeks of manual testing.

Seven metrics instrument context quality systematically. Each is defined below in plain terms first, with the technical label following for teams that need to cross-reference with benchmarking literature.

What you are measuring	Technical label	Business question it answers	Low-tech proxy (no tooling required)
How many of the right documents made it into the model's working memory	Recall@K	Is the retrieval step surfacing what the model needs?	Manually review the top 5 retrieved chunks for 10 test queries. Count how many you would have selected yourself.
Share of tasks completed without human correction	Task success rate	Is the overall context system producing usable output?	Track how many AI outputs you accepted vs. rewrote over a week.
Rate of factually incorrect outputs requiring catch-and-fix	Hallucination / error rate	How often does a human need to intervene to prevent a bad outcome?	Flag and count errors in a sample of 50 outputs. Divide by 50.
Time from request to first useful token	Latency (ms)	Is context size creating unacceptable delays in production?	Time a representative task with a stopwatch before and after a context change.
Cost per call in dollars, driven by total token count	Token cost ($/call)	Is the context size financially sustainable at production volume?	Check your API billing dashboard before and after reducing context size.
Share of tokens sent that the model actually uses in its response	Context utilization ratio	How much of what you are paying to send is dead weight?	Compare input token count to the length and specificity of the output. If outputs are generic despite rich context, utilization is low.
Fraction of outputs the user subsequently edits or rejects	Answer revision rate	Is the context producing outputs that require downstream human effort to fix?	Count accepted-as-is vs. edited outputs for a one-week period.

Not every team needs all seven. A content workflow running at low volume should start with task success rate and answer revision rate - both are measurable without instrumentation. An agent system at production scale should add token cost and latency. Recall@K becomes essential once retrieval is a meaningful part of the architecture.
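Two of these metrics can be computed from hand-labelled samples with a few lines of Python. The sketch below is a minimal illustration: the function names and the sample data are hypothetical, and "relevant" means whatever a human reviewer marked as the documents they would have selected.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def error_rate(has_error: list[bool]) -> float:
    """Share of sampled outputs a reviewer flagged as factually wrong."""
    if not has_error:
        return 0.0
    return sum(has_error) / len(has_error)


if __name__ == "__main__":
    retrieved = ["doc3", "doc7", "doc1", "doc9", "doc4"]  # ranked retriever output
    relevant = {"doc1", "doc3", "doc8"}                   # hand-labelled ground truth
    print(f"Recall@5: {recall_at_k(retrieved, relevant, 5):.2f}")

    flags = [False] * 47 + [True] * 3                     # placeholder review of 50 outputs
    print(f"Error rate: {error_rate(flags):.2%}")
```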

Illustrative composition for a production retrieval-augmented agent. Token ratios vary significantly by task type, agent architecture, and retrieval strategy. Most teams default to overweighting conversation history - the lowest-signal segment.

The composition chart above illustrates a common failure mode: conversation history consuming 20% of context - a segment that is largely redundant noise in most workflows after the third turn. The mechanic that addresses this is compression, described in the next section.
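A composition audit like this can be approximated before any tokenizer is wired in. The sketch below uses whitespace-separated word count as a rough stand-in for tokens, which is an assumption: real token counts depend on the model's tokenizer. The segment names are illustrative.

```python
def composition(segments: dict[str, str]) -> dict[str, float]:
    """Return each segment's share of the total prompt size.

    Size is measured in whitespace-separated words, a rough proxy for tokens.
    """
    sizes = {name: len(text.split()) for name, text in segments.items()}
    total = sum(sizes.values())
    if total == 0:
        return {name: 0.0 for name in sizes}
    return {name: size / total for name, size in sizes.items()}


if __name__ == "__main__":
    # Placeholder strings standing in for real prompt segments.
    segments = {
        "system_instructions": "word " * 200,
        "retrieved_documents": "word " * 500,
        "conversation_history": "word " * 250,
        "tool_definitions": "word " * 50,
    }
    for name, share in composition(segments).items():
        print(f"{name:<22} {share:.0%}")
```

Logging these shares per call makes it visible when one segment, typically conversation history, starts crowding out the rest.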

Based on retrieval precision patterns documented in LangChain's Context Engineering for Agents (July 2025). Actual gains vary by corpus quality, query complexity, and re-ranking model. The directional finding - that fewer, better-ranked documents outperform large undifferentiated sets - is consistent across implementations.

The four mechanics: write, select, compress, isolate

LangChain's engineering team formalized four mechanics for managing agent context in their July 2025 guide. The taxonomy has since become the practical standard for context system design. Each mechanic addresses a different failure mode in how information enters, stays in, or leaves the context window.

Write - building the brief before the meeting

Saving context outside the context window so it can be loaded deliberately at the right time - that is what engineers call the write mechanic. In practice, it means structured artifacts: a project brief that defines the goal, constraints, and acceptance criteria; a session checkpoint that captures decisions and open questions after each meaningful interaction; role profiles that define different context configurations for different task types.

The Anthropic engineering guide recommends calibrating system prompts to the "right altitude" - specific enough to guide behavior, flexible enough to be robust across variants of the task. Claude Code's CLAUDE.md pattern, described in the project structure guide, is a direct implementation of the write mechanic: a structured file read before every session that prevents the model from re-deriving project state from scratch.

Junior engineer entry point: Create a project_brief.md file in your project root with six fields - goal, stack, key files, definitions, acceptance criteria, and what you are not doing. Make it the first thing you attach in every AI session. This alone eliminates most context reconstruction overhead.

# Project brief

## Goal
[What this project builds and why - one sentence]

## Stack
[Languages, frameworks, key services - 3–5 items]

## Key files
[The 5–8 files most relevant to active work - relative paths]

## Definitions
[Project-specific terms, abbreviations, or conventions]

## Acceptance criteria
[What "done" looks like - measurable if possible]

## Do not do
[Hard constraints, out-of-scope work, anti-patterns to avoid]
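A brief only helps if it stays complete. The sketch below checks a brief for the six section headings used in the template above; `missing_sections` is a hypothetical helper, and it assumes each section is a second-level markdown heading.

```python
import re

# Section names taken from the template; matching is case-insensitive.
REQUIRED = ["Goal", "Stack", "Key files", "Definitions", "Acceptance criteria", "Do not do"]


def missing_sections(brief_text: str) -> list[str]:
    """Return the required '## ' headings absent from the brief, in template order."""
    headings = re.findall(r"^##\s+(.+)$", brief_text, flags=re.MULTILINE)
    found = {h.strip().lower() for h in headings}
    return [name for name in REQUIRED if name.lower() not in found]


if __name__ == "__main__":
    draft = "# Project brief\n\n## Goal\nShip the importer.\n\n## Stack\nPython, Postgres\n"
    print("Missing:", missing_sections(draft))
```

Running a check like this in a pre-commit hook or at session start is one way to keep the brief from quietly decaying.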

Select - pulling the three right files, not the full archive

Retrieving only the relevant context into the window at the moment it is needed - that is the select mechanic. The failure mode it addresses is context dumping: pasting all available documentation into a prompt and hoping the model finds the relevant parts. It does not, reliably - as the position-sensitivity research demonstrates.

The progression from simple to sophisticated: keyword search over markdown notes (returns top-5 matching chunks); embedding-based semantic search (returns chunks most similar in meaning, not just matching words); re-ranking (a second pass that scores retrieved chunks by task-specific relevance before including them). LangChain's July 2025 guide cites research showing that tool selection accuracy improves threefold when relevant tools are retrieved into context rather than listed exhaustively. For knowledge graph implementations that formalize the retrieval layer, see the context primitives article.

Junior engineer entry point: Instead of pasting entire documents, split them into 200-400 word chunks. Use a simple keyword search to retrieve the top 5 matching chunks for each task. This alone outperforms full-document inclusion for most question-answering workflows.
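The chunking step can be sketched with the standard library alone. The version below caps chunks at a word limit and prefers to break on blank-line paragraph boundaries; it does not enforce a minimum size, and `chunk_by_words` is a hypothetical name chosen for this example.

```python
def chunk_by_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into chunks of at most max_words, breaking between paragraphs where possible."""
    chunks: list[str] = []
    current: list[str] = []  # paragraphs in the chunk being built
    count = 0                # words currently held in `current`

    def flush() -> None:
        nonlocal count
        if current:
            chunks.append("\n\n".join(current))
            current.clear()
            count = 0

    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        if count + len(words) > max_words:
            flush()
        # A paragraph longer than the limit is cut at raw word boundaries.
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        current.append(" ".join(words))  # whitespace inside a paragraph is normalized
        count += len(words)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "\n\n".join("lorem " * 150 for _ in range(5))
    print([len(chunk.split()) for chunk in chunk_by_words(sample)])
```

Each chunk can then be written to its own file so that a simple keyword search has something to rank.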

Compress - the executive summary instead of the transcript

Retaining only the tokens required to perform the task - that is the compress mechanic. As conversations grow, early turns accumulate context that is no longer relevant: exploratory questions, discarded directions, intermediate outputs. The compress mechanic converts prior conversation history into a structured summary that preserves what matters: decisions, numbers, open questions, terminology, and constraints.

Claude Code implements this automatically - triggering a compaction step when the context exceeds 95% of the window. For workflows built without that infrastructure, the pattern is manual but straightforward: after each meaningful session, write a 5-to-7 point summary covering the fields named above (decisions, numbers, open questions, terminology, and constraints), and use that summary as the context for the next session rather than replaying the full chat history. The token reduction guide covers 16 compression techniques with specific token measurements.
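For a home-built workflow, the same idea reduces to a threshold check plus a history rewrite. The sketch below is an illustration under stated assumptions, not a description of how Claude Code works internally: the function names are hypothetical, token counts are supplied by the caller, and the summary text is written elsewhere (by hand or by a separate model call).

```python
def needs_compaction(used_tokens: int, window_tokens: int, threshold: float = 0.95) -> bool:
    """True once usage crosses the given share of the context window."""
    return used_tokens > window_tokens * threshold


def compact(history: list[str], summary: str, keep_last: int = 4) -> list[str]:
    """Replace all but the last keep_last turns with a single summary message."""
    if len(history) <= keep_last:
        return list(history)
    recent = history[len(history) - keep_last:]
    return [f"Summary of earlier turns:\n{summary}"] + recent


if __name__ == "__main__":
    turns = [f"turn {i}" for i in range(1, 11)]
    if needs_compaction(used_tokens=96_000, window_tokens=100_000):
        turns = compact(turns, summary="Decided on Postgres; auth is still open.", keep_last=3)
    print(turns)
```

Keeping the most recent turns verbatim preserves the immediate working state, while the summary carries forward only what earlier turns settled.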

Illustrative comparison for a retrieval-augmented task with approximately 5,200 vs. 1,850 input tokens. Raw history includes all message turns from a multi-session project. Structured context includes a session checkpoint, retrieved snippets, and role instructions only. Actual savings scale with context size and model pricing.

Junior engineer entry point: At the end of every AI session, write a 5-bullet summary before closing the tab. Include: what was decided, what changed, what remains open, which files were touched, and the next step. Use that summary as the first message in the next session. This is the simplest possible compression implementation and it works.

# Session summary - [date]

## What changed
[Files modified, decisions implemented, features shipped]

## What was decided
[Choices made and why - keep the rationale, not just the outcome]

## What remains open
[Unresolved questions, blocked tasks, deferred decisions]

## Files touched
[List modified files for quick context at next session start]

## Next step
[Single most important thing to do next - one line]

Isolate - the right specialist for the right question

Splitting context so that different agents or tasks operate on separate, minimal subsets of information - that is the isolate mechanic. The failure mode it addresses is context pollution: giving a coding agent access to marketing copy retrieval tools, or a content drafting agent access to a database schema it will never use.

The implementation is a context_profiles/ folder with separate configuration files per task type: a debug profile includes logs, failing tests, and recent diffs; a writing profile includes audience definition, tone guidelines, and relevant prior outputs; a research profile includes source documents and citation requirements. LangGraph and multi-agent architectures formalize this with agent-specific memory scopes. The simpler version is a YAML file per role, loaded selectively. For teams building agent workflows end-to-end, the prompt-to-system guide covers the full pipeline architecture.

Junior engineer entry point: Create three text files: debug-context.md, write-context.md, and research-context.md. Each contains the system instructions and tool list for that specific task type only. Attach the relevant file at session start instead of a general-purpose system prompt.

role: debugging agent
system: |
  You are a senior engineer diagnosing a production bug.
  Focus on identifying root cause and proposing a minimal fix.
  Do not refactor unrelated code. Do not add features.
tools:
  - read_file
  - run_tests
  - search_logs
context_files:
  - logs/latest.log
  - src/failing_module.py
  - tests/test_failing.py
exclude:
  - docs/
  - marketing/
  - scripts/deploy/
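To show how the `exclude` list in a profile like this might be enforced, the sketch below mirrors the profile as a plain Python dict, since parsing YAML would need a third-party library. The helper names are hypothetical, and prefix matching on path components is one possible interpretation of `exclude`.

```python
from pathlib import PurePosixPath

# The debugging profile above, mirrored as a dict for a stdlib-only example.
DEBUG_PROFILE = {
    "role": "debugging agent",
    "tools": ["read_file", "run_tests", "search_logs"],
    "context_files": ["logs/latest.log", "src/failing_module.py", "tests/test_failing.py"],
    "exclude": ["docs/", "marketing/", "scripts/deploy/"],
}


def is_excluded(path: str, exclude: list[str]) -> bool:
    """True if the path lies inside any excluded directory (component-wise match)."""
    parts = PurePosixPath(path).parts
    for prefix in exclude:
        prefix_parts = PurePosixPath(prefix).parts
        if parts[:len(prefix_parts)] == prefix_parts:
            return True
    return False


def allowed(profile: dict, candidates: list[str]) -> list[str]:
    """Filter candidate paths down to those the profile does not exclude."""
    return [p for p in candidates if not is_excluded(p, profile["exclude"])]


if __name__ == "__main__":
    candidates = ["src/app.py", "marketing/copy.md", "scripts/deploy/run.sh", "scripts/lint.sh"]
    print(allowed(DEBUG_PROFILE, candidates))
```

Matching on whole path components rather than raw string prefixes keeps a directory such as `docs_old/` from being excluded by the `docs/` rule.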

Illustrative comparison across 10 sequential tasks in the same project. Without session checkpoints, the model re-derives project state on each session start. With checkpoints, prior decisions and context are reused directly. Actual reuse rates depend on task similarity and checkpoint quality.

The agentic frontier: context across agent boundaries

In multi-agent systems - where a coordinating agent dispatches subtasks to specialized subagents - context degrades at handoff boundaries. Each subagent receives a compressed representation of the parent task, and information fidelity depends entirely on how that representation is structured. Anthropic's multi-agent research architecture addresses this explicitly: the lead researcher saves a structured plan to shared memory, and subagents operate in parallel with isolated context windows that include only their assigned scope. Without deliberate handoff protocols, context entropy accumulates across agent boundaries and compounds into output degradation that is difficult to diagnose. The 2% problem article covers the full harness architecture - context management is a significant share of that 98%.
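One way to make a handoff deliberate is to give it a fixed shape. The sketch below is a hypothetical handoff packet, not Anthropic's architecture: the `Handoff` class and its field names are assumptions chosen to show the idea of passing a subagent only its assigned scope.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Everything a subagent receives from the coordinating agent, and nothing else."""
    goal: str                       # the parent task, in one sentence
    scope: str                      # the slice this subagent owns
    constraints: list[str] = field(default_factory=list)
    inputs: list[str] = field(default_factory=list)  # file paths or snippet ids
    expected_output: str = ""

    def to_prompt(self) -> str:
        """Render the packet as a plain-text brief for the subagent."""
        lines = [f"Goal: {self.goal}", f"Your scope: {self.scope}"]
        if self.constraints:
            lines.append("Constraints:")
            lines += [f"- {c}" for c in self.constraints]
        if self.inputs:
            lines.append("Inputs:")
            lines += [f"- {i}" for i in self.inputs]
        if self.expected_output:
            lines.append(f"Return: {self.expected_output}")
        return "\n".join(lines)


if __name__ == "__main__":
    packet = Handoff(
        goal="Summarize recent research on long-context evaluation",
        scope="Benchmarks only; ignore training methods",
        constraints=["Cite every claim"],
        expected_output="Five bullet points with sources",
    )
    print(packet.to_prompt())
```

Because the packet is a plain data structure, it can be logged at every boundary, which makes degradation across handoffs easier to trace.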

From practice: a real workflow

Abstract frameworks are tested against production constraints. The following examples come from a five-agent content and job-search workflow running continuously across six months of operation.

From practice - the write mechanic as identity guardrail: One of the system's core context artifacts is a 217-line identity file structured in five parts: permanent identity constraints, tone composition ratios, three signature narrative patterns, anti-drift guardrails, and reference calibration. The file does not say "sound authentic" - it says what to reject. Specific failure modes are named: vague claims of competence, hedging language, generic examples that an AI could invent without domain knowledge. The measurable outcome: 90+ published pieces across six months with no detectable voice drift between them. The identity file functions as a write-mechanic artifact - authored once, retrieved consistently, preventing context ambiguity about the author's voice at every drafting session.

Actual token counts measured across a 5-agent content workflow over 30 days of operation. Raw history includes all message turns from multi-session projects. Structured packet includes session checkpoint, retrieved relevant snippets, role instructions, and identity constraints only. Token counts via API usage logs.

From practice - the select and isolate mechanics as duplication prevention: A second context artifact is a tracked list of every topic covered across 90+ published pieces, maintained in a plain text file. Before any new drafting session, this list is retrieved and checked - a manual implementation of the select mechanic that prevents topic duplication. A companion archive folder holds previously used ideas; it is never surfaced to the drafting agent, implementing isolation by exclusion. The measurable outcome: zero duplicate topics across 90+ pieces, and no archived angle reused. The system has no AI-powered deduplication. It has structured context that makes duplication structurally difficult.

Both examples share a property: the context engineering is visible, auditable, and file-based. There is no black-box memory system. Every artifact is a document that can be read, edited, and version-controlled. This is the most important design principle for teams starting out - make context legible before making it automated.

A starter stack for engineers

This section is for the engineer who has never built a retrieval pipeline. You need a text editor, Python's standard library, and about 90 minutes to implement the first three steps.

Step 1: Project brief file. A single project_brief.md with six fields - goal, tech stack, key files, definitions, acceptance criteria, and a "do not do" list. Read it at the start of every AI session. This is the write mechanic in its simplest form. It prevents the most common context failure: the model re-deriving what you are building from scratch on every conversation.

Step 2: Session checkpoint template. After each meaningful session, write a structured summary to session_summary.md with five fields: what changed, what was decided, what remains open, which files were touched, next step. Use this as the first message in the following session. This implements compress without any automation.

Step 3: Context profiles folder. Create a context_profiles/ directory with one YAML or markdown file per task type. Each file contains only the system instructions and tool list relevant to that task. Load the relevant profile at session start. This implements isolate.

Step 4: Simple retrieval. Index your project notes and documentation as 200-400 word chunks in a folder. The script below takes a query and returns the top-5 matching chunks by keyword overlap - no embeddings, no vector database, no external dependencies. Pass the results to the model instead of entire documents. This is the select mechanic at minimum viable complexity. Upgrade to embedding-based retrieval when this becomes a bottleneck.

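"""Minimal keyword retrieval: print the top-K chunks that best match a query."""
import re
import sys
from collections import Counter
from pathlib import Path

CHUNKS_DIR = Path("context/chunks")  # folder of 200-400 word .md files
TOP_K = 5


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def load_chunks() -> dict[str, str]:
    """Read every .md chunk into memory, keyed by file stem."""
    return {p.stem: p.read_text(encoding="utf-8") for p in sorted(CHUNKS_DIR.glob("*.md"))}


def score(chunk: str, query: str) -> int:
    """Count whole-word occurrences of each distinct query term in the chunk."""
    counts = Counter(tokenize(chunk))
    return sum(counts[term] for term in set(tokenize(query)))


def retrieve(query: str) -> list[str]:
    """Return up to TOP_K chunk texts, best match first, skipping chunks with no match."""
    scored = [(score(text, query), text) for text in load_chunks().values()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [text for _, text in ranked[:TOP_K]]


if __name__ == "__main__":
    # Usage: python retrieve.py your query terms
    results = retrieve(" ".join(sys.argv[1:]))
    print("\n---\n".join(results))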

Step 5: An evaluation harness. Define five tasks you run repeatedly - representative questions or generation tasks. Run them with raw chat only, then with project brief only, then with retrieval, then with retrieval plus session summary. Measure task success rate (how many outputs you accepted without editing) across all four configurations. This is your baseline. Any context change should move this number.
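The bookkeeping for this comparison fits in a few lines. In the sketch below, each configuration maps to a list of accept/edit judgments recorded by hand; the values shown are placeholders to demonstrate the format, not measured results, and `success_rates` is a hypothetical helper name.

```python
def success_rates(results: dict[str, list[bool]]) -> dict[str, float]:
    """Map each configuration to its share of outputs accepted without editing."""
    return {name: sum(flags) / len(flags) for name, flags in results.items() if flags}


if __name__ == "__main__":
    # True = output accepted as-is, False = output needed editing.
    # Placeholder judgments for five tasks per configuration.
    results = {
        "raw chat": [False, False, True, False, False],
        "project brief": [True, False, True, False, False],
        "retrieval": [True, True, True, False, False],
        "retrieval + session summary": [True, True, True, True, False],
    }
    for name, rate in success_rates(results).items():
        print(f"{name:<30} {rate:.0%}")
```

Re-running the same five tasks after every context change, and recording the judgments in the same structure, is what turns "it seems better now" into a comparable number.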

Illustrative benchmark across 20 standardized coding and content tasks comparing four context configurations. Based on methodology from Letta Context-Bench (October 2025). Each configuration adds one layer of structured context. Actual results vary by task domain, model, and implementation quality.

The starting point is whatever you will actually build this week. One project brief file, maintained honestly, outperforms a sophisticated retrieval pipeline that no one updates. Context engineering is an operational discipline - it compounds when practiced consistently, and degrades when neglected.

Teams that invest in context engineering are not buying a better model. They are building the operating layer that makes any model reliable. Gartner's prediction that 30% of generative AI projects would be abandoned after proof of concept names poor data quality among the causes - which is, at its core, a context problem. The performance gap between teams that instrument context and teams that iterate on prompts will widen as models improve, not narrow, because better models are better at using well-structured context. They are not better at compensating for bad context.

The model is table stakes. The discipline of the team that feeds it is the moat.

Sources

Anthropic Engineering - "Effective Context Engineering for AI Agents" (September 2025): anthropic.com/engineering/effective-context-engineering-for-ai-agents
Anthropic Engineering - "Demystifying Evals for AI Agents" (2026): anthropic.com/engineering/demystifying-evals-for-ai-agents
Liu et al. - "Lost in the Middle: How Language Models Use Long Contexts" (arXiv 2023, TACL 2024): arxiv.org/abs/2307.03172
100-LongBench - "Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?" (ACL 2025 Findings): arxiv.org/abs/2505.19293
LangChain - "Context Engineering for Agents" (July 2025): langchain.com/blog/context-engineering-for-agents
GitHub Blog - "Want Better AI Outputs? Try Context Engineering" (January 2026): github.blog
Letta - "Context-Bench: Benchmarking LLMs on Agentic Context Engineering" (October 2025): letta.com/blog/context-bench
Phil Schmid - "The New Skill in AI is Not Prompting, It's Context Engineering" (June 2025): philschmid.de/context-engineering
Gartner - "Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept by End of 2025" (July 2024): gartner.com
