A local-first knowledge system that captures errors once, retrieves solutions forever. Resolve recurring issues 12x faster with automatic symptom detection, multi-factor ranking, and zero dependencies.

Most teams solve the same problems repeatedly.

A database timeout occurs. Three hours of investigation. Root cause found: connection pool too small. Fix deployed. Incident resolved.

Forty-five days later, the exact same symptom appears on a different service. Different engineer. Same investigation. Same lost time.

This pattern repeats because solutions vanish. They exist in Slack threads from six months ago. They live in old incident tickets no one thinks to search. They're in the heads of engineers who've moved on.

I got tired of losing the same solutions.

So I built something different: a local-first knowledge management system that automatically captures every issue, generates structured solutions, and instantly retrieves proven answers when similar problems recur.

No cloud, no dependencies, no manual work beyond what you're already doing. This is the story of how I built it, why the architecture matters, and why you should consider building something similar for your team.


The Problem: Knowledge Evaporation

Let me be direct about what I observed:

  • First-incident cost: Investigation, root cause analysis, fix, deploy. 2-3 hours minimum for a real incident.
  • Second-incident cost: Same incident, same investigation, same 2-3 hours. The solution was never retrievable.
  • Organizational cost: Knowledge leaves when people leave. Expertise is ephemeral.

I wanted to build something that treated solutions like code: durable, searchable, versioned, learnable.

The constraint was simple: zero external dependencies, zero network calls, works completely offline. If a system requires authentication, cloud storage, or vendor lock-in, teams won't adopt it. It has to be simpler than the problem it solves.


Part I: Vision & Architecture

The Vision: A Personal Stack Overflow

Here's what I imagined:

You hit an error. Before debugging, the system searches your knowledge base for similar past issues. Instantly. Offline. Ranked by relevance, confidence, and how often that solution has worked.

No copy-pasting from Slack threads. No hunting through ticket history. No "I think we fixed this once but I don't remember how."

Instead: proven solutions, ranked by how trustworthy they are, immediately available.

The system learns as you use it. First time you capture an issue, it has low confidence. Fix it once, confidence goes up. Use that fix five times successfully, it becomes your most trusted solution for that symptom.

Over time, your team builds a personalized Stack Overflow—but it contains only your actual solutions, ranked by your actual experience.

The Architecture: Simplicity by Design

I made one architectural decision early: everything local, everything inspectable, everything human-readable.

No databases. No cloud. No compiled binary formats. Just JSON files and directories that you can understand by reading them.

Here's the knowledge base structure:

~/.knowledge_base/
├── issues/               # Captures (date-partitioned)
│   ├── 2026-03-15/
│   │   └── timeout-api-error-550e8400.json
│   └── 2026-03-20/
│       └── connection-pool-550e8401.json
├── postmortems/          # Solutions (date-partitioned)
│   ├── 2026-03-15/
│   │   └── 550e8400-postmortem.json
│   └── 2026-03-20/
│       └── 550e8401-postmortem.json
├── qa/
│   └── qa_index.jsonl    # Generated Q&A (one per line)
└── symptom_index/
    └── symptom_index.jsonl # Symptom mappings

Every file is JSON. Every directory is dated. If you want to understand what your knowledge base contains, you can read it with a text editor.
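As a sketch, the capture step that populates this tree might look like the following. The function name and the filename slug convention are my guesses for illustration, not the skill's actual code:

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def capture_issue(description, symptoms, base=Path.home() / ".knowledge_base"):
    """Write a capture as JSON under issues/<YYYY-MM-DD>/, mirroring the layout above.

    The slug convention (first symptoms + short UUID prefix) is an assumption.
    """
    now = datetime.now(timezone.utc)
    issue = {
        "id": str(uuid.uuid4()),
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "description": description,
        "symptoms": list(symptoms),
    }
    # Date-partitioned directory, e.g. issues/2026-03-15/
    day_dir = base / "issues" / now.strftime("%Y-%m-%d")
    day_dir.mkdir(parents=True, exist_ok=True)
    slug = "-".join(symptoms[:2]) + "-" + issue["id"].split("-")[0]
    (day_dir / f"{slug}.json").write_text(json.dumps(issue, indent=2))
    return issue
```

Because every capture is an ordinary JSON file in a dated directory, archiving a month of history is just moving a folder.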

The data flow is equally straightforward:

Issue (capture)
    ↓
Postmortem (analyze & document)
    ↓
Q&A (auto-generate from postmortem)
    ↓
Symptom Index (map Q&A to symptoms)
    ↓
Retriever (search & rank)
    ↓
Instant Solution

You capture an issue with symptoms. You investigate. Once you find the root cause, you generate a postmortem encoding the root cause, the fix, and prevention steps. The system automatically extracts Q&A from that postmortem and indexes it by symptom. Next time someone searches that symptom, your solution appears ranked by confidence.

The entire system is 750 lines of Python. No frameworks. No external dependencies. Runs on Python 3.8+.


Part II: The Build Journey

Phase 1: Understanding the Data

Before I could build a retrieval system, I had to understand what data should flow through it.

I started with three example JSON files:

Example Issue (captured at incident time):

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2026-03-15T14:32:00Z",
  "description": "Database timeout during traffic spike",
  "symptoms": ["timeout", "latency_high", "api_error"]
}
Example Postmortem (generated after investigation):

{
  "issue_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2026-03-15T16:45:00Z",
  "root_cause": "Connection pool size 10 insufficient for 50 concurrent requests",
  "resolution": "Increased pool to 50, added exponential backoff",
  "prevention": "Monitor pool utilization, load test deployments"
}

Example Q&A (auto-generated from postmortem):

{
  "issue_id": "550e8400-e29b-41d4-a716-446655440000",
  "question": "What causes database timeout during high traffic?",
  "answer": "Connection pool exhaustion. Solution: increase pool size and add exponential backoff.",
  "symptoms": ["timeout", "latency_high", "api_error"],
  "confidence": 0.5,
  "usage_count": 0
}
The Q&A is the core artifact. It connects symptoms to solutions. It has a confidence score. It tracks usage. This is the thing that gets searched.
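The generation step that turns an issue/postmortem pair into a Q&A entry can be sketched as a simple template over the postmortem fields. The question and answer phrasing below is illustrative; the skill generates its own wording:

```python
def qa_from_postmortem(issue: dict, postmortem: dict) -> dict:
    """Turn an issue + postmortem pair into a searchable Q&A entry.

    The question/answer templates are my approximation of the idea,
    not the skill's actual generator.
    """
    return {
        "issue_id": issue["id"],
        "question": f"What causes {issue['description'].lower()}?",
        "answer": f"{postmortem['root_cause']}. Solution: {postmortem['resolution']}.",
        "symptoms": list(issue["symptoms"]),
        "confidence": 0.5,   # starts low; rises only with successful reuse
        "usage_count": 0,
    }
```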

Phase 2: Building the Retrieval Algorithm

Once I understood the data, I built the retrieval engine. This is where the system actually becomes useful.

The naive approach: return all Q&A entries with matching symptoms. Fast, but unhelpful. If you search for "timeout," you get 40 results.

My approach: rank by multiple factors:

  • Symptom match (50%): Does the solution address your symptom?
  • Confidence (30%): How reliable is the solution based on past outcomes?
  • Recency (10%): Newer solutions preferred (systems evolve).
  • Usage (10%): Frequently-used solutions more trusted.

When you search for "timeout," the system returns ranked results:

1. Database connection pool exhaustion (0.89)
   - Matches: timeout, latency_high, api_error
   - Confidence: 0.85, Used 5x, Recent

2. Network timeout due to DNS (0.72)
   - Matches: timeout
   - Confidence: 0.65, Used 1x, Older

3. Client timeout misconfiguration (0.68)
   - Matches: timeout
   - Confidence: 0.60, Used 0x, Newest

The highest-ranking solution isn't just the most recent or the most used. It's the one most likely to solve your problem based on the multi-factor ranking.

Phase 3: Automatic Detection via Hooks

I realized the system would only work if it was automatic. If it required manual invocation, people would forget to use it.

I integrated with Claude Code's hook system. Now when you run a Bash command and it fails, three things happen automatically:

  1. The error message is mapped to a symptom (timeout, dependency_failure, auth_failure, etc.).
  2. The knowledge base is searched for past solutions.
  3. If solutions exist, you see them immediately before investigating.

The same happens when you describe an error in your message. The system recognizes patterns like "NameError," "doesn't work," "broken," "can't connect," and proactively searches for related issues.

You don't have to opt in. You don't have to think about it. The system is always running in the background, learning as you work.
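The error-to-symptom mapping can be approximated with a small keyword table. The four patterns below are my sketch of the idea; the real skill defines its own 14 predefined symptoms and matching rules:

```python
import re
from typing import List

# Illustrative keyword-to-symptom table; patterns are assumptions, not
# the skill's actual list.
SYMPTOM_PATTERNS = {
    "timeout": re.compile(r"timed? ?out|deadline exceeded", re.I),
    "auth_failure": re.compile(r"\b40[13]\b|unauthorized|permission denied", re.I),
    "dependency_failure": re.compile(r"no module named|cannot find module", re.I),
    "api_error": re.compile(r"\b5\d\d\b|internal server error", re.I),
}

def detect_symptoms(error_text: str) -> List[str]:
    """Return every known symptom whose pattern appears in the error text."""
    return [name for name, pat in SYMPTOM_PATTERNS.items() if pat.search(error_text)]
```

A hook that runs this on failed command output, then queries the index for each detected symptom, is all the "automatic" part requires.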

Phase 4: Quality Constraints

Early on, I realized the system could accumulate garbage. Vague root causes. Speculation. Solutions that don't actually work.

I added quality constraints:

  • Root cause validation: Must describe specific technical failure, not vague statements.
  • Resolution validation: Must be concrete, measurable, actionable.
  • Symptom validation: Only 14 predefined symptoms. No custom garbage.
  • Confidence scoring: Starts low, increases only through successful reuse.

The first time you capture an issue, the solution starts at 0.5 confidence. You've identified a problem and a fix, but you haven't proven it works at scale yet. Use it successfully five times, confidence climbs to 0.85. This creates a natural feedback loop: better solutions surface automatically.
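The confidence curve described above (0.5 at capture, roughly 0.85 after five successful uses) can be modeled as a simple asymptotic update. This formula is my sketch of the feedback loop, not the skill's exact code; the rate of 0.21 is chosen to reproduce that curve:

```python
def update_confidence(confidence: float, success: bool, rate: float = 0.21) -> float:
    """Nudge confidence toward 1.0 on success, toward 0.0 on failure.

    With rate=0.21, five consecutive successes take a fresh 0.5 entry
    to roughly 0.85; the exact rate is an assumption.
    """
    if success:
        return round(confidence + rate * (1.0 - confidence), 3)
    return round(confidence * (1.0 - rate), 3)
```

An asymptotic update like this never reaches 1.0, which is the point: no solution is ever treated as infallible, only as increasingly trustworthy.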


Part III: Real Impact

What You Actually Get: Three Use Cases

Use Case 1: The Repeat Incident

You're on-call. An alert fires: database latency spike. You start investigating. Before diving into logs, Claude Code searches the knowledge base. It finds: "Database connection pool exhaustion - 0.89 confidence, used 5x successfully."

Instead of a 2-hour investigation, you apply the known fix in 10 minutes. Incident resolved.

Use Case 2: Onboarding Knowledge

A junior engineer joins the team. They hit their first error. Instead of Slack-surfing or bothering colleagues, they get instant access to the team's actual solutions, ranked by trustworthiness.

They're not learning generic Stack Overflow answers. They're learning how your specific systems actually fail and how your team actually fixes them.

Use Case 3: Knowledge Retention

An engineer leaves the team. Their solutions don't leave with them. The knowledge base contains their postmortems, their fixes, their prevention strategies. The system continues surfacing their solutions when relevant.

Your organizational knowledge is no longer fragile.

Exploring the Knowledge Base: A Real Example

Here's what your knowledge base actually looks like after a month of use:

~/.knowledge_base/
├── issues/
│   ├── 2026-03-01/
│   │   ├── api-timeout-550e8400.json
│   │   └── auth-failure-550e8401.json
│   ├── 2026-03-15/
│   │   ├── db-timeout-550e8402.json
│   │   ├── config-error-550e8403.json
│   │   └── memory-leak-550e8404.json
│   └── 2026-03-20/
│       └── dns-issue-550e8405.json
├── postmortems/
│   ├── 2026-03-01/
│   │   ├── 550e8400-postmortem.json
│   │   └── 550e8401-postmortem.json
│   ├── 2026-03-15/
│   │   ├── 550e8402-postmortem.json
│   │   ├── 550e8403-postmortem.json
│   │   └── 550e8404-postmortem.json
│   └── 2026-03-20/
│       └── 550e8405-postmortem.json
├── qa/
│   └── qa_index.jsonl          # 7 Q&A entries (one per line)
└── symptom_index/
    ├── timeout.jsonl            # Points to Q&A for timeout symptom
    ├── api_error.jsonl          # Points to Q&A for api_error symptom
    ├── auth_failure.jsonl       # Points to Q&A for auth_failure symptom
    ├── config_error.jsonl       # Points to Q&A for config_error symptom
    └── memory_leak.jsonl        # Points to Q&A for memory_leak symptom

Every file is human-readable JSON. Every directory is date-partitioned for easy archiving and cleanup. The symptom index makes searches instant: O(1) lookup instead of scanning every issue.

Want to understand what your team has learned about timeouts? Open ~/.knowledge_base/symptom_index/timeout.jsonl and read it. Every line is a Q&A entry mapping to a postmortem. That's your team's collective intelligence about timeout issues, ranked by confidence.
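Reading such an index file takes only a few lines. The sketch below assumes the field names from the Q&A example earlier; it is not the skill's actual retriever:

```python
import json
from pathlib import Path

def load_symptom_entries(symptom: str, base: Path = Path.home() / ".knowledge_base"):
    """Read one symptom's JSONL index, highest-confidence entries first."""
    index_file = base / "symptom_index" / f"{symptom}.jsonl"
    if not index_file.exists():
        return []
    with index_file.open() as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return sorted(entries, key=lambda e: e.get("confidence", 0.0), reverse=True)
```

Because each symptom has its own file, a search touches only the entries relevant to that symptom rather than the whole knowledge base.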

Why Local-First Matters

I could have built this on the cloud. Sync to a database. Share across teams. Add authentication.

I didn't. Here's why:

  • Instant adoption: No signup, no credentials, no account management. Run setup.sh and start using it.
  • Full control: Your knowledge is in ~/.knowledge_base/, yours alone. No vendor lock-in.
  • Works offline: You can search solutions without internet. Essential for incident response.
  • Zero dependencies: Python stdlib only. No risk of package supply-chain attacks.
  • Confidentiality: Proprietary incident data stays local. Sensitive fixes never transit the internet.

The trade-off: you can't share across teams automatically. But you can git-sync the ~/.knowledge_base/ directory if you want to. The architecture supports it without forcing it.


Part IV: Getting Started

The Learning Curve: Minimal

The system requires learning exactly three commands:

Capture an issue:

python3 scripts/cli.py capture \
  --description "Database timeout during deployment" \
  --symptoms timeout,api_error

Generate a postmortem (after investigation):

python3 scripts/cli.py postmortem \
  --issue-id {uuid} \
  --root-cause "Connection pool exhausted" \
  --resolution "Increased pool to 50" \
  --prevention "Monitor pool utilization"

Search for solutions:

python3 scripts/cli.py search --symptom timeout

That's it. Everything else is optional. You can view stats, list recent issues, examine specific issues, but 90% of your workflow is capture → postmortem → search.

With the hooks installed, even the capture and search become automatic. You're just documenting postmortems as you solve incidents.

The Cost-Benefit: 8:1 ROI on First Reuse

Here's the actual math:

  • Capture: 5 minutes (documenting the issue)
  • Postmortem: 10 minutes (after investigation)
  • Total investment per issue: 15 minutes

Payoff: Next time that issue recurs, you save 2 hours of investigation.

ROI: 15 minutes of work = 2 hours saved. That's 8:1 on the first reuse. By the fifth reuse, you've saved 10 hours for 15 minutes of documentation.

For a team of five engineers, if each engineer avoids just four repeat investigations per quarter (2 hours each), you're saving 40 hours per quarter. That's a full week of engineering time.

What I Learned Building This

Insight 1: Simplicity is a feature. The system works because it's small enough to understand. 750 lines of Python. No frameworks, no layers of abstraction. Every engineer can read it. That builds trust.

Insight 2: Humans are terrible at searching. The multi-factor ranking algorithm solves a real problem. When you search for "timeout," you don't want all results equally weighted. You want the solution that's actually solved timeout for you before, ranked first.

Insight 3: Automation is everything. Manual systems die. The moment you have to think about using something, it becomes friction. The hook system solved this. Automated capture and search means the system becomes part of your workflow, not a separate tool.

Insight 4: Local-first is radical. Most systems assume cloud. Most assume shared infrastructure. Local-first is contrarian. But for knowledge management, it's right. Your incident data is sensitive. It should stay local.

Insight 5: Confidence decay is necessary. A solution that worked six months ago might not work today. Your system changed. Dependencies evolved. So confidence scores decay over time, and newer solutions are preferred. This keeps your knowledge base current.
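One way to implement that decay is to discount a stale entry's confidence at read time rather than rewriting it on disk. This is a sketch under assumed parameters (a 180-day half-life and a 0.3 floor), not the skill's actual formula:

```python
from datetime import datetime, timezone

def decayed_confidence(confidence, last_used_iso, now=None,
                       half_life_days=180.0, floor=0.3):
    """Halve the effective confidence for every half_life_days of disuse.

    The half-life and floor are illustrative choices; the floor keeps
    old-but-documented solutions from vanishing entirely.
    """
    now = now or datetime.now(timezone.utc)
    last_used = datetime.fromisoformat(last_used_iso.replace("Z", "+00:00"))
    age_days = (now - last_used).days
    return max(floor, confidence * 0.5 ** (age_days / half_life_days))
```

Decaying at read time means the stored confidence stays an honest record of past outcomes, while search results still favor recently-validated fixes.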

How to Start: Three Steps

Step 1: Install

git clone https://github.com/andrei-ionut-nita/issue-search-skill.git ~/.claude/skills/issue-search-skill
cd ~/.claude/skills/issue-search-skill
./setup.sh

The setup script creates ~/.knowledge_base/, registers hooks, and runs tests. Takes two minutes.

Step 2: Capture Your First Issue

Next time you hit an error, run:

python3 ~/.claude/skills/issue-search-skill/scripts/cli.py capture \
  --description "Your error description" \
  --symptoms timeout  # or api_error, auth_failure, etc.

Step 3: Generate a Postmortem

Once you've fixed it, document the solution:

python3 ~/.claude/skills/issue-search-skill/scripts/cli.py postmortem \
  --issue-id {uuid-from-step-2} \
  --root-cause "What actually caused it" \
  --resolution "What fixed it" \
  --prevention "How to avoid it next time"

Now your solution is indexed and searchable forever.


Conclusion: Knowledge That Lasts

Most teams solve the same problems repeatedly because solutions are ephemeral. They live in Slack threads and people's heads. When engineers leave, their solutions leave with them.

I built a personal Stack Overflow to change that.

A system that automatically captures issues, generates solutions, and retrieves proven answers when similar problems recur. Local-first. Zero dependencies. Fully inspectable. No external services.

The knowledge base is just JSON files in ~/.knowledge_base/. You can read it. You can version it. You can back it up. You own it completely.

After one month of use, you'll have captured 10-15 issues. Those issues will recur. Your knowledge base will start surfacing proven solutions automatically. Within six months, you'll have prevented dozens of re-investigations and created institutional memory that persists beyond team changes.

Your junior engineers will onboard faster. Your incident response times will drop. Your team will compound knowledge instead of re-discovering it.

Relentlessness beats intensity. Consistency beats urgency.

A system that learns from your actual incidents, that retains knowledge your team generates, that makes proven solutions instantly available—that's a system that compounds.


Building your own personal Stack Overflow isn't just about speed. It's about turning ephemeral knowledge into durable infrastructure. Ready to start? Clone the repository, run setup.sh, and capture your first issue. Your future self will thank you.

Scaling incident response around an intelligent knowledge system? I help engineering teams build post-incident processes that actually stick. Schedule a consultation →