Decoding AI Magazine

Your Second Brain Is a Graveyard. Make It Agent Memory.

Paul Iusztin — Tue, 07 Jul 2026 05:01:21 GMT

I spent 18 months turning my second brain into my living research memory. My digital life contains 10,994 notes: over 5,000 in Obsidian, another 5,000 in Readwise, plus more in Notion and Google Drive, growing by roughly 250 a month.

The reality is that most of my notes and bookmarks transformed into a graveyard. When I start a new article or open a codebase, I can’t recall the high-signal notes I already have, so I work without them or burn an hour digging.

The reflex is to reach for Codex or NotebookLM. But those are harnesses: they sit on top of your work, and the moment the chat ends, the context window loses all of it.

More context isn’t the fix. You re-paste the same links and re-explain the same project every session, while past research sits inert in a vault your agents can’t reach.

So , CTO of Towards AI, and I built the alternative, and we’ll show you how: your own AI Research OS. It’s a memory layer that sits between your second brain and any harness (Codex, Claude Code, or your own). It runs deep research across your notes and the open web, and stores what it finds as an LLM wiki you can query, maintain, and grow.

We first touched on this subject in a talk at the AI Engineer World’s Fair. After we saw that people loved it (it was published as a keynote in the online tracks - out of 82 videos - and is the 2nd most popular one), I decided to write an article about it as well.

By the end you’ll learn how to turn your 10,994 notes into a queryable LLM wiki that your AI agents can use as their agent memory, useful for research, coding and content creation.

You will build an LLM wiki as in the image below, where the payoff already compounds with just 12 sources:

Or go crazier with a larger wiki:

By the end, you will also see why the wikis don’t contain all my 10,994 notes to keep ingestion costs low. Let’s go.

Want your AI work featured across three platforms?

I'm looking for one builder to co-create a customer story on how you used Opik to solve a real-world problem in your AI product.

Here's what you get: from a 1-hour interview where you show off what you built, I'll produce a co-authored article, a YouTube video, and a run of social posts, all spotlighting your work. Distributed across both Decoding AI and Opik (by Comet) channels to 200,000+ AI engineers and decision-makers.

Here are the last two:

If you want to collaborate, email me at pauliusztin@decodingai.com

Email me

Fine print: your product or project must use Opik. This is a collaboration between Decoding AI and Opik (by Comet).

Why a Bigger Context Window Won’t Save You

Most research doesn’t need this system. If you want a quick answer or a one-off, Google it, or reach for Codex and Claude Code. You need this only when the research has to stick and an agent has to reuse those sources later.

You don’t need a giant vault either. This is about topic density, not size. It already pays off on a topic-sized cluster: around 20 notes on one subject, or 2 or 3 repos, like coding agents such as Pi, OpenCode, and Aider. Louis-François runs it on a few hundred notes, not my 10,000+, and it still earns its keep.

Why not NotebookLM? It’s great at digesting sources, but you don’t own it, you can’t personalize it deeply, it isn’t agent-native, and it’s weak for coding since it’s browser-bound.

Why not a vector-database RAG pipeline? That’s the right call at production scale, but it’s infrastructure. It’s hard to inspect or edit by hand, and overkill for a personal tool you open every day.

So you build it yourself: a personalized research assistant that uses an LLM wiki knowledge base that’s light and human-friendly as its agent memory. The perfect context engineering technique as an interface between the human and the AI agent.

Now, I want to go over, step by step, how we reached our final deep research + LLM wiki design, with the “why” and “how” in mind.

The Deep-Research Loop, Version 1: Mining the Public Web

A year ago, building the Agent Engineering course, we scoped the loop tightly: give it a topic plus a few hand-picked golden links, and get back a single static research.md.

The deep research algorithm first scrapes the golden links for seed context. Knowing them up front lets the agent frame better questions. Then it runs query rounds: one orchestrator generates the questions, and a sub-agent per question searches (Gemini grounded in Google) and returns links plus a summary, which the orchestrator aggregates so the context never explodes.

Version 1 in one line: topic + golden links → deep-research algorithm → one static research.md.

Three rounds of 6 queries surface 40–50 links, which is too noisy to keep whole. A ranking step scores each source against the topic. Only the top-K get fully scraped, the rest are kept as summaries, and everything is compiled into one flat research.md. That version generated 35 course lessons fast.

The deep-research algorithm unpacked: orchestrator, six sub-agents per round, three rounds, ranking, then top-K full scrapes into research.md.

It worked for the course, but it was aimed at the generic public web, with every golden link hand-picked from our own second brain.

Version 2: Point the Loop at Your Second Brain

Same loop, new target: aim it at your own sources instead of the public web, where you’ve already filtered what matters. Your second brain is a curated set of golden links, so the input shrinks to just a topic and the loop finds the rest.

Version 2’s core move: take the same loop and aim it at your own sources instead of the open web.

You plug in Obsidian, Readwise, NotebookLM, and GitHub Stars, then extend with Gemini Deep Research, YouTube, Google Drive, or Notion. Code is a special case: a GitHub sub-agent clones a repo and builds a high-level architecture note.

The token-efficiency trick lives in the reranker. It scores each candidate from 0 to 1 against your question using only its metadata and summary, never the full text, then passes only the top-K to the model.

The loop over your own sources: topic-only input, personal sources plugged in, same rank-and-scrape tail.

The new problem is that the output is frozen. You still end up with a static research.md, and real research isn’t static. You want another question answered, or part of it goes stale, and re-running the loop from scratch is expensive in tokens and time. The fix is a new layer that sits on top of the raw data: the LLM wiki.

Version 3: From a Static Pile to a Living Wiki

The pivot came from Andrej Karpathy’s idea of LLM-maintained knowledge-base wikis [1]. Instead of re-deriving knowledge from raw documents on every query, which is the RAG pattern, the LLM incrementally builds and maintains a persistent, interlinked wiki between you and your sources.

Since I realized the power of LLM wikis, I started to use them for all my personal setups as my agent memory. It’s simple, powerful and beautiful.

Since giving this talk, we found that Google had published its Open Knowledge Format (OKF), an open spec that “formalizes the LLM-wiki pattern into a portable, interoperable format” [2]. It’s built on the very blocks you’ll see below: plain markdown with YAML frontmatter, an index, a log, and links that form a graph. When Google independently ships the same architecture as a standard, you know the direction is right.

During ingestion it generates, as a byproduct, derivatives over your raw sources: entities and concepts, comparisons between them, notes tied to your questions, open questions, and a synthesis:

Storing each finding as an individual raw file, instead of one flat research.md, is the one change that makes everything downstream queryable and growable. Sources can include Obsidian, GitHub, Google Drive, even custom URLs (plain curl for simple sites, Bright Data for bot-walled ones).

Sources → deep research → store raw files → index → generate wiki → query. The version-3 pipeline end to end.

The natural fear is that “index + wiki + query” means vector databases and knowledge graphs. It doesn’t, and that’s the surprising part.

A Memory Layer Built From Plain Files and No Database

Vector databases, knowledge graphs, and semantic and text search all add real complexity, too much for a personal research OS. So drop that infrastructure and build the whole thing on files and references: no database, just a simple index rooted in how your filesystem already works.

The index is the retrieval layer. An agent reads a single index.yaml first, a catalog with a summary and metadata for each source (original file, origin, title, authors, date).

There are 3 layers. A raw folder holds the immutable source data, a wiki folder holds the LLM-generated derivatives, and the index points to all of it. A real example is an index.yaml cataloging 10 sources and 38 wiki pages.

The index is the retrieval layer: one YAML catalog the agent reads first, pointing to everything else.

You even get a knowledge graph for free. Because everything is Obsidian-flavored markdown with references, Obsidian’s local graph renders the connections out of the box: entities like OpenCode or MCP linked to concepts like tool registry, context compaction, and sandboxing, and the sources they touch.

Obsidian is only a viewer, though. The whole system runs from any working directory through Codex or Claude Code, so the graph is a bonus on top.

Querying the Wiki, and Why It Never Freezes

The agent queries through progressive disclosure. It starts at the index.yaml summaries. If it needs more, it opens the source wiki page (an expanded summary), and often stops there.

If not, it follows references into the derivatives (entities, concepts, comparisons, notes). Only as a last resort does it read the raw source. Summaries are computed once, at ingestion, so the context window stays small.

The query drill-down: each level answers most questions, so the agent rarely touches raw sources and context stays small.

The wiki is also alive. Ask a question and the LLM can spawn a new concept, note, or comparison, and each question is written to a log. So the wiki evolves as you talk to it, not only when you ingest data.

Every question leaves a trace: new notes and comparisons accrete, and the log tracks the whole history.

It’s never frozen, and it grows 3 ways: ingest a new custom link, run another deep-research round, or let it create derivatives purely from your questions.

The wiki grows three ways: new links, new research rounds, and new derivatives born from your questions.

One design choice makes this safe to run against a decade of notes: the wiki never sits directly on your whole second brain.

Scope It to a Project With PARA

Your second brain stays an immutable snapshot. I organize everything with the PARA method: Projects, Areas, Resources, Archive [3]. Sources are piped into Resources as a flat list, then referenced into Projects and Areas. So Obsidian stays read-only; I don’t want the LLM editing notes I write by hand.

PARA keeps your global second brain an immutable, read-only snapshot the LLM never writes to.

The wiki is scoped per project, not global. When you start something new, you reference that snapshot through the deep-research loop and scope it down to the project, running the loop or ingesting specific repos, articles, and notes via skills plugged into a harness.

Scoping is a scale choice as much as a safety one: plain files stay fast and inspectable at project size, but a whole-vault wiki of thousands of documents is exactly where a vector or graph database finally earns its place.

That’s the ~100-note wiki from the top, kept small by scoping to one project, yet reaching my entire 10,000+ note second brain through the loop.

Scope the snapshot down into a project wiki, then let an agent turn that research into the actual work.

The mental model is that the project is the work and the second brain is the research. A project is anything where you turn research into work: an article, a video, a set of slides (this talk was built this way), or a whole codebase. That’s the whole architecture, and the fastest way to see why it matters is to watch it run.

See It Run: Three Demos

Everything ships as one open-source Claude Code plugin you can clone and run on your own notes (https://github.com/iusztinpaul/ai-research-os-workshop), tweakable for any harness.

It’s four skills: /research (build or query the wiki), /research-distill (a per-piece research.md), /research-lint (health-checks), and /research-render (slides or a brief).

Demo 1 is research from a brain dump around “agentic harnesses”. You point the research skill at a file of notes plus a few must-include references. It scrapes those for seed context and picks a depth: fast (1 round of 3 queries), light (2 rounds, 3 then 2), or deep (3 rounds of 3). Out come the raw files + the LLM wiki:

Demo 2 is ingesting and comparing GitHub repos. You give it 3 harness repos (OpenCode, Pi, Hermes), no deep research, scoped to architecture, sub-agents, memory, and the permission flow. It clones each, writes per-repo notes, then builds cross-repo comparisons.

Demo 3 is ad-hoc links on GraphRAG, then questions to create notes. You hand it 3 arbitrary links to build a wiki, then ask how agentic GraphRAG differs from a plain knowledge graph. It answers off the wiki and updates it in place.

It was super hard to show the full demos here. You can find the full examples within the Examples section of the repo or watch me go over them in Obsidian and Zed in the full video.

Where This Is Going

Some connectors are missing on purpose (Google Drive, Notion, Slack) — the point is for you to add what you need. Louis-François added YouTube transcript ingestion in seconds with one prompt to Codex.

Two weaknesses stand out. The byproducts are often poorly written, fixable with the article writing profiles I already use. And as data grows, the LLM makes subtle errors: concept confusion (comparing Claude Code and OpenCode, it merged how each handles tool calls into one explanation) and superficiality (the Claude Code queuing system took many follow-ups to get right).

The fix is constant linting, which is what /research-lint does: it sweeps the wiki for orphan sources, broken links, stale claims, and contradictions.

What’s Next

Everything you saw here is open source. Clone the repo (or install it as a Claude Code plugin) and run it on your own notes today: https://github.com/iusztinpaul/ai-research-os-workshop

If you’d rather see the whole system in action first, watch the full talk and I gave at the AI Engineer World’s Fair:

And if you want to go beyond a workshop repo, such as building a deep-research-plus-writing multi-agent system from scratch and shipping it to production with AI evals and an observability layer on top, that’s exactly what we teach in our Agent Engineering: Building Multi-Agent Systems course.

The multi-agent system built during the Agent Engineering: Building Multi-Agent Systems course

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

Go from agent user to agent builder. Master the foundations of AI agents and turn fragile demo code into reliable, production-ready systems with my course, Agent Engineering: Building Multi-Agent Systems (made with Towards AI).

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Explore Next

Karpathy, A. (n.d.). LLM Wiki: A pattern for building personal knowledge bases using LLMs. GitHub. https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

Google Cloud. (n.d.). How the Open Knowledge Format can improve data sharing. Google Cloud Blog. https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing
Forte, T. (n.d.). The PARA Method: The Simple System for Organizing Your Digital Life. Forte Labs. https://fortelabs.com/blog/para/

Images

If not otherwise stated, all images are created by the author.

From Harness Lock-In to Portable Context Layer

Paul Iusztin — Tue, 30 Jun 2026 08:20:17 GMT

Models are commoditizing fast. Harnesses already have. What you care about is your research, your notes, your conversations, your tasks, your preferences, and your domain knowledge.

“Free” open-source harnesses don’t make you free. What you want to own is the context layer. Every agent is the same model + runtime + harness underneath, rebuildable on an open stack [1].

I hit this myself. The deeper I built into a single harness, the clearer it got how much I’d lose the day I had to leave. The loss comes in 3 forms.

Failure 1 — You start from scratch. You’ve run Claude Code for 6 months. You switch to an open model and every past conversation, every preference the agent learned about you, is gone. You begin again from zero.

Failure 2 — Your skills are hostage. The harder lock-in is your business logic. If your skills and workflows are coupled to one harness’s features and keywords, switching doesn’t just lose chat history — your custom logic breaks or quietly performs worse on the next tool.

Failure 3 — You’re billed at their mercy. The plan you depend on can be pulled, suddenly gated behind expensive pay-as-you-go, restricted from the latest models (e.g., the Fable story), or repriced from 200 to 1,000 overnight. With high switching friction, you can’t leave. You never know how the AI world can change. But for sure it won’t stay as is now.

So what do you do?

Keep your context layer detached from the harness. Any harness, any model (open or closed), plugs in and within ~5 minutes knows who you are, what you’re working on, what matters to you and how you like things to get done.

Everything that follows is based on the two projects I am working on: Scrabble, my own context layer anchored into my Obsidian notes, Readwise library, Notion, and Google ecosystem; and Tree, the project I am working on with for our next book, where we teach how to build a personal AI assistant from scratch.

The Architecture of Your Context Layer

A context layer is made out of 3 core components. First, a unified memory that holds everything you know. Second, a serving layer (MCP server or skills) that is the interface to that memory and the business logic that defines how the memory is used. The harness sits on top and is deliberately disposable.

The unified memory fuses what used to be separate systems: a filesystem, keyword search (BM25), semantic vector search, and a knowledge graph of typed POLE+O nouns (Person, Organization, Location, Event, Object), all in one place. The goal is a memory that belongs to you.

What I always preach is to try your best to build your unified memory on top of a single database that supports text, semantic, and graph search (such as MongoDB). Start simple and add complexity only when your use case demands it.

The serving layer has 2 angles. The first is an MCP server, surfacing Tools, Resources, Prompts, Skills, and even MCP Apps (e.g. a graph visualizer) [2]. It wraps the business logic for how memory is queried and updated. That’s what makes it portable.

The second form skips the server entirely, which is based on skills shipped straight on the filesystem (an AGENTS.md plus a folder of skills). Skills are an interface too. They’re leaner, with nothing to run, but more tied to a given harness’s conventions. It’s how Scrabble’s wiki layer gets consumed.

In practice the strongest setups do both: a portable MCP server for memory access, skills layered on top for higher-level workflows [2]. To avoid fragmenting your business logic, host the skills directly on your MCP servers, coupling them with the rest of the server’s business logic.

In Tree, where I am building a unified memory powered by knowledge graphs, everything is discovered as ordinary MCP tools. That is exactly why “swap the harness, keep the memory” is a one-line config change, not a migration.

The context layer: Interchangeable harnesses over an MCP-server interface over one unified memory you own.

Here is what happens when you interact with the context layer:

You enter a prompt into the harness, which becomes an MCP tool call to the server. The server queries the unified memory, and results flow back into the agent’s context. Nothing crazy so far.

The interesting part happens when you start adding stuff to your memory. It’s where continual learning happens.

Building a Unified Memory for Continual Learning

At a high level, the pipeline is simple. Ingest data into the unified memory → extract entities and relationships into a graph → index for hybrid and graph retrieval → expose everything through an MCP server → connect it to a harness like Claude Code or OpenCode. Building the knowledge graph was the easy part. Designing how the agent interacts with it was the hard part.

Don’t expose raw database operations. Agents struggle with them. Instead, design the server around how agents actually search and write, giving them high-leverage primitives they can compose like LEGOs inside skills. In Tree that’s 6 tools: 3 to search, 3 to write.

There are 3 search tools. nl_query_memory is the default — an LLM maps a natural-language (NL) question into a MongoDB hybrid-search + graph query. query_memory is the deterministic fallback for structured filters when NL fails. deep_search_memory handles large result sets (50+ docs) via progressive disclosure: it writes intermediate results to a YAML index creating a localized LLM wiki on the fly.

The distinction between nl_query_memory / query_memory and deep_search_memory is important because when retrieving chunks from the unified memory, you need to further compress them before adding them into the context window to avoid exploding your costs. But that means you lose a lot of information. Thus, creating a wiki on the fly trades latency for performance and lower costs.

There are also 3 write tools: ingest_url, ingest_file, and ingest_conversation.

The last is auto-triggered by a hook that ingests incrementally every ~10 conversation turns and finalizes when the conversation ends. That hook is what makes the learning loop continual: the system writes back what it just learned without you asking, so it gets smarter the more you use it.

And “smarter” is concrete. Across tens of thousands of notes, it finds links between things you’d long forgotten you had, connections you’d never have made by hand. It’s how you finally harvest your own past work and your graveyard of bookmarks.

The continual-learning loop: The agent queries the graph through the MCP tools, and every conversation is ingested back into the unified memory.

Now, let’s see how to actually build the unified memory layer.

Build It Three Ways

You’re implementing 3 pieces: the unified memory, your custom business logic, and the serving layer. How much you build yourself is also split into 3 levels of effort, which translates to a build vs. buy discussion.

Level 1: Build all three from scratch on a database that supports everything at once (such as MongoDB), implementing the memory and the MCP serving layer yourself. Maximum control, maximum effort.

Level 2: Build only the business logic and serving layer, on top of an SDK that already implements the memory layer: Graphiti, a bi-temporal entity/fact/episode graph; Neo4j-Labs’ agent-memory, a graph-native POLE+O store; or mem0, a vector-first layer that stores facts as rows.

Level 3: Off-the-shelf memory engines you mostly just run: cognee, an Extract-Cognify-Load pipeline of typed nodes that are graph + vector at once; or managed services like Zep (temporal context graphs on the open-source Graphiti engine) and HydraDB (a graph DB on tiered object storage, pitched as “own your memory, no data leaving your stack”). Lowest effort, least control over the data model, and a vendor dependency. Cognee can import and export memory across Mem0, Zep, and Graphiti if you need to switch.

Version 1 (from scratch): The full agentic GraphRAG pipeline from raw sources to MCP tools.

Now zoom into Level 1, the from-scratch path. Building the memory yourself splits into 2 versions. Version 1 (more complex), such as Tree, the agent memory I am building, which is a GraphRAG system whose retrieval fuses vector and keyword search with knowledge-graph traversal. It contains a POLE+O ontology, the index (graph + vector + text), the data and memory pipelines that ingest into it, and the query algorithms that read it.

POLE+O is a popular data model for building ontologies, borrowed from law-enforcement and intelligence analysis. Its beauty is that it’s perfectly balanced. It’s not too shallow, nor too deep. Perfect for an LLM to extract KG triplets from your data. Here it is as a thin StrEnum of the 5 node types:

In case you want more granularity over your POLE+O model, you can introduce subclasses, such as:

Person: individual, alias, persona
Organization: company, nonprofit, government, ...
Location: address, city, region, country, ...
Event: meeting, transaction, communication, ...
Object: device, software, document, task, topic, project

To learn more, I have a full article on agentic GraphRAG and another one on designing an ontology for your context layer.

Version 2, the lighter one, is based on LLM wikis, such as Scrabble, my current agent memory. Markdown files with YAML frontmatter and cross-references, sitting directly on top of your existing infrastructure (Obsidian, Notion, Google Drive) — exactly how Scrabble runs over my own second brain. Google recently formalized this pattern as the Open Knowledge Format (OKF), a vendor-neutral directory of markdown + YAML whose only required field is type [3]. The filesystem is the state; no database required.

Which do you pick? It depends on how much control over the data model you want and your technical depth. Personally, I use both: Scrabble for everyday research across my second brain, and I plan to use Tree when I need precision at scale. Such as mining high-signal knowledge from 10,000+ documents.

Still, when it comes to implementing a unified memory layer on top of a knowledge graph, I keep getting asked one question: is a single database (MongoDB) enough? Why not throw in a specialized graph database like Neo4j or a vector database like Pinecone?

Is MongoDB Enough?

The simplest system that works wins. Instead of 3 databases (a SQL/NoSQL store, a vector DB, and a graph DB), you want one store that does all three. The payoff is concrete: far less operational overhead (one production DB, not three), less deep expertise required (being a power user of one engine beats being mediocre at three), and the ability to join documents, vectors, and graph in a single query. That last point matters most for memory: a single-store join is faster, cheaper, and lower-latency.

MongoDB is a good example, because it’s schema-less, supports all indexing operations, and has a huge ecosystem around it, with both an open-source version and their self-managed Atlas version. Another option I’ve tried that works well, but with more developer friction, is Postgres.

Also, as the cherry on top, you get lineage for free. Which is essential for adding references, understanding where each node came from, or simply preparing for an audit. You keep references on each knowledge-graph node instead of copying all data onto it: the document(s) a node was extracted from, the user who owns it, other metadata. That gives you easy lineage and versioning, and lets your memory connect to the rest of your system with no cross-database sync tax.

Now what about scale? For personal assistants, where you have only 100-10,000 documents, a single MongoDB cluster is more than enough. But even for medium to big enterprises, you can easily scale to millions of documents by adding more clusters (aka horizontal scaling).

After a conversation with a principal MongoDB engineer I understood that the real bottleneck is RAM. RAM is the most expensive component from your database cluster and you want to keep it as low as possible.

So index only what you query. Your memory holds 2 snapshots of the same knowledge: an append-only log (the immutable record of everything ingested) and a materialized view (the queryable graph rebuilt from that log). Vector indexes are inverted indexes ≥ the data they cover, so indexing both blows ~10 GB of data up toward ~40 GB of RAM.

Keep the log on disk (no vector index) and index only the materialized view, and it collapses back toward ~10 GB with the same data, at a quarter of the RAM.

Vector-indexing both snapshots inflates ~10 GB of data toward ~40 GB of RAM; index only the materialized view and it collapses back toward ~10 GB.

When does it make sense to use a specialized graph DB like Neo4j? The reality is that for most use cases, you will do only 2-3 hops traversals. Such as getting all the preferences of a user, and the documents/conversations they were extracted from. In these cases, a single database that does it all is amazing. But there are cases when you should reach for a specialized graph DB like Neo4j. For example, when you need to do 3+ hop traversals, your whole business logic relies on graphs or simply as an internal visualization tool. You could sync your MongoDB production database to Neo4j, just for exploration reasons, as their Cypher engine + visualization ecosystem is stronger.

Using Your Context Layer With Any Agent

The wiki version is the simplest to switch between agents. If your context layer is shipped as an LLM wiki, switching harnesses is just handing the new harness a path. For this article, I literally pointed the harness at the research wiki folder from my Second Brain. Without any fancy skills or plugins in place.

Or a step further, is to point it at my entire second brain — that’s Scrabble — whose AGENTS.md explains how to navigate it and which CLI tools and skills to use. It works out of the box because it’s just files.

The MCP version is a bit more work, but not much. Serving memory over an MCP server means re-pointing one config entry from one harness to the next — switch from Claude Code to Codex by configuring the path to your MCP server, and your whole memory moves with almost zero friction, regardless of harness or model.

Each MCP server ships its own tools (read tools and write tools), and the tool descriptions are the contract the agent reads. It gets especially interesting on the write side, where the tools decide when to persist (e.g. parsing a whole conversation into memory when it ends).

The MCP server serves the same memory, skills, and UI to any harness and any interface.

Skills are where the leverage compounds. On top of this memory, you can write deep-research, writing, or coding-agent skills that remember your preferences. A few of mine, all part of Scrabble:

/research, which runs deep research on a topic and produces a localized wiki (I have an open-source version in my latest ai-research-os-workshop talk made for the AI Engineer World’s Fair SF - full video);
/article-guideline, which expands a brain dump into an article plan anchored in a wiki;
/squid:plan, part of my Squid software factory, which plans a feature from my spec, codebase, and memory — injecting my personal takes on how I write software.

What’s Next

Agent memory is an unsolved, fast-moving topic. These designs work, but they’re far from perfect, and your data is still segregated across Obsidian, Notion, local files, and messages in ways that are hard to unify. In practice, I lean on the lighter wiki-based approach. One per project does the job. I reserve Tree’s heavy knowledge-graph memory for the few high-precision cases that earn its cost.

But here is what I’m wondering:

Your data can move between harnesses, but can your skills? I have this issue with Claude Code’s workflows and agents embedded into my skills.

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Special thanks to MongoDB for sponsoring this article and keeping it free!

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Explore Next

LangChain. (n.d.). Open Models, Open Runtime, Open Harness — Building Your Own AI Agent With LangChain and Nvidia. YouTube. https://youtube.com/watch?v=BEYEWw1Mkmw
Soria Parra, D. (n.d.). The Future of MCP. YouTube. https://www.youtube.com/watch?v=v3Fr2JR47KA
McVeety, S., & Hormati, A. (n.d.). Introducing the Open Knowledge Format. Google Cloud. https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing
getzep. (n.d.). Graphiti. GitHub. https://github.com/getzep/graphiti
Neo4j Labs. (n.d.). agent-memory. GitHub. https://github.com/neo4j-labs/agent-memory
mem0ai. (n.d.). mem0. GitHub. https://github.com/mem0ai/mem0
topoteretes. (n.d.). cognee. GitHub. https://github.com/topoteretes/cognee
getzep. (n.d.). Zep. https://www.getzep.com/
HydraDB. (n.d.). HydraDB. https://hydradb.com/
Iusztin, P. (n.d.). ai-research-os-workshop. GitHub. https://github.com/iusztinpaul/ai-research-os-workshop
Iusztin, P. (n.d.). squid. GitHub. https://github.com/iusztinpaul/squid

Images

If not otherwise stated, all images are created by the author.

How Evaluation-Driven Development (EDD) Works

Paul Iusztin — Tue, 23 Jun 2026 08:57:02 GMT

The scariest AI failures are the silent ones.

You ship a change to your agent on a branch — a new feature, a prompt fix, a quick refactor. No errors. No complaints. Everything looks fine. But does it still work, or did you quietly break something that worked yesterday?

As Alejandro Aboy puts it: “the fact that they’re not complaining doesn’t mean there’s no issue going on.” A quiet user is not a happy user. Usually, it’s the opposite.

The more Alejandro and I talked about AI evals and EDD, the more his struggles sounded like mine. A story from builders to builder.

1. You can break what already worked. Change a prompt, refactor a tool, and an old feature quietly regresses. Alejandro lived it: cleaning noisy instructions out of his agent’s system prompt made it start fabricating IDs it used to get right. You only catch that by running the same tests before and after the change, and comparing.

2. The feature is brand new, so you have nothing to test it on. No dataset, no historical traces, no ground truth. Yet you still need to know whether it works, and how well. So how do you generate realistic test data fast, then feed it to evaluators that turn it into hard performance numbers?

This is the case study that gives you a plan of attack for both: Evaluation-Driven Development (EDD)

How to prove a new feature works, and didn’t regress, before you merge. It comes from a recent conversation with , a senior data and AI engineer at Workpath who owns the entire data stack, built the Workpath AI Companion, and writes The Pipe and The Line Substack.

And we won’t keep it abstract. Every example comes from one real product: Workpath, a strategy-execution SaaS that keeps large companies’ OKRs and initiatives aligned. (OKRs — Objectives and Key Results — are the goal-setting framework teams use to name what they want to achieve and the measurable results that prove they’re getting there.) Its AI-native feature is the Workpath AI Companion: an agent that scans a company’s strategy and OKR data end-to-end to keep enterprise teams aligned. It’s the exact system Alejandro runs EDD on every day.

So when we say developing a new feature, picture shipping a change to that Companion and proving, before you merge, that it works and didn’t regress.

The moving parts

The Develop-a-Feature Workflow

Imagine. You start working on a new feature, branch, and develop the change. But before you merge, you have to answer 2 questions:

What’s the performance of my new feature?
Did my change introduce any regressions into the existing codebase?

Only when both look good you accept the pull request. EDD helps you answer those 2 questions.

EDD is the offline validation gate between developing a change and merging it

Every feature is hypothesis-first. As Alejandro frames it, “I have a hypothesis... and everything should lie around that.” Every change starts as a stated hypothesis on a branch.

Based on that hypothesis, EDD runs a simulation and scores the results to answer the 2 questions.

Every feature ends in a PR, backed by an experiment, with clear traces and metrics. Framing the eval results as an experiment allows you to compare current results to previous ones, detecting regressions or tracking improvements.

This is how you can compare two experiments in Opik:

Comparing two experiments in Opik

What about the process that happens between starting a new feature and its experiment?

From an architectural perspective, we have:

The AI application, which can be an AI agent, workflow or a simple chatbot. In Alejandro’s case, it’s an AI agent built with Agno. More precisely, it’s the Workpath AI Companion he is building. But due to data privacy reasons, during the demo, he could share only a mock of the data.
A headless evaluation harness, powered by Claude Code.
An AI observability and evaluation platform responsible for capturing traces, managing eval datasets and evaluators, running experiments and comparing results. Alejandro is using Opik. The tool is open-source, but for ease of use, you can also try out their managed platform for free here for 25k spans/month.

Now... how do we generate data for these experiments? How do we get the traces? How do we populate the evaluation harness with the right context? We will see how all of that falls into place, where everything starts with two modes.

Two Modes: Manual Quick Check vs. Automated Experiments

The two modes are modeled by the /edd skill, which has two inputs: Mode and Aggression.

Mode 1 is a quick, manual check. You fire around 30 fresh traces, let Claude Code read them back from Opik one by one, and trigger a judge by hand only if you want a score. As Alejandro describes it, “it won’t trigger automatic evaluations; you trigger them manually.” No dataset, no experiment, ephemeral, minutes. His favorite for a small change: his Substack Author Agent kept over-asking for the publication URL on every new trace. A tiny, targeted fix, exactly what Mode 1 is for.

Mode 2 automates the judgment. When you touch a lot or ship new functionality, you turn the traces into a dataset and run an experiment, both Opik objects, where the judges score every item automatically and produce an experiment you can compare across runs. This is the only way to catch a subtle regression, because you compare 2 experiments.

Both modes start from a hypothesis on a branch, emit fresh simulated traces, and can use the same evaluators. The mode only changes whether the evaluation is done by hand or automatically.

The Aggression setting controls how adversarial the simulated traces get, from happy-path up to fully adversarial. As you turn up the knob, simulated traces get more aggressive, finding harder and harder corner cases to break the agent.

The skill’s decision flow: A small change takes the quick Mode 1 path, while new functionality takes the Mode 2 dataset-and-experiment path.

The secret sauce of Alejandro’s EDD approach is in how he uses Claude Code to simulate fresh traces.

Scope the Change and Simulate Its Traces

Remember. Our goal is to simulate relevant traces to test the performance of our feature. To do that, we use Claude Code to read the agent’s source code, especially the code around the new feature. After, we retrieve old traces (stored in Opik) that are relevant to our current code.

Based on these two signals, we generate ~30 traces targeting the new feature’s functionality.

The evals can only see what the trace carries. So the trace has to carry the whole harness, not just the answer.

The traces need to be high signal and as diverse as possible. The goal is to find holes within our system and fix them, not to validate what currently works.

To achieve that, the traces are generated based on 2 dimensions:

Regression evals (what worked still works) vs. capability evals (can do it on new things
Happy path (easy: testing the core logic) vs. adversarial (hard: finding edge cases, such as missing data, faulty tool descriptions or guardrails)

During generation, we can configure these parameters. For example, if we go full adversarial, the probability of finding errors increases. Which isn't necessarily a good thing, as you don’t want to overoptimize in advance either. You want to make the system as good as possible on the hot path. You don’t want to waste time on scenarios that might never happen. That’s why anchoring your trace generation to existing traces is an essential step for properly understanding the user’s behavior and which components to target when generating the traces.

⚠️ Important! Even if we simulate the data, we still want REAL traces and outputs to evaluate on.

This is what we have to do. The pipeline starts from the data, not from invented inputs. Claude Code analyzes the current traces to learn what inputs are worth generating, so we simulate only the inputs, NOT the outputs or internal state.

That’s the whole point. Synthesize the outputs too and you hit Alejandro’s problem: “every time I try synthetic datasets, I was losing everything the agent was doing beyond the response.” Grade the final answer alone and a wrong tool call stays invisible.

To get there, we send each simulated input to a headless copy of the agent, which runs for real: selecting tools, calling the staging backend, handling whatever comes back. As it runs, Agno records the full tool-call history and outputs into its OpenTelemetry trace, and the agent emits it to Opik.

Using this strategy, we simulate the inputs, run the agent, and record the trace with real values produced by the agent.

A single simulated trace in Opik

A simulated trace is only as trustworthy as the state the agent was in when it ran, and recreating that state is the hardest part.

Context Population: Mocking Production State

The hardest part of agentic evals is getting the agent into the right state, so it passes or fails for reasons that actually relate to your hypothesis. A trace generated from the wrong state is a useless trace.

In Alejandro’s use case, roughly 90% of the agent’s tools are API calls, so Claude Code gets a token and hits the real internal backend through a staging mocked account that already holds data. For the happy path, it pulls real goals, OKRs, and teams. To go adversarial, it forces errors and asks for data that doesn’t exist.

The reusable trick is where the context gets injected. Before the agent boots, Claude Code calls the API and injects the user’s context into dedicated system-prompt sections. So the agent greets you with “Hi Paul, want to check goals from the coding AI team?” It runs “as if for real.”

Alejandro is explicit that this is not pytest-style fixtures, but “the prompt is the only thing the LLM sees.” A faithful prompt-level state is a faithful enough production proxy. You mock at the system-prompt layer and stop worrying about reproducing the whole backend.

So instead of using the standard way of using fixtures to populate the backend, you can bypass everything and directly inject the context into the system prompt. From the LLM’s perspective, it’s the same thing.

Mock the state at the system-prompt level, hit a real staging backend, and the agent runs as if in production.

The last step is to transform the traces into an evals dataset.

On-Demand Datasets

You want two types of eval datasets:

A persistent, hand-built evaluation set that tests the core business logic. Useful for catching regressions.
An on-demand, synthetic dataset used to evaluate the feature you are working on.

We are interested here in the second option.

Via the /edd skill Claude Code assembles an Opik dataset, on the fly, from the branch-tagged simulated traces. The dataset is tagged so a later run can filter straight to it, then kept or thrown away.

Before committing to a big sample, Alejandro fires a couple of runs as smoke checks, “to catch anything awful” before spending tokens. Then he checks that the dataset’s coverage is good enough to be worth running an experiment against. Small, cheap, and it saves the expensive mistake.

Because the dataset is cheap to regenerate and scoped to one change, it’s disposable. Optionally, you might promote a couple of high signal traces into the persistent regression set.

A dataset is only useful once you’ve decided what metrics to use — aka the judges.

Define the Judge

You want to support 2 evaluator types.

Code metrics score the structural things deterministically, server-side, free, no LLM, like whether it called the tool or whether the format is right. Always try to evaluate a given metric via code metrics if possible.

LLM judges score the subjective things, like completeness, accuracy, and ranking quality.

Both evaluators are designed as binary classifiers: verified, or not. The urge to introduce 1-5 likert scales is huge. But the thing is that it’s incredibly difficult to get it right. What’s the difference between 2 and 3 or 4 and 5? Even when using multiple human annotators, the labels are inconsistent. With binary labels, the decision is clear: it’s correct or not. Which makes it incredibly easy for the LLM to get it right.

To get some nuance, on top of the binary labels, you want to add a critique that explains in 2-3 sentences why the output is correct or not.

The judge model is deliberately a different model than the agent, so the two don’t share blind spots and the judge can’t rubber-stamp its own family’s mistakes.

But here is the trick! The evaluators are static, carefully defined and calibrated up front from the codebase. The loop regenerates traces and datasets, not the metrics. When implementing LLM judges, it’s extremely important to align them with the domain expert. Once they are working well, you can use them for inference, which we are doing here on the dynamic datasets.

Opik’s Insights view turns each judge into a per-dimension score profile for the run.

The judges are configured within Opik, using their API to call the model to score each sample from the dataset.

In the image below, you can see the evaluators Alejandro configured for each experiment:

Content Completeness
Metric Accuracy
Ranking Quality
Relative Grounding
Response Directness
Semantic Search Accuracy
Skill Selection

The evaluator suite Alejandro runs on every experiment. One judge per dimension, scoring the whole dataset.

With the dataset built and the judges defined, you run the experiment. And you run it twice.

Run and Compare Experiments

The experiment runs in Opik, scoring the dataset against all the judges to produce a score distribution.

In a feature Alejandro was working on, he cleaned all the noisy instructions out of his agent’s system prompt, “hygiene before” vs. “hygiene after,” and ran 2 experiments over the same scope in Opik’s comparison view.

Run the same scope twice and the regression you introduced shows up as one short bar.

The after had regressed on one judge: tool-call parameter inference. The agent should remember which ID to pass to a tool, but the cleanup made it “get lost and fabricate IDs.” EDD caught his own change before it shipped.

Comparison matters because failure hides where a single trace can’t show it. Trace-level evals are usually fine, but problems surface across 5, 10, or 20-message conversations. Around the 10th message, the model slides into a “context rot zone”: a request that earlier earned a cooperative “let’s work with this” now gets “what do you mean by that?”

A user asks the agent to “scan 50 teams, get me all the OKRs.” It pushes back and offers to go progressively, returning 17 copy-pasteable tables. But by trace 21 of a 20-message conversation, you’re at 200k total tokens, paying heavily without caching.

The same scope, before vs after, overlaid in Opik. One regressed dimension can’t hide across the judges.

These are the kinds of errors proper evals protect you from! Not only performance, but also latency and cost issues that can blow up your infrastructure overnight.

Don’t Run Online Evals

Everything so far ran offline, on a branch, before the merge. The expensive default everyone reaches for instead is always-on online evals. That’s the trap.

Alejandro made the same mistake.

He thought running online evaluations on all the traces was essential for production. Until the bill! It was on credits, not cash, but it would have been around $2k a month just from triggering a few evaluations: “the bill just pops in — in one second you have thousands of dollars in debt.”

So you recalibrate. What’s the actual risk of evaluating whether the agent leaked an ID to the user? Low. So you sample or look for a pattern, run the heavy judges offline, and cap spend first: “consume an amount you know you can afford.”

The good news? The whole /edd skill and headless harness implemented by Alejandro via Opik is now an installable open-source Claude Code plugin. You can also create a free account on Opik with 25k spans/month to try out this EDD strategy on your own project.

The offline eval dataset

🎥 Watch the full conversation between Alejandro Aboy and me

Final Thoughts

You’re already driving one AI process with another. Would you hand the whole thing over to an agent that reads the traces, understands the agent’s mistakes, gets the signal from the evaluators and writes the code changes itself?
— Paul

Alejandro would want exactly that, on one condition. He’s tried prompt-only optimizers and doesn’t trust them, because they change the prompt but never test the agent’s full harness. Until then, the human stays in the loop, and every change earns its pull request.

Opik has been shipping exactly that: Test Suites, Agent Configuration and Playground. Where does your hand-rolled loop go from here?
— Paul

Because they tackle the prompt-only-optimization gap, Alejandro is bullish on adopting them — he expects they’ll close the whole agentic loop: analyze the code & failures → generate inputs → call the agent → build a dataset → evaluate each sample → fix the code → repeat, all in one place.

Try out Alejandro’s EDD code on GitHub, while leveraging Opik’s free tier as the observability platform.

But here is what I’m wondering:

Where do you draw the line between online and offline evals? What do you actually run always-on in production, and how do you cap the spending?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Thanks again to Opik for sponsoring this case study and keeping it free!

Try Opik for free here (25k spans/month free)

If you want to monitor, evaluate and optimize your AI workflows and agents:

Try Opik for free

Images

If not otherwise stated, all images are created by the author.

Build, Configure, or Use As-Is: The Agentic Harness

Paul Iusztin — Tue, 09 Jun 2026 05:00:28 GMT

and I were planning our upcoming book when we kept hitting the same realization. Just as LLMs got commoditized, the harness around them is commoditizing too, hardening into a handful of standardized, batteries-included frameworks. Once it’s a commodity, the hard question flips. It stops being “how do I get an agent running” and becomes “for each piece, do I build, configure, or just use it?” And that line is blurry.

Overbuild, and you burn weeks reimplementing a tool loop, a permission system, and a sandbox that is already available for free. Under-build, and you lean on the defaults forever, never building the one layer that’s actually yours, your context layer, your moat, so you stay a renter of someone else’s system.

The harnesses fall along a spectrum, from tool-like ones you customize as a user (Claude Code, Codex, OpenCode) through framework-type ones you build with, like Pydantic AI in Python or pi in TypeScript.

This article hands you that system design: the ~80% blueprint that’s conceptually the same across Claude Code, OpenCode, Codex, and pi, walked component by component, with one conclusion per piece.

We start with the big-picture architecture, the shape almost every harness shares, then walk it component by component: the tools the model calls, the catalog of agents, how subagents spawn and stay contained, how skills load cheaply, where memory really lives, how sandboxes both protect and scale you, and the permission layer with almost no AI in it. Each one closes on a verdict that fills in the map.

Build the Layer That’s Actually Yours (Product)

This article shows the system design of a harness, the commoditized part. The real value is the business layer you build on top of it. That’s what my Agent AI Engineering course teaches, built with Towards AI: only what you need to deliver value, not how to rebuild the harness.

35 lessons. 3 end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.

Built for software, data engineers or scientists transitioning into AI engineering.

Rated 5/5 by 300+ students. The first 7 lessons are free:

Start here

The 80% Every Harness Shares

Roughly 80% of a harness’s blueprint is identical regardless of the tool or framework, because every harness solves the same problems with different techniques. That shared 80% is exactly the part you mostly use as-is. The decisions about what to configure or build live in the remaining slice. At the highest level: a user message goes in, an answer comes out, and everything between is the harness.

In between, here are the 5 core layers:

The Agent is the innermost piece: the agentic loop (the ReAct loop, where the model reasons then acts) wrapping an LLM plus its tools. The LLM can be closed (Gemini, Anthropic, OpenAI) or open-source served over the OpenAI protocol on Modal, RunPod, GCP, or TogetherAI, or run locally with Ollama or llama.cpp. This loop is the core: strip away compaction, task budgets, and thinking and it’s roughly 150 lines.

The Harness is everything wrapped around the agent: a message queue with a priority gate, the sandbox, hooks, services (LLM gateway, memory, LSP servers, MCP client), skills, the permission system, an agents catalog, and subagents, with context engineering sprinkled everywhere.

The Runtime is the durable execution layer the whole harness runs inside: Prefect, Temporal, Kitaru. It gives you non-blocking human-in-the-loop, scheduling, durability and caching, and a credentials proxy.

The Presentation layer is how you talk to the harness, whether TUI, web, mobile, WhatsApp, or Telegram. The interesting question is how the same agent serves many front-ends, and there are 2 real patterns. One is a pub/sub bus, OpenCode style, where a headless server streams events to TUI, web, and desktop clients over HTTP+SSE so many clients observe 1 live session. The other is custom services bridging into 1 in-process loop, the Claude Code style.

The Observability layer is tracing, logging, metrics and evals sitting across everything, with tools like Opik, Langfuse, or Braintrust.

The layered anatomy of an agentic harness: The agentic loop at the core, wrapped by harness services, all running inside a durable runtime, with presentation and observability spanning the stack.

To see how the components connect, let’s follow a single message down the happy path: user → TUI → message queue → wait for the agent to be free → agent → LLM → tool → LLM → tool → … → LLM → answer → TUI → user.

Meanwhile, The TUI sends and receives over SSE. The priority gate decides when to inject a new message between loops rather than interrupting mid-loop. When the context window nears its limit (tokens ≥ contextWindow − reserve), compaction runs, keeping the window as [system prompt] + [summary] + [recent tail]. And a single tool call can fan out into hooks, sandboxes, services, and permission gates.

A user message buffered by the priority gate, run through the agentic loop’s stream→check→tool→append→recurse cycle, then streamed back to the TUI as the answer.

This skeleton, the loop, the queue, the runtime wiring, the message journey, is the commoditized 80%. As there is a ton going on on top of the basic agentic loop, let’s explore all the core components of the harness to get an intuition on what can be customized or built on top of it.

The Tools

The set of tools the LLM can call inside the agentic loop is the most visible part of a harness. Everything the model can invoke conforms to a single shape, a name, an input schema, and an execute method behind a flat registry.

Ground that in what a real harness ships. Claude Code organizes ~40 built-ins into 10 families, and these are what you get for free:

File I/O: FileRead, FileWrite, FileEdit, Glob, Grep. The model’s hands on your files. Read one, write a new one, edit in place, find files by name pattern with Glob, and search their contents with Grep.
Execution: Bash. A single tool to run shell commands. The most powerful tool available that allows an agent to run shell, Python, TypeScript or in general interact with your machine.
Orchestration: EnterPlanMode / ExitPlanMode, Sleep, Agent (spawn a subagent), EnterWorktree / ExitWorktree. These shape the work itself. Plan mode gates edits behind a read-only planning pass, Sleep pauses the loop, Agent spawns a subagent, and the worktree pair carves out an isolated branch to edit without touching the main tree.
Tasks: TaskCreate / Update / Get / List / Output / Stop. A task state machine that lets the agent track a to-do list and run long jobs that outlive a single turn.
Web: WebSearch, WebFetch. The window outward. Search the web, then pull a specific URL’s contents into context.
MCP: an MCP tool factory + ToolSearch, ListMcpResources, ReadMcpResource, MCPAuth. This is how external tool servers plug in. The factory mints 1 tool per each tool from the MCP server to flatten out the tool discovery logic into a single tool set (e.g., /mcp__brown__edit_content_prompt). ToolSearch surfaces the right one when hundreds are attached, and the rest list resources, read them, and handle auth.
Scheduling and misc: ScheduleCron, RemoteTrigger, Skill (a dispatcher, 1 tool with N skills by argument), LSP, AskUser. The odds and ends. Schedule a run on a cron, trigger one remotely, dispatch a skill by argument, query a language server with LSP, and hand a question back to the human with AskUser.

The built-in tool families are the commoditized surface. The customizer’s job is to configure which tools each agent may call. The architect’s is to build new domain tools as MCP servers plugged into the same registry. That’s where your product’s actual capabilities live.

Tools are what the model can do. The agents’ catalog is who does it.

The Agent Catalog Is Just a Config File

Each harness ships a set of predefined agents, and the version worth copying defines them as config, not code, because that makes them discoverable and pluggable without touching the loop. The format varies: a markdown file with YAML frontmatter, where the body is the prompt, in pi and Claude Code, or plain YAML or JSON in OpenCode. I reach for YAML, but the point is the same. The core fields are small: name, mode, model, tools, disallowedTools, and permission.

When you open Claude Code, you chat directly with the primary agent, the primary process. The trickiest part to understand is that both the primary and subagents can wear multiple hats. The agent catalog distcribes these hats, these modes the agent can take on.

A catalog worth copying looks like this, synthesized across the references (the mode: primary | subagent | all axis is OpenCode’s, while Plan, Explore, and General Purpose ship in Claude Code):

Build (mode: primary), the default agent.
Plan (mode: primary), read-only.
General Purpose (mode: subagent), the fallback when no specific agent fits.
Explore (mode: subagent), read-only search and locate, running on a cheap model.
Code Reviewer (mode: subagent), read-only and git/diff-aware.

The code-reviewer subagent (in Claude Code) tools allowlist grants FileRead, Grep, Glob, and Bash(git *), while its disallowedTools denylist blocks FileEdit, FileWrite, and Bash(rm *). So it can read the tree and run git, but it can never edit a file or shell out to delete one. That dual allowlist/denylist, with rule syntax like Bash(git *). The safety trick is that the scope is narrowing only. OpenCode enforces it by deriving a child’s permissions from its parent’s, so a delegated agent can never out-permission the one that spawned it.

Use the bundled agents as-is for everyday work, then author your own as YAML or markdown files. You rarely need to build a custom agent. That usually happens when you build your custom application. For example, I did it only for my deep research and writing skills, which required a ton of customization. Ultimately, ending up as completely apps covered as a skill.

To get the full picture, let’s understand how subagents work.

A Subagent Is a New Loop

A main orchestrator spawns a subagent through the Agent tool; the subagent runs its own loop, and only a compressed summary of its output is re-injected into the parent.

Most harnesses support subagents, though some, like pi, do it via plugins instead of natively. The hard part is keeping orchestrator and child communicating without the child’s full context polluting the parent loop and ensuring the orchestrator, “orchestrates” the subagents as expected. Remember that the orchestrator is an agent, not a workflow encoded in code, which means it can easily go off track and forget a step.

In Claude Code a subagent is not new code. It’s the same loop re-entered with a cloned context and a restricted tool list, and only a condensed summary flows back. A periodic ~30-second summarizer fork produces a live progress label plus a bounded final summary. The lesson to steal: a subagent is your existing loop, narrowed, with a summary on the return path. Only recently they started introducing subagents as new processes.

Harnesses almost never support swarm architectures where every agent talks to every other. They support a master–slave orchestrator topology where one main agent tracks the children.

Parent and subagent talk over a channel that sits outside the isolation boundary.

Spawning is half the problem. The parent and child still have to talk, and there are three channels, ordered by how far apart the two run.

Cheapest is in-process: the child is a nested call, so its output is just a return value handed back to the caller. A queue sits one step out. The parent drops work on a message queue, the child consumes it, and the parent awaits a result event. Because the queue is a shared bus, other clients can watch the same exchange live, the way OpenCode streams a subagent’s events to many observers. Most decoupled are shared JSON files: a lock-serialized mailbox, one file per recipient, that agents in separate processes or worktrees write and poll. pi’s one-way subprocess, streaming JSON lines back to the parent, is the same idea narrowed to a pipe.

For most builders this is use-as-is, lightly configured. The spawn mechanism and orchestrator topology come standard, and all you configure is each subagent’s tool and permission scope and which agent it spawns. You only build when you need exotic isolation, like pi’s out-of-process model for untrusted children. That’s a rare need.

Now let’s see how skills fit into the picture.

Skills

A skill is one of the simplest implementations to understand yet one of the highest-impact things in the whole harness. It’s essentially a markdown recipe, instructions plus an allowed-tool set, that the model pulls in on demand.

Skills from three sources (bundled, user-defined, MCP prompts) are merged, capped at ~1% of the context window, assembled into a skills context, and injected as a system reminder.

Concretely, skills come from 3 sources merged together: bundled skills shipped with the harness (e.g. src/skills/bundled), defined skills dropped into .agents/skills, and MCP server prompts. The pipeline is short. A GetSkills step collects all 3 sources, caps the total at ~1% of the context window, assembles a single skills context, and wraps it as a .

That 1% cap is the whole trick, and it works because of progressive disclosure. Skills are surfaced by name and description only, so the agent sees a cheap menu of capabilities and reads a skill’s body on demand, which is why the always-loaded skills context can be hard-capped at ~1% and still scale to dozens of skills. pi takes the same spirit further, surfacing its skills via prompt injection rather than as tools.

This is pure configure, or really authoring, and it’s the single best return on effort for the user and customizer tiers. Writing a markdown skill is the cheapest way to teach the harness a new workflow, and the 1% cap means you can pile on dozens.

Skills, tools, and subagents all hang off the loop, and they’re mostly things you configure. Memory is different. It’s the one component where you actually build your own layer.

Memory Is the Layer You Actually Build

In most harnesses, out-of-the-box memory is loaded directly into context, not via a tool. The model never calls a tool to “remember.” Relevant memories are read off disk and prepended to the system prompt before the turn runs, and new memories are extracted after the turn by a separate process.

A file-backed design, Claude Code-style, is worth grounding concretely. The store splits into 2 kinds of files. User-defined .md files come first: AGENTS.md is always loaded, and **/AGENTS.md is loaded dynamically per directory, on demand. LLM-extracted .md files come second: MEMORY.md is an always-loaded index, hard-capped at ~200 lines / 25 KB, while logs/YYYY-MM-DD.md is an append-only daily log where only the relevant logs are loaded. A small-model side-query ranks topic files from the log by their frontmatter description, not embeddings, and picks the top few to inject, which is debuggable and needs no vector store.

By default a forked extractor updates MEMORY.md live after each turn. A daily-log variant runs a nightly /dream distillation instead: a small LLM extracts the conversation into logs/YYYY-MM-DD.md, then a second distills those logs into MEMORY.md. In other words the pipeline looks like this: raw conversation → daily logs → durable memory.

Three out-of-the-box memory designs: file-backed, SQLite-backed, and an append-only session tree. Plus the custom MCP-server memory layer that sits above all of them.

Most harnesses use a file-based system for memory. Which is good enough for uses cases such as coding. Other tools, like Cursor or OpenClaw, build a vector index over your memory instead. That’s why many people report better memory from OpenClaw. As instead of parsing your whole memory as append only logs or forgetting context when building the MEMORY.md index, OpenClaw builds a vector index over your memory.

Here’s the heart of the build/configure/use thread, though. The defaults get you started and AGENTS.md is worth configuring, but the highest-leverage move is a custom memory layer behind an MCP server, a database exposed through an MCP server with your own read/write logic. Because it’s harness-independent, you jump from Claude Code to Cursor to anything and the agent instantly picks up who you are.

Real independence means owning your own context layer.

This is the one place to build, and it pays off for every tier that’s serious. The context layer behind an MCP server is the moat. It’s harness-portable, fully yours, and the thing that makes the assistant your assistant.

Owning your context is about what the agent knows. The next layer is about where its code runs, sandboxing, which protects you and, surprisingly, lets 1 harness scale to many jobs.

The Sandbox: One Jail, Many Remote Workers

The obvious reason for a sandbox comes first: it keeps the agent in a controlled environment with no direct access to your machine. Establish the key separation early.

When the model issues a Bash command, the harness decides where it runs: remotely on Modal, locally in a sandbox (Docker/Firecracker), or directly on the host

Sandboxing lives at the Bash and PowerShell tool layer, not the UI. When the model issues a Bash tool call, a decision runs about where it executes. If remote, the command runs in a sandbox such as Modal. If local, the harness asks whether to use a sandbox at all: yes means it runs inside a local sandbox (Docker, Firecracker, …). No means it runs directly on your machine.

The enforcement detail worth stealing, the way Claude Code does it, is that the jail is derived from the same permission rules the agent already uses, and it always denies writes to its own settings file.

On top of security sandboxes can change how we define software architecture. Reframe sandboxes as workers from classic distributed systems: each sandbox is a worker that runs jobs in parallel, and 1 harness can manage and scale many of them. So the same harness that protects you locally can fan out dozens of remote jobs. Depending on your sandbox type, you can run data ingestion jobs or even training jobs if the VM has a GPU. Everything from your harness. Codex is a harness that is all in on remote sandboxing.

Now, let’s wrap up the article with the most important component: the permission layer.

The Permission Layer Has Almost No AI in It

The permission system is the hardest part to reason about, and the strange thing is it has essentially no AI in it, yet it’s what makes the whole system safe to run. Its job is narrow: for every tool call, decide to (a) run it, (b) ask the user, or (c) deny it.

For every tool call, the harness resolves a decision: allow it, ask the user, or deny it

The structure has 2 flavors. Agent modes change default behavior: default, acceptEdits, bypassPermissions, and plan. User-defined rules live in config, in .agents/settings.json and .agents/settings.local.json, where you declare what the agent can and cannot run, including wildcard rules like Bash(git *). The harness combines mode metadata and user rules at runtime to resolve each call.

The “Can use tool?” question has 3 outcomes.

Allow calls the tool. Ask surfaces it to the user, and on allow it calls the tool, while on deny it synthesizes a denial tool-result and continues. Deny synthesizes the denial directly.

When deciding what to do, the harness runs tool filter → user settings → mode.

Here’s the counterintuitive payoff. “Bypass everything” is not total. Plan mode is enforced prompt-side, via a system reminder telling the model to only edit the plan file.

Which shows how fragile these mechanisms still are, as we just hope for the best that the model will pick up the instruction.

You almost never build this. But it’s incredebly important to properly configure it. It’s probably the most important part to configure right to ensure it has just enough access to your data and machine.

What’s Next

These are just the core components that almost any agentic harness needs and has.

But there is more to it.

Worktrees for parallel isolated edits, multiprocessing subagents for true parallelism, and a plugin system for extending the harness without forking it. Which I will address in future articles.

But here is what I’m wondering:

Which component did you decide to build rather than configure? Was owning it worth it, or did you reinvent something the harness already had?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Images

If not otherwise stated, all images are created by the author.

How to Keep Your AI Agent's Knowledge Graph Clean

Paul Iusztin — Tue, 02 Jun 2026 05:00:53 GMT

Two months ago, I started building unified memory layers on top of knowledge graphs. One question kept coming back from readers. How do you handle entity resolution and deduplication without corrupting the graph?

Rather than guessing, I spent serious time studying how mem0, cognee, and Neo4j actually solve it. The recurring question exposes a confusion almost everyone shares. People treat entity resolution and deduplication as the same step.

That confusion is exactly what corrupts graphs. People collapse naming and identity into 1 fuzzy check.

Also, if the merging step is not properly designed, 2 different real-world entities can silently merge, corrupting your graph.

Resulting in losing the trust in your graph that made it worth building. The graph quietly rots. Nobody trusts it, and the entire memory layer you invested in goes unused.

The failure is invisible until it becomes expensive to undo. The fix is to separate naming from identity.

We will walk through the end-to-end pipeline. This includes LLM extraction, entity resolution for naming, embedding the full node and deduplication for identity. Plus, the 2 safety nets most tutorials skip.

We covered the full memory-system design and the ontology design in prior articles. This piece focuses only on keeping the graph clean. By the end, you will be able to design a graph that stays clean and usable as it grows.

Why We Killed RAG in Production (Product)

This article shows how to keep a graph memory layer clean. In a recent podcast, I covered the decision that comes before it: whether you need retrieval at all.

I explain why we killed RAG for a financial advisor product. All of an advisor’s data summed to 64,000 tokens, so loading the full context beat RAG’s zigzag retrieval loop. The formula I use: your data-to-context-window ratio.

We also get into regretting MCP everywhere, treating vibe-coded output as a compilation step, and why AI evals become the real job once the model writes the code.

Watch the episode

One Pipeline, Five Steps

In goes a document or a conversation turn. Out comes a set of canonical, deduplicated nodes correctly wired into the existing graph. Everything between is about making sure each new node is named and identified right.

First, an LLM extractor reads the text and emits entities and relationships connected by (entity, relationship, entity) triplets. It anchors within the POLE+O, Facts, and Preferences ontology. This ensures it only extracts the entity types you actually care about, as we explained in depth in this article.

For example, a sentence about a person working at a company becomes a (Person)-[:WORKS_AT]->(Organization) triplet. The ontology told the extractor those are the types that matter.

If using only LLMs for extraction becomes too costly, you can use a cost-tiered cascade here, starting with fast statistical models like spaCy for common entities, moving to zero-shot models like GLiNER for domain-specific types, and falling back to an LLM for complex cases.

Before touching the graph, we must decide what this new entity should be called. The system normalizes its name against existing nodes of the same type. This is the finding-the-canonical-name step, and no merges happen yet.

From raw documents to a clean graph node: extraction, resolution, embedding, deduplication, then the merge/flag/add decision.

Next, we compute an embedding over the entity’s full context. This includes its name, type, and attributes. We embed more than just its bare name. This is what later lets deduplication compare identity rather than spelling.

We compare the embedded node against existing nodes. This decides whether it is the same real-world entity as one already in the graph.

Based on the deduplication outcome, the system makes a final routing decision. It either merges into an existing node, flags the pair for human review, or adds a brand-new node.

A new mention of a company gets extracted as a typed entity. The resolution step normalizes it to a canonical name. Then, it gets embedded with its full context so we capture its semantic meaning. It is compared against existing same-type nodes to verify its identity. Finally, it gets added, merged, or flagged for review.

Resolution and deduplication are 2 distinct decisions doing 2 distinct jobs. Let’s zoom in on each one.

Entity Resolution: “What Should We Call This?”

During resolution we find the canonical name for each entity. It answers “what should we call this?”.

It handles typos, acronyms, and surface-form similarity. These are the noisy ways humans and documents write the same thing. It uses exact, fuzzy, and semantic matching in a short-circuit chain.

The short-circuit chain passes the entity to the next matcher only if no confident match is found. If exact match fails, it tries fuzzy match. If fuzzy match fails, it tries semantic match (using light embeddings only on the name).

But it matches only against the names of existing nodes of the same type. You never compare a PERSON name against an ORGANIZATION name.

“NYC” resolves to “New York City”. “JP Morgan” resolves to “JPMorgan Chase”. The 3 forms "John Smith ", "john smith", and "Jon Smith" all collapse to 1 canonical “John Smith”.

This happens because resolution absorbs whitespace, casing, and typo variations. Fuzzy string matching uses token-based comparison to handle word order and partial matching for abbreviations. At this stage the system only updates the node’s canonical_name property. No graph merges happen yet.

Resolution chains exact → fuzzy → semantic matching against same-type names to assign a canonical name (without ever merging nodes).

Often, you also keep track of a list of aliases for each node. Whenever you find a new hit via fuzzy or semantic match that doesn’t match the current canonical_name, you add it to the list of aliases. Like this, in future checks you can speed up matching by checking the alias list first.

Similar names are not strong enough evidence that 2 entities are identical. This is the line most people blur. Blurring it is what causes silent corruption.

Apple the company is not Apple the fruit. They have different types, so type-gating already separates them. A harder example is Jensen Huang the CEO of NVIDIA versus a doctor in Taipei with the same name.

They have the same name and the same type. Yet they are 2 different real-world people. Naming similarity alone would happily fuse them.

Still, canonical names are extremely useful for GROUP BY operations where, during querying and visualizations, we can quickly understand the data. During human review, we can even spot duplicates and resolve them manually.

That is why identity is a separate decision. Resolution has told us what to call the node. It has deliberately not told us whether the node is a duplicate.

That second, riskier question belongs to deduplication.

Deduplication: “Is This the Same Entity?”

Deduplication is the identity layer. It answers the harder question: “is this the same real-world entity?”. It is the step where merges actually happen [5].

In goes 1 embedded node. Out comes a single routing decision: merge into an existing node, flag it for review, or create a new node.

The system embeds the full entity context. It compares it against existing nodes using semantic and fuzzy similarity across that full context. The richer signal is what lets it distinguish 2 same-named, same-type entities that resolution could not.

By the context of a node, we refer to the entity’s attributes such as its text, image, video content or even its metadata properties such as a person’s email or date of birth. Or an object’s model or manufacturer. Still, you don’t want to embed everything, such as identifier, but per each ontology type pick the fields that contain the highest signal.

The combined deduplication score is an explicit weighted blend. It uses the embedding score multiplied by 0.7 and the fuzzy score multiplied by 0.3. Based on a similarity score from 0 to 1, we have 3 bands.

High confidence (≥0.95) triggers an auto-merge. Medium confidence (0.85–0.95) flags the pair for human review. Low confidence (<0.85) creates a new node.

Near-certain identity is allowed to merge automatically. The uncertain middle is escalated. Weak evidence just becomes a fresh node.

False merges silently corrupt the graph. The corruption is invisible until it is expensive. Take the Paris example: 2 LOCATION nodes both named “Paris”.

One is the capital of France, and the other is Paris, Texas. They have the same name, the same type, and very similar bare-name embeddings. But they are 2 different places.

The dangerous part is the middle band, the gray area. This is where the system is not sure and a human has to step in.

Deduplication scores full-context similarity, then routes to auto-merge, human review, or a new node.

When Confidence Lands in the Gray Zone

When a deduplication score lands in the medium band (0.85–0.95), the system deliberately does not merge. It flags the pair for a human to decide, as merging is a dangerous operation we should be really deliberate about.

The source node gets tombstoned, meaning it is kept queryable for forensics but skipped from future matching. Actually undoing a merge means re-ingesting the source data. That reversibility cost is the whole reason for the gray zone.

Whenever a new entity is flagged for human review, a new node is created and a (:Entity)-[:SAME_AS {status:'pending', confidence}]->(:Entity) edge is added inside the graph itself. The human review step transitions that status to confirmed or rejected. The review queue is just a Cypher query over pending SAME_AS edges, ordered by confidence.

For each flagged pair, the reviewer answers 1 question. Is this actually a duplicate, a new node, or neither?

This usually happens to entities that are related but not identical. The Codex model and the Codex CLI are related, but not the same object. The same applies to Jensen Huang the CEO versus a same-named doctor in Taipei.

This is hardest at the start of an entity’s lifecycle. When metadata is scarce, similarity spikes, and you risk polluting 1 node with another’s attributes.

Human review catches the uncertain pairs the live pipeline surfaces. But some duplicates never get surfaced at all. That is the gap the dream pipeline closes.

Cleaning the Graph While It Sleeps

While the system ingests documents, data often flows through in parallel. If 2 entities are processed at the same time, the resolution and deduplication steps never get to compare them against each other.

The system would never check whether Claude Code from Conversation X and Claude Code from Document Y are the same entity, because neither existed in the graph when the other was written.

You run a dream pass every night. It re-runs the deduplication pass on recently ingested nodes only. Otherwise, you will have to loop through all nodes in the graph. Which as the graph grows, becomes increasingly expensive.

It does not run the full resolution chain. Because the embeddings were already computed at ingest time, this is a light operation. It is primarily database reads and writes, not fresh model calls. Since it mostly adds I/O pressure, run it when organic traffic is low, which is usually during the night, hence the name the dream pipeline.

What’s Next

I’ve spent the past 4 months building unified memory layers on top of knowledge graphs, and I learned that keeping them clean is the hardest part. Keeping your knowledge graph clean is the maintenance step that decides whether the graph ever gets used. A graph full of noise, fragments, and false merges will not be trusted or queried.

In case you want to learn more, remember that we also covered the full memory-system architecture via knowledge graphs and the ontology design in prior articles.

But here is what I’m wondering:

What are the core strategies you’ve used to keep your knowledge graph clean and usable? Something close to our approach here, or something completely different?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

Iusztin, P. (n.d.). Understanding the Neo4j Graph Agent Memory System. Decoding AI Magazine. https://www.decodingai.com/p/understanding-neo4j-graph-agent-memory-system
Iusztin, P. (n.d.). Ship a Knowledge Graph Ontology in 5 Minutes. Decoding AI Magazine. https://www.decodingai.com/p/ship-a-knowledge-graph-ontology-in-5-minutes
POLE+O Data Model. (n.d.). Neo4j Labs. https://neo4j.com/labs/agent-memory/explanation/poleo-model/
How Entity Extraction Works. (n.d.). Neo4j Labs. https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/
Entity Resolution and Deduplication. (n.d.). Neo4j Labs. https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/

Images

If not otherwise stated, all images are created by the author.

Stop Chasing the Perfect Ontology

Paul Iusztin — Tue, 26 May 2026 05:00:51 GMT

For a while now I’ve been trying to build a proper memory layer on top of my research, writing, and content creation. Today it all lives in my Second Brain in Obsidian, where the primitives are files like notes, videos, and articles.

What I actually want is to shift those primitives from files to entities and relationships, such as people, locations, objects, topics, preferences, and facts. I want the memory to get closer to reality so I can watch how things evolve over time. I want a knowledge graph.

Everyone agrees knowledge graphs and GraphRAG provide a more performant substrate for a unified agent memory layer than plain RAG. But kicking one off is far harder. The resistance always collapses to the same wall: how you model your data. Your ontology is the hardest part of the system.

If you can’t define your ontology properly for your domain, the graph won’t represent the reality you want. The right entities and relationships simply aren’t there. As a result, GraphRAG ends up performing worse than the simple RAG you were trying to beat.

This translates straight to a memory layer. There’s no dodging it. Even if you stay file-only (a “virtual knowledge graph,” like an LLM knowledge base over your notes), you still hit the same data-modelling question: which primitives, and which entities, do you even extract?

The instinctive reaction is to design the perfect, complete ontology upfront. That’s exactly the trap that freezes the project.

The strategy is a not-overkill ontology. You need something flexible enough to kick off with almost no friction before you really know your domain, extending it with domain-specific detail as you explore your data.

Concretely, you use a small, fixed, generic, but extendable noun data model, known as POLE+O. Plus two core primitives, Preferences and Facts, for everything that doesn’t fit into the nouns.

You ship something that works, then add subtypes as a lightweight data-exploration step shows you where the generic types clash with your real data.

This approach lets you stand up a knowledge-graph memory layer for your own assistant without burning weeks on schema design. To build this, we first need to understand what an ontology actually is and why targeted models beat exhaustive ones.

Start Your Transition Into AI Engineering (Product)

This article showed how to design the ontology your knowledge-graph memory needs. My Agentic AI Engineering course shows the harness around it. I just released a free preview to build and run a working agent in 5 minutes.

You build a multi-agent system with two MCP servers (Research Agent + Writing Workflow), a deep research algorithm, an evaluator-optimizer loop, observability, and LLM-as-judge evals. Patterns required to ship AI.

Built for software, data engineers or scientists transitioning into AI engineering.

7 free lessons, 2 MCP agents ready for your GitHub portfolio. Part of our 35-lesson course. Rated 5/5 by 300+ students.

Start the free preview →

What Is an Ontology?

An ontology is the formal answer to 1 question. When you read the world, what do you write down as nodes, and what do you draw as edges? It specifies the kinds of things that exist in your domain, their properties, and how they relate to each other.

The ontology’s job is to map a targeted slice of the real world into the digital world. A good ontology is highly targeted to the problem you actually want to solve. If you over-model, you drown in noise and never ship. Plus, it get’s extremely expensive to extract and maintain the knoweldge graph. If you under-target, the graph doesn’t reflect the reality you care about.

An ontology is a deliberately narrow funnel from the real world into a queryable graph.

Look at concrete, shipped ontologies for real-world proof. The create-context-graph domain catalog made by Neo4j publishes 22 ready-made domain ontologies. Every single one lands at exactly 10 to 12 entity types. They use a shared 5-noun base plus only 5 to 7 domain-specific nouns.

For example, the Personal Knowledge domain models the world as Note, Contact, Project, Topic, Bookmark, and JournalEntry. The Agent Memory uses Agent, Conversation, Memory, ToolCall, and Session. The lesson here is that real ontologies are small on purpose. They capture only the entities required to answer the questions the system is designed for.

So if targeted and small is the goal, why does everyone — me included — reach for big and perfect first? That’s the trap.

The Overkill Trap: Why My Knowledge Graphs Never Shipped

When I first encountered the ontology concept, I assumed I had to study my domain in depth. I thought I needed to model all of finance, for example, and design the ideal ontology before working with any real data. You can’t actually do that before you have a system running and data to look at. You just pile up assumptions that mostly turn out wrong.

I got frozen. Every knowledge-graph solution I started stayed on my laptop and never got used, because I was waiting on an ideal ontology I could never reach. Without understanding the ontology, I couldn’t even write a decent extraction step to populate it. I was deadlocked, bringing 0 value.

The breakthrough was realizing I need a couple of models that let me start generic and extend over time. As I get more data, analyze it, and actually understand my problem, the schema evolves. Let’s meet the base model that lets you start in 5 minutes instead of 5 weeks.

The POLE+O Data Model

POLE+O is a tiny, fixed, top-level vocabulary that can classify almost anything you pull out of text. It stands for Person, Object, Location, Event, and Organization [2]. It originated in law-enforcement and intelligence analysis. The Organization type was added for general-purpose entity extraction. The point of a fixed base is queryability. There are always exactly 5 base nouns to filter on, so the graph stays answerable no matter how it grows underneath.

5 fixed base nouns, each extensible with optional subtypes

Person covers people, aliases, and personas. Object covers physical or digital things. Location covers places, addresses, and regions. Event covers meetings, transactions, and incidents. Organization covers companies, teams, and institutions. Two or three of these catch the overwhelming majority of what a personal assistant needs.

Here are POLE+O’s five base types and the default subtypes each one ships with:

Here’s the beauty of this approach. You extend the base nouns with your own subtypes, and that’s how you tailor a generic ontology to your specific domain. It works exactly like object-oriented programming. You start from base classes you adopt without thinking. Then you subclass into specifics as your use case clarifies.

You can kick off with nothing extended and add concrete types only as you understand your data better. Neo4j’s agent-memory library uses precisely this approach. POLE+O is its default, swappable ontology.

The data-exploration workflow runs in a simple loop. First, kick off with generic POLE+O. Second, run an exploration extraction over your real data. Forget production reliability. You only care about understanding what’s there. Third, inspect the graph for clashes where the generic model lies about your data. Fourth, add or rename subtypes to fix each clash. Finally, repeat the process. You won’t get it perfect, and that’s the point. You iterate like any other AI app instead of freezing.

You discover subtypes by watching where generic POLE+O mislabels your real data, then patch the clash and loop.

Look at named examples from real extraction runs. Claude Code comes back tagged as a Person when it’s clearly an Object. The “AI Engineer” conference lands as an Event when you wanted an Organization. DeepSeek is tagged a Person, not an Object.

Portugal and New York both get a flat Location label even though one’s a country and one’s a city. An agentic harness shows up as a generic Object when, for knowledge work, you’d rather have a Topic type. Each clash is a signal to add 1 subtype, not to redesign the whole schema.

POLE+O nouns and their subtypes cover the things in your world. But to fill in the gaps there are two specials tricks we have to go over.

Preferences: The Things a Noun Likes

Preferences are the second family of entities you attach to the graph. They are things a noun likes or dislikes. A Preference is a characteristic of an entity. It represents a stance. The canonical case is a person who likes, prefers, or dislikes something.

Concretely, a Preference entity looks like this:

category groups the preference, preference is the statement itself, and context optionally records when or where it applies. confidence runs from 0 to 1. The embedding makes it semantically searchable.

Make it concrete. “Loves Italian food”, “prefers dark mode”, and “dislikes long meetings” are clear examples. Each is a stable stance the assistant should remember and adapt to.

By default, a Preference hangs off the Person. That’s the most common and useful case. You can extend preferences to other objects, like an Organization’s policies, a car’s settings, or an Event’s dress code.

Because I’m building a personal assistant, I start by attaching Preferences only to the Person. This keeps the graph clean, low-noise, and small. I’ll extend it later only when a concrete use case demands it.

Preferences attached only to the user. The dotted edges are extensions you add only when you need them.

Preferences are the personalization layer. They act as the memory of the user’s stances. They are the “sweet sauce” that makes every future response feel tailored.

There is one issue. Plenty of useful knowledge is just an atomic fact. Forcing all of that into the ontology is how graphs explode in complexity. The fix is a deliberately generic primitive.

Facts: The Trick You Haven’t Thought Of

The Facts entity is the fallback for everything that doesn’t cleanly fit a noun or a Preference. You drop the claim into a generic Fact. This is the move that keeps the ontology small and stops you from over-thinking the schema.

A Fact is the closest thing to a classic-RAG chunk. An LLM produces each Fact during extraction. Each Fact holds a single, atomic concept which works like a charm via semantic search.

The beauty is that with facts you avoid the usual chunking errors, such as splits mid-thought, mixed concepts, and arbitrary boundaries. In reality, a Fact is a triplet. A subject, predicate, and object like “Eiffel Tower / is / 330m tall” gets embedded and stored as 1 granular unit.

Here is the shape of a Fact entity:

The triplet — subject, predicate, object — is the whole fact. valid_from and valid_until give it optional bi-temporal validity. The embedding, computed over the concatenated triplet, is what makes the fact retrievable by semantic search.

It’s confusing that we have a triplet stored as a node. But this is what it makes it flexible. We don’t worry about modeling these one-off triplets directly into the ontology, but the LLM extracts them as-is from the text.

Facts are usually wired to nothing. They have no relationships to other entities. They are retrieved only via semantic search and text search. A Fact stays in the graph but is independent of it. This works because a graph store runs vector search and graph traversal in the same query engine [4]. Which means facts are retrieved only via semantic/text search.

Facts are atomic triplets retrieved by similarity and wired to nothing; POLE+O entities are reached by walking the graph. Same store, two retrieval modes.

Facts let you ship a memory layer before you have the perfect ontology. Anything you can’t yet model degrades gracefully into a searchable atomic node instead of blocking the build. Early on, you lean on Facts. As the graph matures, claims migrate toward typed entities and edges. It costs nothing to schema and nothing to maintain when entities merge or get deleted.

What’s Next

The takeaway is the posture. An ontology is a living artifact you bootstrap from a fixed generic base and grow through a data-exploration loop, exactly like any other AI application.

If you want to see the whole strategy implemented, the fastest path is to play with Neo4j’s agent-memory SDK or its MCP server. It uses POLE+O as a swappable default, subtypes as cheap extensions, and Preferences and Facts as first-class primitives. Studying it is what made all of this finally click for me.

I’m actively migrating my own Obsidian Second Brain toward the POLE+O, Preferences, and Facts primitives. This turns thousands of files into a graph I can actually traverse, visualize, and watch evolve over time.

But here is what I’m wondering:

If you worked with Knowledge Graphs, what was your process in discovering your own ontology?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

Create Context Graph. (n.d.). Domain Catalog. create-context-graph. https://create-context-graph.dev/docs/reference/domain-catalog
Neo4j Labs. (n.d.). POLE+O Data Model. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/poleo-model/
Neo4j Labs. (n.d.). Neo4j Agent Memory. GitHub. https://github.com/neo4j-labs/agent-memory
Neo4j Labs. (n.d.). Why Neo4j? Graph-Native Memory Architecture. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/graph-architecture/

Images

If not otherwise stated, all images are created by the author.

Inside Neo4j's Agent Memory

Paul Iusztin — Tue, 19 May 2026 08:55:51 GMT

I already have a second brain setup based on Obsidian, Readwise, NotebookLM, and Claude Code. I dump all my notes, research, and highlights there. Whenever I want to create content, I create a scoped wiki targeted toward the topic. I gather information from my second brain using a deep research algorithm on top of my private data and external resources via NotebookLM. The wiki is structured like the LLM Knowledge Base presented by Andrej Karpathy.

This setup fails to extract and maintain shared entities, preferences, and facts across the wiki as the knowledge base grows. For example, if the topic “Claude Code” is mentioned in 10 documents, I want to extract all the metadata about it into its dedicated folder. I want to see what other entities it relates to, such as Anthropic, San Francisco, Codex, or Gemini CLI. I also want to see how many documents mention it to rank frequency. You can do that with a pure file-based system and Obsidian, but performance degrades when your data scales past 50 documents.

The same concept applies to any unstructured knowledge base. You need a way to extract and connect knowledge from your conversations, documents, and images. This becomes essential between conversations so your agent doesn’t forget you. Instead, it provides a personalized experience. It’s also critical for context engineering to inject the right context at the right time and keep the LLM focused on relevant facts.

Most teams default to one of two memory approaches. Both collapse under real use. A file system gives you append-only logs that the agent re-reads from scratch, which fragments and rots context.

A vector index gives you fuzzy semantic recall but no merge, no identity, and no way to know if this is the same Karpathy you knew yesterday. Durable AI memory requires a structured graph to track identity and relationships [1]. Without this structure, the assistant forgets past interactions and fails to build compounding intelligence.

Knowledge-graph memory is the next step on the arc from Retrieval-Augmented Generation (RAG) to agentic RAG to agent memory [2]. Building a unified knowledge-graph memory system is hard, so most teams skip it.

During my research, I stumbled upon neo4j-labs/agent-memory. It’s a masterpiece. Who knows more about knowledge graphs (KGs) than Neo4j?

After I spent 2 days playing with it and understanding the codebase, I realized it was the perfect mental model for any agent memory system powered by KGs.

In this article, I’ll walk through the core architectural patterns of neo4j-labs/agent-memory. It features 1 graph, 3 memory tiers, the POLE+O ontology, a 3-stage extraction pipeline, a composite resolver, and the SAME_AS pattern.

By the end, you’ll have a concrete mental model. You can ship on top of their Software Development Kit (SDK) or hook it into your agent via their Model Context Protocol (MCP) server. Alternatively, you can steal the patterns and ship the same architecture on Postgres or MongoDB if a full graph database in production doesn’t make sense for your use case.

Start Your Transition Into AI Engineering (Product)

This article shows the memory layer your agent needs. My Agentic AI Engineering course shows the harness around it, and I just released a free preview that lets you build and run a working agent in 5 minutes.

Built for software, data engineers or scientists transitioning into AI engineering.

7 free lessons, 2 MCP agents ready for your GitHub portfolio. Part of the 35-lesson course. Rated 5/5 by 300+ students.

Start the free preview →

What’s Inside `neo4j-labs/agent-memory`

The SDK takes natural-language interactions on the write side and returns a fused memory context on the read side. Everything anchors to a single Neo4j graph. For our scoped wiki, notes and Readwise highlights about Claude Code flow in. A structured pull of what the agent knows about Claude Code, how it relates to Anthropic, and its frequency across 50 documents comes out.

At its core, there is 1 graph and 3 memory tiers joined by typed edges: short-term conversations, long-term typed entities, and reasoning traces. They’re stitched together by :MENTIONS, :TOUCHED, and :INITIATED_BY relationships [3].

The architecture contains 8 small, single-responsibility modules. The models/ module holds Pydantic schemas. The schema/ module handles Cypher migrations. The extraction/ module runs the Named Entity Recognition (NER) pipeline. The resolution/ module holds the composite resolver. The dedup/ module manages the SAME_AS pattern. The core/ module provides MemoryClient.get_context(). The mcp/ module runs the FastMCP server with 15 tools. The integrations/ module holds 9 framework adapters for tools like LangChain and LlamaIndex.

The 8 modules sit between an MCP / framework interface and a single Neo4j graph that holds all three memory tiers.

Consider an end-to-end scenario. You drop a Readwise highlight about Claude Code into your scoped wiki. The extraction/ module pulls Claude Code as an Object, Anthropic as an Organization, and Codex as an Object. The resolution/ module canonicalizes each against existing nodes. The dedup/ module checks vector similarity and either auto-merges or flags a pending :SAME_AS edge. The schema/ module commits :MENTIONS edges from the note to each entity.

Later, MemoryClient.get_context() pulls fused context across the same graph in one call. This matters concretely for the scoped-wiki agent. You can ask what you discussed last session, what you know about Claude Code, and why the agent surfaced a Codex comparison last Tuesday. The SDK answers all three against the same graph. It uses the same Cypher dialect with no cross-store join logic.

Short-Term, Long-Term, Reasoning Memory

The SDK splits memory into three layers that all live on the same Neo4j graph [3]. Short-term memory is the linear message sequence. It uses ordered :Message nodes chained by :NEXT edges, scoped to a :Conversation. Long-term memory is the typed entity graph. It uses deduplicated :Entity nodes with vector embeddings and arbitrary domain relationships.

Reasoning memory is a tree per agent run. It uses a :ReasoningTrace root with child :ReasoningStep nodes capturing thoughts and tool calls. For the scoped-wiki agent, short-term memory holds your current chat. Long-term memory holds the canonical Claude Code entity plus its relations to Anthropic, San Francisco, Codex, and Gemini CLI. Reasoning memory holds the trace of how the agent picked those specific notes to answer you.

Three relationships do the entire stitching. The :MENTIONS edge joins short-term to long-term memory. The :INITIATED_BY edge joins reasoning to short-term memory. The :TOUCHED edge joins reasoning to long-term memory. These three edges make provenance a one-hop query rather than a log-reconstruction project.

Three tiers, one graph — the typed edges (:MENTIONS, :INITIATED_BY, :TOUCHED) make every cross-tier question a one-hop query.

Reasoning memory is the novelty from this architecture. By storing past successful or failed thinking patterns into the memory, the agent can one-shot future similar requests or at least know not to repeat similar mistakes. Intuitively, it’s similar to Reinforcement Learning (RL), but instead of baking the optimizations into the weights, you do it at the database level.

The most important part of this architecture is the ontology.

The Ontology

The long-term memory uses a closed five-type vocabulary for its ontology known as POLE+O. It uses Person, Object, Location, Event, and Organization, borrowed from intelligence-analysis taxonomies [5]. Every entity is exactly one of these five types. Subtypes are open, but the top-level vocabulary is fixed.

In the personal assistant, Karpathy is a Person. Claude Code is an Object. Anthropic is an Organization. Your Tuesday deep-research run is an Event. San Francisco is a Location.

Type and subtype materialize as multi-tier Neo4j labels. The query builder sanitizes and PascalCases them into labels like :Entity:Person:Individual. You can search by type or subtype, making this solution highly efficient.

Using this strategy, you can extend each core type from POLE+O with your own custom domain. Other defaults are: :Entity:Location:City, :Entity:Event:Concert, :Entity:Organization:Company, etc. Here is a catalog of over 20 domains such as Data Journalism, Gaming, Personal Knowledge, and Product Management.

Entities modeled via POLE+O are nouns. The SDK adds 2 other node types beyond entities.

:Fact nodes hold every claim mentioned in the text. They’re intentionally generic so the ontology doesn’t get over-specified. They serve as a fallback when nothing else fits. You can intuitively see them as chunks of text that contain only 1 concept.

Then there are :Preference nodes that store user preferences via a SUPERSEDED_BY relationship. As agent memory is user-centric, this provides the WOW effect where the agent remembers past preferences and learns from them over time.

For the scoped wiki, “Anthropic developed Claude Code” is an edge. “Claude Code 1.0 shipped in 2025” is a :Fact. “I prefer agent-harness comparisons over pure benchmarks” is a :Preference.

A scoped-wiki graph built from the five POLE+O types — every node is exactly one of Person, Object, Location, Event, Organization, and every typed relationship is a :RELATED_TO edge with the semantic name carried as a property.

Extraction: From Raw Text to Typed Entities

The SDK runs entity extraction as a speed-versus-accuracy ladder. It uses spaCy for fast statistical NER. It uses GLiNER and GLiREL for zero-shot extraction. It uses an LLM stage for cases that need real semantics and to extract the relationships between them [6].

Each stage maps its outputs back to POLE+O types. It uses explicit merge strategies when 2 extractors disagree. When you drop a Readwise highlight about Claude Code into your scoped wiki, spaCy lifts proper nouns like Anthropic and San Francisco. GLiNER catches domain entities like Claude Code and Gemini CLI. The LLM stage only fires when the previous 2 stages leave ambiguity, or when the model needs to extract relationships.

From raw text to a clean graph — the three-zone SAME_AS pattern is what stops the same entity from becoming three nodes.

Routing every mention through an LLM would multiply extraction cost massively for marginal recall on rare entities. The ladder pushes high-confidence cases to cheap models. It escalates only ambiguous mentions to the zero-shot models and reserves the LLM stage for when real semantics matter.

The real problem is at the normalization step.

When Two Mentions Are the Same Entity (And When They Aren’t)

Resolution and deduplication are 2 different problems. Resolution sets a canonical string property on an existing reference. Deduplication decides whether a new node gets created at all. Conflating them is how graphs end up with 3 Anthropic nodes that none of your queries find together [7].

Resolution runs 3 strategies on the name field in cost order. Exact matches existing canonical strings. Fuzzy uses RapidFuzz string similarity for surface variants like “A. Karpathy” and “Karpathy, Andrej”. Semantic falls back to embedding similarity for cases like “the founder of Eureka Labs”. It only matches between nodes of the same type, meaning a Person only resolves against Person candidates.

After resolution runs, two mentions like “Apple” and “Apple Inc.” end up with different surface names but the same canonical name. That’s why a second step is needed. Deduplication looks at the semantics, not just the name.

Same name, three outcomes: High similarity auto-merges, the middle band defers to a human, and low similarity creates 2 nodes that share a canonical name but live as separate referents.

For deduplication, the SDK uses vector and fuzzy similarity across the entire node content. This ensures the node is actually the same, not just a name coincidence. In other words, this avoids false positives. Using vector and fuzzy search, the SDK computes a score.

Scores at or above 0.95 trigger an auto-merge. Scores below 0.85 create a new node. Scores between 0.85 and 0.95 don’t silently merge. Instead, they create a :SAME_AS edge with a pending status. This flags the edge for a human or downstream agent to resolve later. This pattern stops “Jensen Huang the NVIDIA CEO” from merging with “Jensen Huang the Taipei dermatologist” just because their embeddings landed 0.91 apart [7].

A false merge is silent and unrecoverable. A false split is noisy but recoverable. You can’t undo a false merge without re-ingesting from the raw source data. That’s why you should leave uncertainty to a human.

Zooming into the Retrieval Algorithm

Because all three tiers live on one graph, a single retrieval can compose vector similarity over :Entity embeddings, multi-hop expansion over typed relationships, time-ordered :NEXT conversation walks, and reasoning-trace lookups via :INITIATED_BY and :TOUCHED joins. All of these run as steps in the same Cypher query. Neo4j 5.20 introduces db.index.vector.queryNodes, making vector similarity a first-class graph operation [4].

When you ask what you know about Claude Code, how it relates to Codex and Gemini CLI, and why you looked at it last week, the agent fuses three things in one pull. It uses vector similarity over your Readwise highlights to surface relevant passages. It uses a multi-hop traversal of :DEVELOPED_BY and :COMPETES_WITH edges to bring in Anthropic and Codex neighbors. Finally, it uses an :INITIATED_BY jump back to the prior conversation that discussed agent harnesses. There’s no cross-store join logic and no orchestrator.

From our tests, the library leaves the context construction to the user of the SDK. In other words, you get the whole output from the graph, and it’s your responsibility to further compress it before passing it to the LLM.

What’s Next

The neo4j-labs/agent-memory architecture is more complex than what this article covers, but this is the core idea behind it. I’ll cover other components in more depth in future articles, including designing the ontology and keeping your knowledge graph clean over time.

I think this open-source repository is a perfect blueprint you can take to build your own agent memory solution, even with Postgres or MongoDB, to avoid keeping multiple databases in production. Still, Neo4j is probably the best choice for data mining and exploration.

For small to medium-scale projects with thousands of nodes and short hop traversals, I’d probably build my own agent memory solution from scratch on top of Postgres or MongoDB. I’d reach for Neo4j as an internal tool within my organization, or when the scale or complexity becomes too large for Postgres or MongoDB.

But here is what I’m wondering:

How are you handling agent memory today? Flat files, a vector index, a knowledge graph, or something stranger?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

Seale, T. (n.d.). This week Anthropic dropped Claude Sonnet 4.5. LinkedIn. https://www.linkedin.com/posts/tonyseale_this-week-anthropic-dropped-claude-sonnet-activity-7379787334398926848-iVOE/
Monigatti, L. (n.d.). The Evolution From RAG to Agentic RAG to Agent Memory. Leonie Monigatti. https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html
Neo4j Labs. (n.d.). Understanding the Three Memory Types. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/memory-types/
Neo4j Labs. (n.d.). Why Neo4j? Graph-Native Memory Architecture. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/graph-architecture/
Neo4j Labs. (n.d.). POLE+O Data Model. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/poleo-model/
Neo4j Labs. (n.d.). How Entity Extraction Works. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/
Neo4j Labs. (n.d.). Entity Resolution and Deduplication. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/

Images

If not otherwise stated, all images are created by the author.

From Vibe Coding to a Real Engineering Team

Paul Iusztin — Tue, 12 May 2026 11:04:20 GMT

I needed a TypeScript harness for my latest book code. It required a Terminal User Interface (TUI), an agent loop, tools, Model Context Protocol (MCP) support, skills, and slash commands. I will be honest with you. I first tried to vibe code this project.

As I knew what I was looking for, it worked. Until it didn’t. The code was working until you started looking more closely at the details. Only the first 20 characters were rendering inside the TUI, and the skills weren’t invoked by the agent loop.

So I deleted the whole code base and started over with a new strategy.

The cost of vibe coding isn’t abstract. It’s the next feature you can’t ship because you’re debugging a slash-command renderer that looked finished. This is what most people get wrong. Output that compiles and looks done breaks the moment you reach for the rough edges.

I divided the harness into tasks. I one-shotted the barebones version, which was just a TUI plus an agent loop with bash, grep, and a todo tool. Then I layered MCP, skills, and slash commands as separate features.

You can’t one-shot whole applications. You can one-shot big features if you scope them right and run them through a real engineering process.

This is known as agentic coding. Not vibe coding. You’re using agents to write the whole codebase, but you are still the mastermind behind everything.

But I wanted more. I wanted to automate this process. But with a single constraint in mind: “the code should HAS to be good”.

That’s why I built Squid. It’s an opinionated six-agent Claude Code setup available at iusztinpaul/squid. It ships features the way a real software team ships them.

Squid has already shipped our content-automation tool, expanding it from articles to posts, notes, threads, and messages. It shipped the book’s code data pipelines and TypeScript harness.

In this article I will show you how it works.

The concrete blueprint relies on a specialized team and an e2e lifecycle.

Start Your Transition Into AI Engineering (Product)

Squid applies the multi-agent pattern to coding. My Agentic AI Engineering course applies it to writing, and I just released a free hands-on lesson that distills the whole system.

You build a multi-agent system composed of two FastMCP servers (Deep Research + LinkedIn Writer) orchestrated by a harness, plus an observability and evals layer on top. The shift from classic backend/frontend stacks to MCP servers and harnesses is the pattern shaping modern agentic AI.

Built for software and data engineers moving into agentic AI engineering.

Part of the 35-lesson course. Rated 5/5 by 300+ students. First 7 lessons free.

Start the free lesson →

The Six Agents Engineering Team

The system contains six agents. No agent both writes code and decides whether the code is correct.

My Agentic Engineering Team

The product manager agent manages the tasks and ensures the feature adheres to the software architect’s specifications. It takes a raw feature specification, writes or updates an Architecture Decision Record (ADR) for non-obvious choices, and splits the feature into ordered tasks. It also maintains the Domain-Driven Design (DDD) glossary so vocabulary stays consistent between the business and engineering.

Note how, because Claude can easily handle both PM and software architecture work, we decided to merge these roles together. We did this to avoid fragmenting the context just to follow a standard human process. Ultimately, planning should be closely aligned with the software architect’s vision. In human processes, dividing these two responsibilities often created more issues than solutions.

The software engineer agent uses red-green Test-Driven Development (TDD). It writes the failing test, writes the minimal code to pass it, and then refactors. The software engineer uses direct command-line interfaces (CLIs) like git, mongosh, and gh. It never uses MCP wrappers. CLIs are more flexible because they tap directly into the power of bash. Plus, LLMs have seen considerably more bash code than MCP wrappers during training.

The tester agent specializes in the adversarial end-to-end edge-case pass. It catches false-confidence claims where the software engineer says the tests pass. It does this by reading every acceptance criterion against concrete evidence, like the test name, file lines, and command output.

The pull request reviewer agent performs a diff-only review. It looks for dead code, duplication, missing test coverage, and documentation adherence. It does a narrow performance review on hot paths only. It’s explicitly told not to micro-optimize one-off scripts.

The on-call agent loops on the Continuous Integration (CI) pipeline until it passes. In an earlier iteration, the CI check lived inside the software engineer and tester loop, and it got skipped constantly. Promoting it to a dedicated agent invoked by the orchestrator increased the probability the step runs.

The self-improve agent is an optional meta agent. After the feature is done, while looking over the results, the human can run the self-improve agent to scan the run for high-signal lessons and propose updates to the agentic coding layer that consists of CLAUDE.md, skills and subagents. This is a double-edged sword. It can constantly improve your workflow or quickly degrade it if you are not careful. That’s why it’s incredibly important that this step is gated by a human.

The secret sauce is in anchoring the agents into your own documentation.

Keeping Up With Documentation: ADRs & DDD Glossary

The ADR directory acts as compressed architectural memory across runs. Every non-obvious choice regarding the datastore, synchronization defaults, authentication boundaries, or dependency lock-in ships with an ADR. These records include the status, context, decision, and consequences at docs/adr/.md. The product manager reads the directory before grooming a new feature, so decisions stay consistent across feature branches.

The DDD glossary gives shared vocabulary between the business and engineering at docs/glossary.md. It enforces one canonical name per concept. Code identifiers, OpenAPI schemas, database columns, and customer-facing interfaces all use the term exactly as it appears there. This gives Claude Code business context, not just code context, properly anchoring your code in your domain. The software engineer, tester, and pull request reviewer all reason about the same domain.

I have an honest caveat. The agents still under-use both the ADRs and the glossary. The spine exists, but I am still working on getting the agents to lean on it consistently.

Now the agents have the context they need to execute a feature from a raw specification all the way to a merged pull request.

The Night Skill. The End-To-End Workflow.

The /night skill takes one input, which is a feature specification written by the human, and produces one output, which is a merged pull request with green CI. Everything in this section sits between those two endpoints.

The /night pipeline is a long-running lifecycle. That’s why it’s called the “night” skill. It’s scoped to run for hours at a time, often with multiple pipelines in parallel.

It has two human checkpoints and five retry caps, while everything else is automated. The orchestrator acts as a manager. It never writes code itself, never runs tests itself, and never reviews the diff itself. It launches agents and enforces human validation.

After a human carefully writes a detailed feature specification, it calls the /night skill, which creates a new branch and worktree. The product manager reads the glossary and ADR directory, updates or writes a new ADR if needed, and splits the feature into a task plan.

Then we hit the first human gate. The user approves the plan, optionally sharpened by the /grill-me skill. The /grill-me skill is inspired by Matt Pocock’s work, which forces the agent to ask sharp questions back about anything fuzzy in the plan, such as interfaces, modularization, or new tools. This conversation is the line between vibe coding and agentic coding.

Next is the inner loop per task. The software engineer implements the code, the tester verifies it, and failures route back to the software engineer. This loop is capped at 5 attempts. Convergence is mostly mechanical through a run, fail, fix, and run cycle.

The product manager then performs an acceptance review on the whole feature from the user’s perspective. Rejections are packed into a single task back into the inner loop. This is capped at 3 attempts, because judgment-call loops are where Claude Code spirals.

Next, we repeat a similar loop using the PR reviewer agent, which looks at the diff, with a maximum of 3 attempts to avoid perfectionism. Adding a maximum number of attempts here is critical, because during review an LLM almost always has something else to say.

After the push, the on-call agent watches CI with a maximum of 5 attempts, routing failures back to the software engineer.

When the CI is green, we notify the user (e.g., via Slack) that the PR is ready for review. Optionally, based on any potential issues found while running the /night skill, we run self-improve to propagate that into your memory.

The /night lifecycle. Two human gates, five retry caps, everything else automated.

My Agentic Coding Setup

Beautiful! With this process I one-shot most of the features I am working on. And when it’s not a one-shot, I’m typically 95–99% there by the time I review the PR.

How the Tester Stopped Re-Running What the SWE Already Ran

The biggest problem with the e2e workflow above is that it’s slow and redundant. I preferred that over generating AI slop that I have to manually review and fix.

Still, there are a few tweaks that we can make to the workflow to improve speed and efficiency.

For example, when the tester re-ran the linter, type checker, formatter, and the happy-path suite that the software engineer had already run, we paid for everything twice. This was the number-one source of having a system that works but is too slow to use.

To fix this, the tester now accepts the software engineer’s reports for formatting and happy-path tests. It only runs the adversarial end-to-end edge-case pass itself. This covers the part the software engineer can’t credibly self-verify. Trust is bounded. Intuitively, I realized I’d started shifting the Tester toward QA-style practices, rather than just running simple tests.

I am still iterating on optimizations. For example, I want to route some subagents to Claude Sonnet models instead of Claude Opus. I also plan to narrow toolsets per role to reduce reasoning failures.

Also, depending on what you are working on, you might want to use the system more as a fast, snappy assistant than as a long-running workflow that prioritizes correctness above all.

Day vs. Night: Two Orchestrators, One Team

That’s why we have two pipelines running the same agents. The /night skill is the full lifecycle. It’s long-running, set-and-forget, has two human gates, and runs while you are away from the keyboard or working in parallel.

The /day skill is the lean inner loop. It runs the software engineer, the tester, and human commits for surgical edits. It skips product manager grooming, the pull request reviewer, and the on-call agent.

There is a concrete use case for the /day skill. When I read a merged pull request and find code I don’t like, the /day skill runs the stripped software engineer and tester loop to apply targeted edits. Then the on-call agent cleans up any CI fallout. This is the surgery that keeps the system from becoming a black box.

Day vs. Night: Same agent team, two orchestrators tuned for different workloads.

Both pipelines have one thing in common. The human is in the loop on purpose, not as a fallback.

Why Code Templates Are a Waste of Time in 2026

Most teams are still scaffolding from cookiecutter templates that were outdated the day they were committed. This is a maintenance tax disguised as productivity. Squid stops paying that tax. Technology moves fast enough that any frozen template’s frameworks, tooling, interfaces, and opinions all need their own maintenance pipeline. That’s only worth it if one template fans out across dozens of projects.

A Copier or cookiecutter template isn’t free. I tried scaling one across Python, TypeScript, and Go. I watched the project balloon into a maintenance burden where most files would never be used. Maintaining a template engine to support multiple stacks is a full-time job.

Asking Claude Code to copy from the last project fails too. It propagates the technical debt baked into the source codebase. You inherit the mess, not the ideal state.

The real shift relies on markdown, not Jinja. I call these agentic templates.

You encode good practices as skills and CLAUDE.md files. Fundamentals like clean architecture, CI/CD discipline, testing patterns, and development cycles rarely change. When they do change, you edit prose instead of regenerating from a template engine that quickly slides into dependency hell.

Tooling stays dynamic. You don’t pin framework versions inside a template. You keep a decision tree of allowed choices and let the agent pull the latest interfaces on demand via Context7 at scaffold time.

Project structure can’t be templatized. The anti-pattern organizes by type, putting files into agents/, nodes/, schemas/, and tools/ directories. One business module’s logic ends up scattered across four folders, forcing both humans and the agent’s context window to thrash.

The correct pattern organizes by actionability, keeping one bounded context per directory. Each domain owns its own types, store, Application Programming Interface (API), and prompts. That’s locally readable, easier to maintain, and easier for the agent to reason about.

Because we describe the structure in Markdown files instead of cookiecutter templates, we can define it like this:

Avoid global dumping grounds like utils/ or helpers/. Avoid a root-level types.py grab bag. Avoid grouping tests by type.

The /scaffold skill acts as an interactive bootstrap. An AskUserQuestion prompt drives a tight decision tree covering project identity, layout, components, backend, frontend framework, infrastructure, agent team, tracker, ADR and glossary opt-ins, and external services. A deterministic table picks only the matching specifications from the specification library. Unused categories never enter the context. The skill writes a tailored CLAUDE.md brief, lays down an empty folder skeleton, and hands off.

Then, based on the agentically generated template, you can use /night or /day to start writing real code.

Open-Sourcing Squid

I don’t want to keep Squid for myself. I want to share it with the community to learn from and contribute to.

Thus, I am open-sourcing Squid.

You can install it as a Claude Code plugin:

/plugin marketplace add iusztinpaul/squid
/plugin install squid@squid

I want you to try it, build something awesome with it, and if you like it, contribute back:

Check the full codebase

Still, here is what I’m wondering:

What is your agentic coding setup? How is Squid different from your own approach?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Images

If not otherwise stated, all images are created by the author.

Building Agentic GraphRAG Systems

Paul Iusztin — Tue, 05 May 2026 05:01:08 GMT

I gave this talk twice in one month: at O’Reilly’s Context Engineering Event and at Abi Aryan’s Maven course on LLM inference at scale. After being blasted with questions, I realized something: GraphRAG isn’t a retrieval algorithm, it’s a data modeling problem.

Powering agents with knowledge graphs (KGs) and ontologies is still an unsolved problem. All the engineers I spoke to want GraphRAG, but don’t know how to implement it.

But at its core, we should ask a different question. Why do we even need GraphRAG in the first place? Why complicate our solution over a simple RAG system?

There are three core reasons.

First, you face context rot. As the context window fills, the signal-to-noise ratio collapses. The LLM degrades.

You pay for this degradation in quality, cost, and latency [1].

Second, you face data fragmentation. In the agent era, your data lives in silos most builders share: documents, notes, research, emails, and text messages. We are no longer lucky enough to have all the data nicely stored in a single database.

Third, the agent’s unified memory naturally maps to a knowledge graph (KG). People have preferences and experiences. They went into specific locations, met with other people, or have a list of items to do. Things get trickier when “Arthur told Felix that his favorite coffee shop is in the center of Timisoara”, but after two months “it moved to Lisbon”. You need to start tracking relationships between people, locations, and most especially how these relate in time.

GraphRAG solves all three.

This is a data modeling problem, not a retrieval algorithm. It took a painful LangChain detour and a hard MongoDB RAM conversation to settle that for me. You need an ontology.

Image 2: The full GraphRAG system architecture.

By the end of this article, you will learn about ontology-first design, the three extraction modes, append-only data models, and hybrid retrieval joined by Reciprocal Rank Fusion (RRF). Finally, you will see how to expose the GraphRAG engine as a unified memory layer via an MCP server to power your agents. In other words, how to do agentic GraphRAG.

Before walking through the architecture, let’s understand why the story has to start from the ontology.

Build Your Own Multi-Agent System Free Workshop (Product)

This article shows what an MCP-served unified memory looks like end to end. If you want to actually build agentic systems with MCP servers like this, I open-sourced a hands-on workshop for that.

Two MCP servers from scratch: a Deep Research Agent (Gemini + Google Search grounding) and a Writing Workflow with an evaluator-optimizer loop.

Packaged with slides, a ~2-hour video, runnable reference code, and an “implement-it-yourself” skeleton via agentic coding best practices (25 tickets, one orchestrator skill, and two agents: SWE and tester).

Originally presented at the AI Engineering Conference Europe. 200+ stars on GitHub. Free.

Go to workshop →

Why the Story Starts From the Ontology

Whenever you need to connect dots across a corpus of multiple documents rather than find the most relevant paragraph, you go for GraphRAG. Knowledge is stored as entities and edges.

You traverse connections rather than find similar text.

An ontology is a collection of classes and the relationships allowed between them. If you come from object-oriented programming, you already have the right intuition.

Throughout this article, we will build a digital twin. My favorite example. We will define a Global Ontology of six entity types organized into two sub-ontologies.

The data pipeline deterministically constructs the Document Ontology. It contains DOCUMENT and CHUNK nodes. It uses PART_OF, NEXT, REFERENCED, and MENTIONS edges.

The LLM extracts the Person Ontology. It contains PERSON, TASK, EPISODE, and PREFERENCE nodes. It uses RELATED_TO, TODO, EXPERIENCED, and HAS edges.

The schema is flexible. You define it for your business case. Every section after this one assumes these exact node and edge labels.

Image 3: Left shows the Global Ontology split into a Document Ontology and a Person Ontology. Right shows an instantiated KG with nodes wired together via the eight typed edges.

Skipping the ontology carries a heavy cost. I tried LangChain’s MongoDBGraphStore, which lets the LLM extract entity and relationship types freely. Five documents produced 17 node types and 34 relationship types.

This included part_of, Part Of, and part of as three separate types. The underlying data model does not enforce a schema at the storage layer.

With an ontology, the LLM can only extract what you defined. The constrained scope also allows you to use cheaper extractor models.

That’s why GraphRAG is the right tool when you have a clearly defined schema. It works when you need to identify relationships.

It reduces hallucination on complex queries that span interconnected facts. Domains where knowledge graphs naturally fit are legal, medical, financial, business operations, productivity tools and in my opinion, the crown jewel: personal assistants. With a KG, you can naturally build the unified memory of your personal assistant to properly remember what you like, what you did, and what you have to do, all anchored in time.

For example, Palantir built its empire using ontologies. Google uses KG to power its search, and Microsoft uses it in its internal ops tools.

With the ontology defined, the next architectural choice is the shape of the graph itself and how to extract those entities from raw text.

RDF vs. Property Graphs, and the Three Extraction Modes

Image 4: RDF vs. Labeled Property Graph on the same Arthur fact. RDF explodes every property into its own triplet. Property Graphs attach properties to the node. Agent stacks use property graphs in practice.

Every graph is structured as a collection of (entity, relationship, entity) triplets. But there are two ways to attach data to each entity or relationship instance, known as Resource Description Framework (RDF) and labeled property graphs.

RDF attaches each piece of metadata as another triplet. The graph explodes in size. Property graphs attach metadata as JSON on the entity or relationship.

In practice, GraphRAG and agents use property graphs [3].

Now, during extraction, where we actually map data into our (entity, relationship, entity) triplets, plus their corresponding data, we have three core methods.

Structured extraction is schema-guided. The LLM outputs entities per the Person Ontology.

Semi-structured extraction uses metadata and lineage without an LLM. You parse the email’s links and attachments.

Unstructured extraction uses an LLM without a schema. The LLM invents its own labels. This is useful for discovery, not for grounded retrieval. In other words, we use the LLM to extract triplets without an ontology. Exactly what we said to avoid in the previous section.

Here is the data-source mapping for the Person Ontology of the digital twin:

Table 1: Data-source mapping for the digital twin.

The Document Ontology can be completely done through semi-structured mechanics, since we already know what document each chunk comes from, the author of each document, and the references between them.

💡 A student asked about open-domain extraction. Exploratory extraction is great early on when you are figuring out what ontology makes sense for your data. You can use zero-shot Named Entity Recognition (NER) models like GLiNER for that exploratory phase [4]. Which you can easily run locally without having powerful inference hardware. Without that discipline, the output becomes unusable noise within tens of documents. A constrained scope lets you swap the frontier model for a small fine-tuned extractor like Gemini Flash Lite, Claude Haiku or even better, use Liquid open-source models fine-tuned on your ontology.

These extraction modes feed directly into a five-component system that turns raw documents into queryable memory.

The Five-Component Architecture

The input consists of heterogeneous documents scattered across multiple silos. The output is a single queryable knowledge graph. The agent can search and write back to it via two tools.

Everything in between is plumbing built to serve that one job.

The data pipeline gathers from URIs, notes, emails and Google Drive. It normalizes everything into a document collection written to a warehouse.

The memory pipeline turns documents into knowledge-graph objects and writes them into the unified memory modeled as a KG.

The KG is the queryable artifact. The agent communicates with the knowledge graph via an MCP server that exposes search and write tools. If you are building in Python, choose FastMCP over the native MCP SDK, as it’s much easier to use and offers a better developer experience.

Image 5: The five-component architecture. Sources flow through the data and memory pipelines into the materialized knowledge graph. The agent talks to it through two MCP-exposed tools.

The search_memory family of tools brings only the slice the agent needs into the context window. The write_memory tools run the same data + memory pipelines on demand on a conversation or URI instead of running them in batch mode [5].

Ultimately, we connect the MCP server to a harness such as Claude Code or Codex, where we inject custom business logic on how the tools should be used through a family of assistant-memory and assistant-learn skills.

For 2-3 hop traversals, Postgres or MongoDB handle documents, vectors, and graph-lookup in a single piece of infrastructure [7].

Reach for Neo4j only when deep traversals or specialized graph algorithms are core to the product [8]. Or a good trade-off is to use it internally just for data exploration. Do not design for Google scale when you are processing thousands of documents.

The memory pipeline sits at the core of this architecture, transforming raw documents into the exact triplets the rest of the system queries.

The Memory Pipeline

The memory pipeline cleans the incoming document.

Next is optional chunking. If you can avoid chunking, avoid it. It introduces problems and is more about RAG-era reflexes than a necessity. You always have to customize the solution based on your data and try to introduce as little complexity as possible.

Next, the graph extractor emits triplets. You should use Pydantic-style schema descriptors so the LLM knows how each field should look.

Normalization is the most important step. You track the evolution of a single entity over time. Do not allow multiple versions of the same person to exist. The system re-uses the same canonical ID across extractions. New metadata and new relationships layer on top [9].

Finally, you embed the relevant fields for semantic search.

Now, let’s look at the core ways of data models you can use to store your KG.

Single Mutable Collection vs. Append-Only Log Data Models

There are two main approaches on how you can model your collections: as an append-only log or as a single mutable collection. Both have their pros and cons.

The append-only log consists of two collections: an append-only log and a queryable materialized view.

The system appends every event to an immutable log. A periodic materialization step squashes all events for the same ID into one canonical record.

You get versioning, temporality, and reversibility for free. You pay in RAM and operational complexity. As RAM is the most scarce and costly piece of hardware for hosting databases, this quickly translates into larger compute costs.

The single mutable collection approach drops the log. Each extraction directly upserts into the queryable collection.

You get simpler ops and real-time visibility, but the temporal audit trail is gone. Pick the single collection if operational simplicity and reduced costs beat time-travel.

Pick the two-collection append-only approach if you genuinely need an audit trail. Append-only collections never delete and never update. The same ID can appear multiple times across extractions, reflecting updates of an entity or relationship instance across the KG.

You can replay history up to a point in time, soft-delete, and revert a bad extraction. Materialization squashes all logs sharing an ID into one canonical entity.

An intuitive way of comparing the two methods is that the single mutable collection option is the same as the materialized view of the append-only option. Thus, one option comes with an append-only log, which comes with versioning and temporality, while the other doesn’t.

How Would This Look Within the Digital Twin?

Each log event lands with an auto-generated ObjectId plus a single chunk_id and source_document_id pinning it to one origin, with no embedding because nothing has been merged yet into the final instance. Materialization groups events by (name, type) for nodes and by the (source, kind, target) triplet for edges, swapping the ObjectId for a deterministic composite ID that is the merge key, unioning every contributing document into a sources array, and embedding each canonical entity once.

Image 6: The two-collection MongoDB shape. Left column shows the append-only log node and edge. Right column shows the materialized node and materialized edge.

Nodes and edges share a single collection, separated only by a kind discriminator. So within our MongoDB implementation, $graphLookup walks source_node_id → target_node_id recursively without joining across collections.

Image 7: The one-collection MongoDB shape. Nodes and edges coexist in a single collection, both keyed by deterministic string IDs.

A student asked about community detection and isolated nodes. Once materialization runs, the system computes communities over the canonical node collection. An isolated node is just a singleton community. Filter or keep it based on your use case. Postgres and MongoDB handle hundreds of millions of small records. They can also scale vertically easily through sharding by partitioning on the entity and relationship IDs.

Now, let’s finally understand how we can query the KG and plug it into an agent.

Finally...Let’s Understand the Retrieval Algorithm

During retrieval, we use a hybrid index.

Text search uses exact keywords. Semantic search is meaning-based. Graph search is a multi-hop traversal across the typed edges.

Communities are an optional fourth index for topical clusters.

Image 8: Top-down retrieval example for the query: “Create a presentation on GraphRAG for O’Reilly”.

GraphRAG retrieval is a two-stage move [10].

Stage 1 runs text and semantic search. It merges results with Reciprocal Rank Fusion (RRF). Apply a cutoff to get your entry points [11].

Stage 2 walks 2-3 hops across the typed edges to expand the result set.

During retrieval, GraphRAG’s addition over RAG is this multi-hop step, after the RRF merge, which is standard for most RAG systems.

Image 9: Two-stage retrieval. Text and semantic search feed RRF for entry points. From there, 2-3 hop graph traversal expands the result set.

Still, there are two important details to highlight. There’s bottom-up, which expands entities for depth, while top-down hops across communities for a high-level overview [2]. This translates to a trade-off between context size, latency and performance.

Image 10: Bottom-up vs. top-down GraphRAG. Both start at text and semantic search. Bottom-up expands entities for depth. Top-down hops across communities for a high-level overview.

Now, to close the loop, let’s connect everything to an agent.

The Cherry on Top: Agentic GraphRAG

GraphRAG becomes agentic when an agent gets to write to and search the knowledge graph autonomously [5].

Image 11: Agentic GraphRAG via MCP. The agent calls search and write tools exposed by an MCP server.

The agent dynamically writes queries against the materialized knowledge graph using a family of search_memory tools. The write_memory family of tools runs the data and memory pipelines on the current conversation or any other type of document. These tools are exposed to the agent via the MCP server, implemented in FastMCP.

This differs from the five-component architecture explained earlier: this time, the agent decides when to search/write to memory.

The search tools can directly implement the text + semantic + graph-search algorithm programmatically, or let the agent write the query code on-demand, which gives more flexibility at the cost of potentially less optimal code.

As for the write tools, allowing the agent to ingest the current conversation ensures continual learning by dynamically tracking the user’s preferences, to-dos, experiences and more.

At the moment, harnesses such as Claude Code use the filesystem to implement the memory layer. But as the data grows, gets more complex, or we have to operate under strict cost/latency requirements, we will need more powerful solutions than just hoping the agent will figure it out through progressive disclosure.

What’s Next

In this piece, I presented only the high-level architecture and strategies around GraphRAG.

The issue is that when you start diving into each component, such as normalization, extraction, embedding or data modeling, you will realize that everything is extremely custom to your own data and use case.

This is especially true because GraphRAG is still in its early days, where there is no clear plan of attack.

That’s why I am actively working on a new book on how to implement a personal assistant from scratch (yes, together with Maxime Labonne!), where we will explore building a memory layer stage by stage: RAG, then GraphRAG, with an AI Evals layer on top to measure the actual gain in performance when introducing GraphRAG. As soon as I have more details on this, I will let you know.

But here is what I’m wondering:

Are you using a single database (Postgres / MongoDB) or splitting graph and vector workloads across specialized systems (Neo4j + Pinecone)?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

Anthropic. (n.d.). Effective Context Engineering for AI Agents. Anthropic. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Larson, J. (2024, April 2). GraphRAG: Unlocking LLM Discovery on Narrative Private Data. Microsoft Research. https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Negro, A., Kus, V., Futia, G., & Montagna, F. (n.d.). Knowledge Graphs and LLMs in Action. Manning. https://www.manning.com/books/knowledge-graphs-and-llms-in-action
Neo4j Graph Data Platform. (n.d.). How Entity Extraction Works. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/
Monigatti, L. (n.d.). The Evolution From RAG to Agentic RAG to Agent Memory. Leonie Monigatti. https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html
Govindarajan, V. (n.d.). OpenClaw Architecture - Part 3: Memory and State Ownership. The Agent Stack. https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory
Iusztin, P., & Rodrigues, J. (n.d.). How We Killed Our RAG Pipeline.
Neo4j Graph Data Platform. (n.d.). Why Neo4j? Graph-Native Memory Architecture. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/graph-architecture/
Neo4j Graph Data Platform. (n.d.). Entity Resolution and Deduplication. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/
Hedden, S. (n.d.). How to Build a Graph RAG App. Towards Data Science. https://towardsdatascience.com/how-to-build-a-graph-rag-app-b323fc33ba06/
Arancio, J. (n.d.). Comment on Hybrid RRF Retrieval Pipeline. Substack. https://substack.com/@jeremyarancio/note/c-205294494
Liu, J. (2025, May 19). There Are Only 6 RAG Evals. jxnl. https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/
Zhang, B. (2026, January 22). Scaling PostgreSQL to Power 800 Million ChatGPT Users. OpenAI. https://openai.com/index/scaling-postgresql/
Govindarajan, V. (n.d.). OpenClaw Architecture - Part 2: Concurrency, Isolation, and the Invariants That Keep Agents Sane. The Agent Stack. https://theagentstack.substack.com/p/openclaw-architecture-part-2-concurrency

Images

If not otherwise stated, all images are created by the author.

What Held Up at 3 AM: One Engineer's RAG Case Study

Paul Iusztin — Wed, 29 Apr 2026 11:04:33 GMT

Most AI demos work. Most AI products don’t. This series is a collection of interviews with engineers who shipped AI agents to production, covering the stacks they chose, the architectures they regretted, and what actually held up at 3 am.

This is an interview with Michael Maximilien, former CTO and Distinguished Engineer at IBM and Chairperson of the Board of the NodeJS Foundation. Now, the founder and CEO of ClawMax.ai, an AI agent orchestration platform powered by OpenClaw and the creator of weave-cli, an open-source tool for shipping Retrieval-Augmented Generation (RAG) systems.

Watch our full interview on YouTube ↓

Michael Maximilien spent a year building RAG systems for customer after customer. Every new project required navigating dozens of moving parts. He had to pick a vector database, select an embedding model, chunk the data, ingest it, search it, and iterate.

“I was doing this a lot and I wasn’t getting the results I wanted.” — Max

The failures were concrete. Halfway through an ingestion run, Milvus would run out of memory. Two collections made it in. The third was broken. Without a checkpoint or resume function, he had to recompute everything from scratch.

“The experiment doesn’t just run, it fails. You have to be able to pick up from the failure.” — Max

Another failure mode involved manually comparing Weaviate against Milvus. One configuration typo could lead to drawing the wrong conclusion.

“You might end up thinking Weaviate is better than Milvus when actually your comparison was wrong.” — Max

This manual flywheel stole time from actually helping his customers ship their products. He burned days on reset and re-ingest cycles that failed halfway. Worse, he produced results he could not trust.

Most teams treat RAG as a simple setup task. They picked a vector database because it trended online. They pick an embedding model because OpenAI is the safe default.

They guess a chunking strategy, guess the top-K retrieval parameters, and ship it. Then they spend the next six months vibe-checking the system.

Users complain. The team swaps a configuration knob. Nobody knows if it actually helped because nothing was measured.

“There’s a lot of steps.” — Max

You lose the working system you thought you had. You burn weeks debugging silent ingestion failures because no trace exists.

Customer trust evaporates when the same question gets three different answers across releases.

Max took the opposite bet. He built Weave CLI: a unified command-line tool for RAG over eleven vector databases.

It features first-class observability implemented with Opik, an open-source evaluation and optimization tool, baked in from the first commit. You can try out their managed platform for free here for 25k spans/month.

By the end of this case study, you will understand how to unify your RAG stack so that switching a database, an embedding model, or an agent is merely a config change. You will learn how to measure everything, so every switch is tracked, evaluated and compared. Ultimately, you will learn how to benchmark your solution against multiple parameters to find the best configuration for your problem.

“There’s no one solution. You iterate and evaluate.” — Max

But first, let’s understand what Weave CLI is and how it works.

Understanding the System Architecture of Weave CLI

Weave CLI wraps 11 vector databases behind a single interface. From the outside, it looks and feels like any other RAG system. On the ingestion side, it populates the chosen vector database with chunks, metadata, and embeddings. On the query side, it takes natural-language questions and returns top-k ranked chunks that an agent can use to create an answer with citations.

What makes Weave CLI special is that everything is swappable via a configuration file: the vector database, the embedding model, the chunking strategy, the query agent, the RAG agent that interprets the chunks and so on. With the goal of making it very easy for you to benchmark, iterate on and improve your RAG solution.

Image 1: Weave CLI in one breath.

Weave CLI is composed of seven core components, each swappable by configuration.

The user-facing component is the Cobra-based CLI and the interactive REPL. Weave stack sits underneath as the deployment layer. It brings the whole system, the databases, up or down with a local Docker/Podman Compose fallback.

Behind that surface sits the intelligence layer. Ten built-in agents share an AgentChain sequencer. Agents are used both within the CLI and during ingestion. Weave CLI supports RAG, QA and summarization agents, but what’s more interesting is during ingestion. For example, you describe your data, the SchemaAgent proposes a collection schema and a vector-database fit, the ChunkingAgent recommends a chunking strategy and an embedding provider is picked to match. The Executor drives a seven-step orchestration covering query analysis, planning, user confirmation, execution, reporting, display, and evaluation metrics.

Image 2: High-level system architecture.

The data layer is built around the VectorDBClient interface in src/pkg/vectordb/interfaces.go. It is cleanly split into four sub-interfaces: CollectionOperations, DocumentOperations, QueryOperations, and SchemaOperations. A package-level factory registry in factory.go registers all 11 adapter sub-packages using the ports-and-adapters pattern.

Still... There is a trade-off to this design. A unified interface is a lowest-common-denominator by construction, so if you need PGVector’s transactional semantics or Neo4j’s native graph traversal as first-class features, a unified adapter costs you that expressiveness.

On top of the vector-database layer sit five embedding providers: OpenAI, sentence-transformers, Ollama, Cohere, and Voyage. The ingestion pipeline runs alongside these, handling file scanning, processing, and batching.

As of April 2026, Max has strong views on which vector database to choose. Weaviate is his default for cloud deployments. Pinecone is the pick for hosted solutions. OpenSearch covers self-hosted cloud. Milvus handles both local and cloud. Qdrant is his go-to for local use because its Rust implementation is low-memory and fast.

On top of the agent layer, we have the observability layer implemented using Opik and OpenTelemetry, along with an evaluation harness with four LLM judges. The evaluation harness is itself pluggable between a local evaluator and Opik.

The configuration is the source of truth for the whole stack. A config.yaml file holds the non-secret details of the vector database, agent, embedding model, and LLM, while secrets are loaded from a .env file. Check all the configs here.

Let us trace a query through the system end-to-end using a concrete example: a user asks about Leica Noctilux lens auctions. The flow unfolds across nine hops, each an Opik span. First, the user submits the natural-language query to the REPL, which immediately starts monitoring the trace with Opik.

Image 3: The data flow of the RAG query execution.

The QueryAgent then validates and classifies the intent, passing control to the PlanningAgent to generate an execution plan. Next, the VectorDB adapter performs a semantic search to retrieve relevant documents. The ContextBuilder filters, deduplicates, and sorts these results before handing them to the RAGAgent, which generates the final answer with citations. Finally, the REPL ends the Opik trace, in which every step emits a span containing details such as costs, latency and input/output.

Why Building in Go and Not TypeScript or Python

Most popular agentic CLIs / REPLs, such as Claude Code or OpenCode, are built in TypeScript. But Max, as a former Node.js board chairperson, strongly suggests just going with Go or Rust if memory constraints are a concern.

Why? Because Go apps ship as single binaries. Simple. Beautiful. It runs everywhere.

On-premise customers cannot rely on npm, uv, or JVM registries being reachable inside their own networks, and dependency pinning does not fix network isolation. A single compiled binary sidesteps the entire problem. Go’s track record as the language of Kubernetes is the existing proof that this trade-off works for infrastructure tooling, which is exactly the category Weave CLI sits in. Max himself spent 10 years writing Go on those systems.

Max’s second argument is that language choice matters less than it used to, because AI coding assistants lower the learning-curve barrier across the board.

“Most people don’t write code anymore.” — Max

Still... Max’s newer project (ClawMax.ai) is mostly in TypeScript because it is the best tool for the job, not because he switched allegiances.

“The stack decision has to be what your system wants, not what the herd is doing.” — Max

Next, we zoom into the layers doing the heavy lifting, the ingestion pipeline and the unified VDB interface.

Supporting 11 Vector Databases

The vector database layer is where the ingestion pipeline meets the unified VDB interface. To see how they work together, we’ll trace Max’s Leica Noctilux auction catalog through the system one step at a time.

Each document in the catalog is a single lens listing. It contains a photo of the Noctilux, a short caption with the model number and condition, a price, and a few lines of provenance. The text is sparse. Most of the signal sits in the image itself, and the caption is just enough to disambiguate one Noctilux from another. That sparseness drives a multi-modal ingestion decision up front. The image and the surrounding caption are embedded into two separate collections, one keyed on image vectors and one on caption text vectors. At query time, the auction agent fans out to both collections and merges the results through the ContextBuilder.

Before the actual ingestion, we run a FileScanner that walks the 426 listing files on disk, applying glob matching, exclusion filters, and SHA256 deduplication (src/pkg/pipeline/scanner.go). Re-running ingestion on the same directory skips unchanged documents, making this step fully idempotent and computationally cheap.

The DocumentProcessor extracts text and images from each listing (src/pkg/pipeline/processor.go). For the Leica dataset, the PDF extractor pulls the caption text, and OCR runs on the lens photo to catch any model number printed on the barrel. This step is idempotent but computationally expensive due to PDF parsing and OCR, and it fails if the document format is unsupported. Next, the ChunkingAgent dynamically selects the best chunking strategy for each document.

💡 Chunking is a tier 1 knob. Public benchmarking shows that swapping between recursive, sentence-level, and token-level strategies can move retrieval accuracy by double-digit percentages on the same corpus [1].

Next, we move to embedding (src/pkg/embeddings/model_registry.go). In the Leica flow, caption text flows through the text embedder, and image descriptors flow through a separate image embedding model. Raw images larger than per-backend limits (Milvus caps fields at 65KB) get offloaded to S3/MinIO, leaving only a URL in the VDB payload. The default option is to use OpenAI’s embedding model, which is highly expensive in compute and API costs and can fail if you hit rate limits. When scaling, you can use open-source embeddings via Ollama. They run locally with no API key.

The BatchWriter processes documents with durability, such as checkpoint and resume functionality. For example, when ingesting data at scale, you often have network I/O failures or database connection drops. Through checkpointing, we ensure the state is idempotent. Batch checkpointing is the difference between a short retry and a multi-hour rebuild.

“You have to recompute everything from scratch, which is crazy.” — Max

The VectorDBClient Interface sits at the core of the adapter pattern (src/pkg/vectordb/interfaces.go) used to support the 11 databases. The project started with Weaviate. Milvus was surprisingly similar. Qdrant was also very similar. MongoDB was a different beast, but the interface still fit.

“The biggest surprise was PGVector.” — Max

PGVector is the most incompatible on paper. Postgres is a relational database with its own migrations. Yet the unified interface fits.

The pipeline ends at any of the eleven vector databases (src/pkg/vectordb/factory.go), emitting a final vectordb.adapter span. The 426 Leica listings are split into roughly 426 caption vectors in one collection and 426 image vectors in a parallel collection, both sharing listing IDs as the cross-reference key.

Image 4: The data flow of the document ingestion pipeline.

These steps cover every component any production ingestion pipeline needs, and Weave CLI ensures each one is swappable by configuration (src/pkg/stack/ingest.go): the FileScanner, the DocumentProcessor, the ChunkingAgent, the embedding provider, the BatchWriter, the VectorDBClient interface, and the concrete VDB adapter.

During retrieval, when a user asks weave query "summarise the 2024 auction catalogue", the QueryAgent classifies the intent, the PlanningAgent decides to hit both Leica collections, and the VectorDB adapter runs a semantic search on each. The ContextBuilder then merges the image-collection hits with the caption-collection hits, deduplicates by listing ID, sorts by relevance score, and extracts content in priority order (caption text first, image metadata second, URL fallback last) into a single prompt for the RAGAgent.

The ingestion pipeline and VDB interface are the skeleton of Weave CLI. The agent layer is what makes it feel like Claude Code for vector databases.

Zooming into the REPL

Weave CLI provides a Claude-Code-like experience for vector databases, which, at its core, is a Read-Eval-Print Loop (REPL) environment hooked up to multiple agents.

Image 5: The Agent Layer up close.

Weave CLI ships with 12 built-in agents that you configure via YAML. Three of them are user-facing:

Precise QA — asks a question and answers it, and says it cannot answer when it lacks information. Zero hallucination tolerance.
RAG — finds the closest chunks and generates an answer over them. This is the default.
Summarize — produces a short summary of retrieved chunks.

💡 The beauty is that you can add or modify them as you please.

The next eight agents power the Claude-Code-like orchestration loop: the QueryAgent for intent classification, the PlanningAgent for the execution plan, the WeaveAgent for tool execution with retries, the BashAgent for safe execution, the RAGAgent that the RAG persona dispatches to, the OutputAgent to format progress, the ReportAgent to generate operation reports, and the EvalAgent to track metrics.

The final two are domain helpers used during ingestion: the ChunkingAgent and the SchemaAgent.

Similar to the vector database layer, all the agents implement the same interface:

Let’s tie everything together. When you ask a query, the QueryAgent classifies intent and acts as a router. The PlanningAgent generates a plan of CLI commands. The BashAgent executes them and pipes the output through a command-line JSON processor for filtering. The OutputAgent formats the result. This is the Claude-Code-like loop in action.

The cherry on top is that the Weave CLI capabilities are also exposed as a Model Context Protocol (MCP) server. Thus, instead of using the Weave CLI directly, you can leverage its full functionality through your harness of choice (Claude Code, Codex, etc.).

Twelve agents, eleven databases, five embedding providers, and multiple chunking strategies create a lot of surface area. Opik is what makes the whole thing observable when something breaks.

Monitoring the System

With so many moving parts, you need to know the system is working. Opik is how Weave CLI answers that question: it traces every LLM call, every agent step, and every database write as an OpenTelemetry span.

“Using Opik to tell me how many LLM calls, tokens, and cost per query.” — Max

During development, Max tracked a bug in which documents appeared to be ingested but were never persisted to Milvus. The Opik trace waterfall showed the database flush operations were silently timing out.

💡 If you want to try it out, you can create an account for free on Opik’s managed platform here for 25k spans/month.

The fix was adding dedicated timeout contexts per collection. Without the trace, this would have been a multi-day hunt through logs.

Image 6: Opik turns the RAG pipeline into a measurable waterfall.

The integration provides cost and latency visibility per trace. You see tokens and dollars per query without writing custom logging. It provides a latency breakdown.

Image 7: Opik’s monitoring dashboard.

Finally, it provides error visibility to make silent failures loud.

How hard was it to integrate Opik into Weave CLI?

“It’s a very straightforward integration — I pass all queries to the LLMs through Opik via OpenTelemetry, and then I query Opik to aggregate cost from the start of the command to the end.” — Max

Every step in the ingestion and retrieval data flows emits a span (src/pkg/llm/opik.go), which are aggregated under traces containing all the steps between a user request/response.

It includes the query, the LLM reasoning, the tool calls, and the final response. The executor initializes Opik tracing here (src/pkg/executor/executor.go).

Monitoring helps you debug your system. Evaluation moves everything forward, allowing you to quantify your application’s performance.

Evaluating the Default Setup

How do you know your agent is actually better after you swap an embedding model, a vector database or your chunking strategy? You need a good evaluation practice.

“My customers always have five or six questions they ask every release to sanity-check the system. They know what to expect. So I took their QA questions and made them the baseline eval dataset.” — Max

Evaluation datasets come from real user behavior anchored in your business use case, not from standardized, generic benchmarks. If you do not have users yet, you should compile a small set of sanity questions a domain expert would actually ask.

How does this work in Weave CLI?

You start by defining an evaluation dataset in YAML format. This includes the query, expected answer, expected citations, and a minimum relevance score.

Here is the full baseline.yaml file. Or this is how it looks in Opik:

Image 8: Opik’s dataset dashboard.

Then you pick an evaluator harness that includes a set of metrics to evaluate against. This harness is itself pluggable: you pick between a local evaluator and Opik (src/pkg/evaluation/provider.go).

We use two families of evaluators. Rule-based evaluators use regular expressions, exact matches, and citation presence (src/pkg/evaluation/custom_evaluator.go) to compute metrics such as CitationMatching for the RAG agent.

They are fast, deterministic, and free. You use them for structural checks.

The second family uses an LLM as a judge. Weave CLI ships four of these judges (src/pkg/evaluation/provider_opik.go). They evaluate Accuracy, Faithfulness, Hallucination, and Context Relevance.

They are slower and cost tokens. You use them for semantic quality.

“The hallucination, citation, and accuracy metrics are all from Opik’s library — I ported them to Golang.” — Max

💡 One key step most people forget is to align the LLM judge with the human expert. In our use case, the correlation between an LLM judge’s faithfulness score and human judgment hovers around 0.55. Judges are a signal, not a ground truth. For example, on average, I spent three weeks labeling a few-shot examples and computing agreeability scores before I trusted my own judgment.

Then, you run the evaluation command against a chosen agent. Finally, you compare the result of the experiment with the previous run. Each pair of agent and dataset is one experiment (src/cmd/eval/run.go).

Image 9: The evaluation spine.

The --use-opik flag ships every trace and evaluation result to Opik (src/pkg/evaluation/runner.go). Once in Opik, you get dataset management and experiment comparison.

Image 10: Opik’s experiments dashboard.

Scoring every run forces a decision on which agent to ship. Benchmarking on top of your custom datasets provides a structured way to choose a parameter, such as your chunking strategy or top k results, without guessing.

Benchmarking and Optimizing the System

An experiment is a single parameterized run over an agent, dataset, embedding, chunking strategy, database, and judge. A benchmark is a structured set of experiments.

You hold most variables constant to isolate the effect of one. Benchmarking is how you turn random runs into a parameter- and prompt-search problem. This is often known as the optimization flywheel.

“That’s the reason I created Weave CLI. Because this is tedious, but also error-prone.” — Max

Every benchmark is one configuration typo away from drawing the wrong conclusion. Disciplined benchmarking catches that error.

Experiment metadata guarantees reproducibility. Every experiment records the database, embedding model, chunking strategy, dataset, and everything else required to reproduce it. That’s usually the whole config.

Opik tracks this out of the box. Without it, a benchmark from four weeks ago is useless.

Image 11: Opik’s experiment dashboard.

When working on RAG systems, the optimization flywheel involves resetting the database, re-ingesting data with new parameters, re-evaluating, and comparing on your metrics of choice.

“Benchmark is comparing multiple agents side by side. Same dataset, different agents — and each (agent, dataset) combination is its own experiment, you can compare later with its metadata.” — Max

You fix a baseline dataset and hold it constant. You vary one axis, typically the agent. You score against multiple metrics.

Each pair of agent and dataset is one Opik experiment. You compare them side-by-side to spot regressions and unexpected wins.

Image 12: The optimization flywheel.

You trigger this loop via the command line with weave eval run --dataset baseline --agents precise-qa,rag,summarize --use-opik. Every subsequent benchmark streams into the same Opik project.

Max ran this loop for his Leica auction customer. He held the dataset and agent constant.

He varied only the embedding provider. He tested OpenAI against sentence-transformers. The open-source model won on quality by 11 percent.

It was 240 times faster for re-embedding. The vectors were 50 percent smaller, and the cost was zero.

This is a counterintuitive outcome. Without a structured benchmark, Max would have defaulted to OpenAI and been wrong.

How to Keep the Flywheel Under Control?

This optimization process involves running your ingestion and retrieval hundreds of times. Which can get costly fast. Super fast. The ingestion checkpointing makes it affordable.

Still, you should optimize your system in order of cheapest-to-change, biggest-win-first [8]. First, tune retrieval parameters like top-K. They are free to change and often provide the biggest wins.

Second, tune the embedding model. It is the cheapest component to swap and has a huge impact. Third, tune the chunking strategy. It requires re-ingestion but offers moderate quality gains.

Finally, tune the vector database. It has the highest switching cost and usually the smallest difference in quality.

The optimization flywheel effectively isolates variables, but it remains a manual process today.

The good news is that Weave CLI is heading toward full automated hyperparameter optimization across databases, embeddings, and chunking strategies. Just imagine. You will launch it before the weekend, and it will return on Monday with the best configuration for your dataset.

💭 P.S. If you want to use Weave CLI but think it’s missing a feature, Max is more than pleased to add it. Just open a PR/issue on the repository.

You can reproduce this benchmark step by step on your own stack by following this doc.

Watch our full interview on YouTube for all the 3am stories ↓

Final Thoughts

Looking back, what was the hardest thing to implement, and what surprised you the most while building weave-cli? — Paul

The hardest part was designing a unified VectorDBClient that felt natural across 11 providers with wildly different APIs. The adapter pattern was the insight that made it work.

The biggest surprise was benchmarking OSS embeddings against OpenAI on the client’s data and finding them 11% higher quality, 240x faster, and free. A call we’d never have made without evals in place.

If you had to rebuild Weave CLI from scratch, at what point would you introduce monitoring and evaluation? Would you do it earlier, later, or at the same time? — Paul

I’d introduce monitoring from day one. Having Opik traces during the early vector DB work would have immediately surfaced issues such as the silent Milvus persistence failures, which we debugged manually. As for evals, I’d keep at the same stage (after the core RAG pipeline was functional), but I’d design the harness interface up front for citation tracking and confidence scoring.

Opik was easy to integrate and was key to getting the client dashboard working, since I could just run experiments and use evaluations and tracing to decide on the best options for the client.

Now, your next practical step is to experiment with Weave CLI on a real problem. Point it at 100 documents you want to do RAG on, ingest everything into two collections with two different embedding providers, and run the benchmark against the baseline evaluation dataset.

You can follow the step-by-step tutorial from here

But here is what I’m wondering:

While building your latest RAG system, what was your strategy to find the right parameters, such as the embedding model, chunking or retrieval strategies?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Thanks again to Opik for sponsoring this case study and keeping it free!

Try Opik for free here (25k spans/month free)

If you want to monitor, evaluate and optimize your AI workflows and agents:

Try Opik for free

References

Chroma. (n.d.). Evaluating Chunking Strategies for Retrieval. Chroma. https://research.trychroma.com/evaluating-chunking
OpenTelemetry. (n.d.). Traces & Spans specification. OpenTelemetry. https://opentelemetry.io/docs/concepts/signals/traces/
Husain, H. (n.d.). Creating a LLM-as-a-Judge That Drives Business Results. Hamel Husain. https://hamel.dev/blog/posts/llm-judge/
Husain, H. (n.d.). Escaping POC Purgatory: Evaluation-Driven Development for AI. Hamel Husain. https://hamel.dev/blog/posts/evals/
Liu, J. (2025, May 19). There Are Only 6 RAG Evals. Jason Liu. https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/
Comet. (n.d.). Opik — LLM observability & evaluation platform. GitHub. https://github.com/comet-ml/opik
Yan, E. (2024, August 18). Evaluating the Effectiveness of LLM Evaluators (LLM-as-Judge). Eugene Yan. https://eugeneyan.com/writing/llm-evaluators/
Liu, J. (2024, February 28). Levels of Complexity: RAG Applications. Jason Liu. https://jxnl.co/writing/2024/02/28/levels-of-complexity-rag-applications/

Images

If not otherwise stated, all images are created by the author.

Stop Orchestrating AI Agents. Use Ralph Loops Instead.

Paul Iusztin — Thu, 23 Apr 2026 11:02:54 GMT

When building Brown, my writing assistant, I designed five specialized LLM nodes. One handled the introduction, another wrote sections, and others managed the conclusion, title, and editing. It became complicated, slow, and expensive.

I eventually collapsed the system into two agents: a writer and a reviewer operating in a loop. The simpler version performed better. The model retained the full context, and verification became a simple review step rather than a massive orchestration problem.

Most AI teams hit this exact wall. Developers spend more time babysitting AI than engineering, copying error logs and re-prompting models. The real bottleneck is the human.

Three failure modes explain why.

First, context rot. In long AI conversations, the context window becomes a junk drawer. Every failed attempt piles up until the sliding window drops the original specification. The model slides into a “dumb zone” where it hallucinates and forgets its goals. Traditional fixes like summarizing break down over dozens of reasoning rounds.

Second, premature exit. AI agents declare victory too early. Anthropic’s research notes that agents usually look around, see that progress has been made, and declare the job done [1]. Standard ReAct loops inherit the flaw.

Third, single-pass fragility. One prompt, one context, one shot. When it fails, the failure is chaotic. Jumping to multi-agent orchestration introduces distributed systems nightmares.

Ralph loops break the cycle by making “try again with fresh eyes” the default. Named after Ralph Wiggum from The Simpsons, the pattern wipes the conversation, reloads the full specification fresh each iteration, and uses the filesystem and git as the memory layer.

Image 1: The top shows context accumulating until the model forgets. Bottom shows the state living on disk, where each turn starts clean.

They remove the AI’s ability to grade its own work, using objective signals such as passing tests or linters to call the job done. Boris Cherny, creator of Claude Code, states that giving Claude a way to verify its work increases quality two to three times [2].

One model. One loop. One verification signal. Failure becomes predictable, the loop catches errors and re-prompts automatically, creating a relatively deterministic feedback loop that will 10x the quality of the agent.

Image 2: The Ralph loop. One model, one task per iteration, filesystem and git as memory, objective verification as the only exit.

Now, let’s look at what Ralph loops are and when you can actually use them in practice.

Go Deeper Into Production AI Engineering (Product)

Ralph loops prove that most of the leverage lies in the harness, not the model. If you want to master how to design, verify, and ship those AI harnesses in production, check out my Agentic AI Engineering course, built with Towards AI.

34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.

Rated 5/5 by 300+ students. The first 6 lessons are free:

Start here

What Ralph Loops Are and How They Work

Geoffrey Huntley named the pattern after Ralph Wiggum from The Simpsons, noting the character tries the same thing over and over until it works. Huntley’s motto captures the philosophy: the technique is deterministically bad in an undeterministic world. The simplest implementation is a bash while-true loop that pipes a prompt file into the agent forever, acting as a continuous harness pattern [3], [4].

As Einstein reportedly said: “Insanity is doing the same thing over and over again and expecting different results.” Well... I am sure he didn’t predict the rise of Claude Code, because that’s exactly what Ralph loops are all about.

Models are stochastic, strong at reading large contexts but imperfect on first pass. Re-running the same instruction forces self-review. The first iteration produces good but flawed output.

During the second pass, the model spots what it missed and refactors. The third iteration handles cleanup. Huntley delivered a minimum viable product quoted at 50,000 for just 297 in tokens using a single Ralph loop: a 170x cost reduction over the human estimate [3].

Image 3: Single-pass fails chaotically. Multi-agent deadlocks on shared state. Ralph loops isolate one task per iteration and uses verification as the exit gate.

You can run these loops in two modes. Shared context keeps the session alive for explicit self-review. Fresh context starts a new session each iteration, removing confirmation bias.

The model sees only the repository and the skill file.

Nowadays, it’s common to replace brittle n8n workflows with a single Claude Code skill. This becomes even more powerful when running in a Ralph loop, especially because at the end of each run, you can take the signal and tell the model to update the skill with anything it should have done differently.

The skill evolves and quality improves automatically [5].

This self-improving mechanism reduces manual prompt tuning when applied to specific, repetitive engineering tasks.

Three Real-World Use Cases

In practice, Ralph loops don’t have a clear implementation pattern. They are more of an intuitive strategy you can get creative with. Thus, you have multiple ways of implementing them.

From my experience with Claude, you have three options for running Ralph loops, from highest abstraction to lowest:

/ralph-loop plugin — the fastest path. Install it, run /ralph-loop in your session, and it manages the cycle for you.
/loop command — Claude Code’s built-in scheduler. /loop every 1 minute /your-skill fires the skill on a schedule [6].
while true bash loop — the most primitive form. A one-liner that pipes a prompt file into the agent and restarts it forever.

Because Claude Code keeps state through the files it’s working on, it retains context from the failed attempt and reads its own git diffs. Each iteration learns from the last.

Image 4: The Stop Hook turns objective signals into the loop’s only exit condition.

Implementing a ticket backlog with test-driven development

You can set up a ticket folder with numbered text files. Run a while-true loop that tells Claude to implement the next most important ticket using Test-Driven Development (TDD). The model writes tests first, writes the code, commits the changes, and moves on.

Claude reads all tickets, skips completed ones, picks the next priority, implements it, marks it done, and commits. No dependency graph is needed because the model decides the ordering on the fly. One dumb loop acts like a relentless single-threaded engineer working through the backlog.

For example, set up a doc/tickets folder with numbered tickets (001, 002, 003...). Each describes a feature or fix. Then run:

while true; do
  claude "implement the next most important ticket using TDD principles from doc/tickets. commit when done"
done

Or use Claude Code’s built-in loop:

/loop every 1 minute
build the next ticket from doc/tickets using TDD, run tests, commit when done

Adding test coverage

You can set a concrete goal to raise coverage from 16 percent to 95 percent. The loop reads coverage metrics, writes tests for uncovered functions, runs the suite, identifies gaps, and iterates.

The coverage report provides the objective backpressure. The loop does not stop until the numbers validate success. Each iteration chips away at untested code paths until the threshold is met.

The implementation is as easy as:

while true; do
  claude "analyze coverage gaps, write tests for uncovered functions, run the test suite, fix failures. stop when coverage exceeds 95%"
done

Framework and dependency migrations

Migrations require crisp completion criteria. Upgrading React v16 to v19, Next.js 14 to 15, or migrating Jest to Vitest demands a clean build and passing tests. The agent swaps syntax, updates dependencies, and runs build commands.

It uses compiler errors and failing tests as feedback. Each cycle fixes a batch of errors until the toolchain confirms the code is clean. Deterministic verification signals make framework migrations the perfect Ralph loop candidate.

These are three concrete starting points. Before you wire the first one up, there is one honest limit you should know.

What’s Next

Ralph loops are the starting point. Once comfortable, add self-improving skills that update their instructions after each run, wire stop hooks for objective quality gates to avoid infinite loops, and connect to external systems like Linear or GitHub Issues so the loop reacts to new work automatically.

The pattern scales further than it looks. OpenAI’s Codex team shipped one million lines of code across 1,500 pull requests with zero human-written code using what they call a “Ralph Wiggum Loop” [7].

These loops are safe when repo-contained and the toolchain acts as the judge. They get dangerous with irreversible side effects outside the repo. Alexey Grigorev learned this when a Claude Code agent ran terraform destroy on DataTalks.Club’s production infrastructure, wiping the database, VPC, and all automated snapshots — two and a half years of data gone in one iteration. If your loop can destroy shared state, review every plan manually [8].

What is the first piece of work in your repo you would trust a Ralph loop with? You could choose a TDD backlog, a coverage ramp, a framework migration or what else?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

Anthropic. (n.d.). Effective Harnesses for Long-Running Agents. Anthropic. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Cherny, B. (n.d.). I’m Boris and I Created Claude Code. X. https://x.com/bcherny/status/2007179832300581177

Huntley, G. (n.d.). Ralph Wiggum as a “software engineer”. Geoffrey Huntley. https://ghuntley.com/ralph/
LangChain. (n.d.). The Anatomy of an Agent Harness. LangChain Blog. https://blog.langchain.com/the-anatomy-of-an-agent-harness/
Parsons, C. (n.d.). Ralph Loops: Build Dumb AI Loops That Ship. AI Engineer. https://read.readwise.io/read/01kp5bgy8b07y256ythvkz2tt7
Anthropic. (n.d.). Claude Code: Best Practices for Agentic Coding. Anthropic. https://www.anthropic.com/engineering/claude-code-best-practices
Lopopolo, R. (n.d.). Harness engineering: leveraging Codex in an agent-first world. OpenAI. https://openai.com/index/harness-engineering-codex/
Grigorev, A. (n.d.). How I Dropped Our Production Database and Now Pay 10% More for AWS. Alexey Grigorev. https://alexeygrigorev.com/posts/dropped-production-database/

Images

If not otherwise stated, all images are created by the author.

Karpathy Named It. I Built One on My Notes.

Paul Iusztin — Tue, 21 Apr 2026 08:00:51 GMT

I’ve been building what Andrej Karpathy calls an LLM Knowledge Base on top of my private data for the past few months — without realizing that was the name for it. Now, seeing it’s such a hot topic, I want to share my own twist on it. Similar to Andrej’s design, but still very different in how I approach the problem.

I keep my notes in Obsidian, my reading in Readwise, and my topical research in NotebookLM. Each tool is excellent in isolation, but no AI can reach across all three.

Whenever I reach for a general-purpose deep-research tool like Perplexity or Gemini Deep Research, it just searches the public web. Every user gets the exact same sources, and the resulting article reads like everyone else’s. What I actually want to research is my own curated thinking.

I want to leverage the books I highlighted, the notes I wrote, and the transcripts I dumped into NotebookLM. That is the edge. That is the signal nobody else has.

To solve this, I built a deep research agent as three Claude Code skills. The /research_create, /research_search, and /research_distill skills run on top of my private data via the obsidian, readwise, and nlm command-line interfaces (CLIs).

The system uses multi-round query expansion with gap analysis between rounds. It outputs a memory/ folder with an index.yaml file that acts as a progressive-disclosure wiki over the source files. We also apply post-processing, including deduplication and re-ranking, to keep the result focused.

There is no vector database and no Retrieval-Augmented Generation (RAG) pipeline. We use the filesystem as state and Markdown, YAML, and JSON as the wire format. If you already keep notes in Obsidian, articles in Readwise, or research in NotebookLM, this is for you.

By the end of this article, you will know exactly how it works, see it run on this very article, and have a blueprint to build your own.

Image 1: From three scattered tools to a queryable research memory to a grounded article. This is the end-to-end loop in one frame.

Here is the system at a glance. We will look at the three skills, three CLI adapters, and one memory folder, before we open the heaviest skill in the next section.

Your Path to Agentic AI Engineering for Production (Product)

The three-skill + memory-folder pattern in this article is one slice of harness engineering. If you want to master the rest, such as orchestration, context engineering, evals, and production deployment, check out my Agentic AI Engineering course, built with Towards AI.

34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.

Rated 5/5 by 300+ students. The first 6 lessons are free:

Start here

Three Skills, Three CLIs, One Memory Folder

The system relies on three distinct skills. First, /research_create builds a memory/ folder from scratch for a given topic or brain dump. Second, /research_search handles the read side, letting any future agent query an existing memory/ folder via index.yaml with progressive disclosure.

Third, /research_distill takes a finished piece of content and extracts only the sources that were actually used into a single portable research.md appendix.

Image 2: The system at a glance. Claude Code orchestrates three skills that wire CLI adapters into a single memory folder.

The memory/ folder is built around index.yaml. It holds metadata per source, including uri_highlights, uri_full, original_path, and origin. The LLM reads the index first, then picks three to five relevant files based on summaries and reads those directly.

Image 3: The memory/ folder on disk — index.yaml alongside each source’s key-highlights and full-document files.

There are no embeddings, no chunking, and no vector store to maintain, ensuring references stay perfectly traceable. Like OpenClaw, we treat memory as plain Markdown in the agent workspace, where files are the source of truth and the model only remembers what gets written to disk [1].

The Obsidian, Readwise, and NotebookLM files act as the raw, immutable data. We touch them manually as humans, never through this pipeline. On top of that, /research_create produces a local actionable knowledge base for a specific scope, resulting in an ephemeral memory/ folder per topic.

This separation allows the same raw data to feed many different research projects without contamination. The key invariant of this architecture is that the orchestrator never loads source files. Researcher subagents touch the raw files, while the orchestrator only ever sees structured JSON summaries flowing between steps.

We chose CLIs over Model Context Protocol (MCP) servers for three reasons. First, token economics. A skill enters Claude Code’s context at boot at ~100 tokens of metadata, and the body loads only when invoked.

By comparison, Notion’s MCP server dumps roughly 20,000 tokens of self-documenting tools at startup whether you use them or not. That is roughly 200× less context before you have done anything [2].

Second, CLIs compose with bash. The orchestrator can pipe results through tools like jq or redirect output straight to a file, whereas MCP tool calls must round-trip through the LLM.

Third, Markdown is the native language of LLMs. As Simon Willison argues, Markdown with YAML frontmatter is more in the spirit of LLMs than MCP, because you put text in the context and let the LLM pick [3].

Image 4: A skill enters context at ~100 tokens of metadata. An MCP server dumps ~20,000 tokens of self-documenting tools, whether you use them or not.

That is the whole architecture. Now let’s open up the heaviest of the three skills, /research_create, and watch the multi-round research loop in detail, where the orchestrator-never-loads invariant earns its keep.

How `/research_create` Works

The process starts with a brain dump from the user, which can include text, URLs, or local file paths. During the deep research, you confirm three configuration knobs in one prompt: the number of rounds, queries per round, and a topic slug. Seed URIs from the brain dump always land in the output with a relevance score of 1.0, bypassing reranking because they are your explicit picks.

The orchestrator generates queries and dispatches one researcher subagent per query in parallel. Each researcher runs platform-specific searches. For Readwise, this means querying the library, feed, highlights, and document notes. For Obsidian, it means querying the local vault files. For NotebookLM, it means querying the projects and their associated sources and notes.

For Obsidian, we found that using its CLI — which leverages its index — is 10× more efficient than letting the LLM roam around your vault.

The subagent does its own within-agent deduplication by original path. It captures metadata while files are open. It also caps output at a top-15 limit of unique findings.

Between rounds, a gap_analyzer subagent reads the deduplicated findings via jq without full reads. It flags thin or missing themes against the initial key themes and emits the next round’s queries. After all rounds, a reranker subagent scores every candidate between 0.0 and 1.0 using the cheapest sufficient signal. It checks metadata first, then reads the head and tail of the doc, and uses full reads only as a last resort.

Finally, a builder subagent invokes a Python script to emit the YAML deterministically, placing seeds first, then descending by score.

Image 5: The full /research_create pipeline. The orchestrator schedules. Subagents do the heavy reads.

We use this shape because context isolation is our central design choice. Every step that touches real source content runs in an isolated subagent with its own context window. The orchestrator only sees the compacted metadata of each file, while moving the actual file using mv bash commands into the memory folder.

The index.yaml file holds pointers and metadata for every file in the wiki. The orchestrator holds pointers, while subagents hold content. Geoffrey Huntley, creator of Ralph Loops, states that your primary context window should operate as a scheduler, scheduling other subagents to perform expensive allocation-type work [4].

Image 6: The top of an index.yaml file — topic, input summary, and the first few source entries with their summaries and relevance scores.

Subagents compress tens of thousands of input tokens into 1,000–2,000 output tokens before handing back to the orchestrator. That compression ratio is the whole point. The researcher subagents read deeply, and the orchestrator stays light.

Once the memory/ folder exists, anyone can read it without loading source files. We use the /research_search skill to query this index.

How `/research_search` Works

The /research_search skill handles the read side of the system. Any agent can be handed a memory/ folder and query it without loading source files into context. The skill encodes the protocol once so future agents do not have to re-derive it.

The system uses three layers of detail. Layer 1 is the summary field in index.yaml. It contains two to three sentences per source and is always loaded as part of the index. It is enough to answer what you have on a topic or build a table of contents.

Layer 2 is the key-highlights file, which holds the condensed topics of a file. This is extremely powerful when using reader tools such as Readwise, as these highlights are made manually by you, the reader, consisting of huge signal. Thus, not every source has this layer. It’s better not to have it at all than to have an LLM extract it.

Layer 3 is the uri_full file, representing the complete original document. You read it only when key highlights are insufficient or inexistent.

Image 7: Three layers of detail per source. The agent stays at Layer 1 unless it has a reason to descend.

Anthropic notes that models are great at navigating filesystems, and presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front [5]. That maps exactly onto index.yaml plus lazy key-highlights loading.

Intuitively, the index.yaml file gives us progressive disclosure — the same pattern used inside skills — so the agent can choose from many options without drowning in information [6].

The agent slices index.yaml by origin, location, relevance-score threshold, tags, author, publication, date range, or NotebookLM notebook.

The most beautiful part? Because index.yaml is structured data, the agent writes code on top of it. It uses jq filters, Python sorts, and awk projections.

Image 8: A single source entry in index.yaml — origin, authors, relevance score, and the URIs that power progressive disclosure.

LlamaIndex’s head-to-head benchmark proves this scales. A filesystem-explorer agent beat a hybrid vector RAG pipeline on correctness (8.4 vs 6.4) and relevance (9.6 vs 8.0) at a sub-60 document scale, precisely because the LLM saw whole files instead of chunks [7].

Portability comes for free. Hand the self-contained memory/ folder to any agent, and they get up to speed instantly. Search lets agents find what is there. But once you have drafted an article, you also need to know what part of the wiki you actually used within your piece. That is /research_distill.

How `/research_distill` Works

Given any piece of content and the memory/ folder used during writing, the skill walks every source in index.yaml. It decides whether the content actually used it by checking for explicit references or traceable ideas. The process is conservative by default. It is better to miss a borderline source than include one that was not actually used.

The output is a single research.md file. It is fully self-contained, meaning you never need to go back to the memory/ folder again. For this very article, /research_distill should match around 15 to 20 of the 62 sources in the memory folder.

This matters because downstream generation loops re-load the research on each iteration. For example, within the evaluator-optimizer pattern, the system generates, critiques, and revises [8]. Keeping the anchor research small is the difference between an article that stays grounded and one that starts hallucinating.

As I explained in my article on Recursive Language Models (RLMs), when the corpus fits in context with progressive disclosure, fancy retrieval is overkill [9].

What’s Next

For personal-scale research involving hundreds of sources, a well-structured memory/ folder with an index.yaml beats a RAG pipeline on every axis. It gives you full lineage back to source URLs, portability to pass the folder to any agent, and lower costs with no embedding model or vector store.

To further optimize the system, making it more context-efficient, I am considering moving the deduplication and re-ranking fully into Python scripts, adding a local cross-encoder reranker to avoid LLM calls for scoring, and extending the researcher with tag-aware filtering.

But here is what I’m wondering:

What data source in your work makes you most want a private deep research agent? Is it your Obsidian vault, your Readwise library, a code repository, or your team’s shared documents?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

Govindarajan, V. (2026). OpenClaw Architecture Part 3 - Memory and State Ownership. The Agent Stack. https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory
Talebi, S. (2026). Claude Skills Explained in 23 Minutes. YouTube. https://youtube.com/watch?v=vEvytl7wrGM
Bowne-Anderson, H. (n.d.). Episode 70: 1,400 Production AI Deployments. Vanishing Gradients Podcast. https://read.readwise.io/read/01kh8p44e70a1273g7ykgx7h5y
Huntley, G. (n.d.). Ralph Wiggum as a “software engineer”. ghuntley.com. https://ghuntley.com/ralph/
Anthropic. (n.d.). Building More Efficient AI Agents. Anthropic Blog. https://www.anthropic.com/engineering/building-more-efficient-ai-agents
Griciūnas, A. (n.d.). Agent Skills: Progressive Disclosure as a System Design Pattern. SwirlAI Newsletter. https://newsletter.swirlai.com/p/agent-skills-progressive-disclosure
LlamaIndex. (n.d.). Did Filesystem Tools Kill Vector Search?. LlamaIndex Blog. https://www.llamaindex.ai/blog/did-filesystem-tools-kill-vector-search
Anthropic. (2025). Building Effective AI Agents. Anthropic Blog. https://www.anthropic.com/engineering/building-effective-agents
Iusztin, P. (n.d.). Your RAG Pipeline Is Overkill (RLMs). Decoding AI Magazine. https://www.decodingai.com/p/your-rag-pipeline-is-overkill-rlms

Images

If not otherwise stated, all images are created by the author.

How to Ship a Weekly Article in One Day

Paul Iusztin — Wed, 15 Apr 2026 14:10:24 GMT

I publish one in-depth technical article every week on Decoding AI. That cadence sounds simple until you live it. The article itself eats up the time I should be digging into the Claude Code leak to understand how it works under the hood.

I am a builder first, not a writer. Most weeks, the writing strangles the building. This is the exact trap most weekly writers fall into.

When the writing eats the week, the next article has nothing real underneath it. So writers fill the gap with generics, surface-level takes, or invented examples that add noise to an already noisy internet.

The default fix most people reach for is to let AI write it. That fails for the opposite reason. When you put zero thought into the process, AI just industrializes the noise.

The whole point of writing is to share something you actually thought through, built, and learned. If AI writes for you, you publish nothing of value. If you write everything by hand, you don’t have enough time to build something worth publishing.

Both ends starve the loop that actually feeds the business: research, build, and teach.

What I built instead is an agentic AI workflow that automates ~90% of the manual writing pipeline while keeping me as the irreplaceable seed at the top. I provide the research direction and the brain dump that reflects my personal experience.

AI handles distribution speed. I handle thought, taste, and direction. By the end of this article, you will see exactly how the system works.

Image 1: The full pipeline at a glance. Human seed on the left, automated components in the middle, published article on the right.

We will cover my deep research agent, writing workflow with its evaluator-optimizer loop, image style-transfer step, and title & SEO generator. And for the most important part, you will learn where the human-in-the-loop is irreplaceable.

My Workflow: What Stays Human, What Gets Automated

Before showing any architecture, I want to walk you through the manual workflow exactly as I used to run it. This is the boring, honest version. This is what every weekly technical writer secretly does, even if they pretend otherwise.

I used to research the topic for hours or days while taking notes. Then, I would write a high-level outline of the piece. Next, I sketched the first high-level diagram that helped me better visualize the narrative of the piece.

I expanded each outline section into bullet points, creating what I call the article guideline. After that, I wrote the article, edited it, and created the rest of the visuals. Finally, I wrote the title and SEO and copy-pasted everything into Substack.

Image 2: The nine-step workflow. Research and outline stay human; the rest gets automated with validation gates on the load-bearing steps.

Now, everything gets automated except two things. I still do an in-depth round of research to understand the topic and collect a few high-quality golden source seeds. This is the fun part. These are mostly pulled from my Readwise reading list, which acts as a curated library I built over time while browsing Substack, YouTube, LinkedIn, X, and more. Then, I use this as a high-quality seed for my deep research agent to expand it and fill any potential gaps.

Second, while researching I do a brain dump of everything I consider relevant on the topic. After wrapping up the research, I refactor the brain dump into an outline that follows an engaging narrative. Then, I do a combination of manual and Claude Code work to expand it with bullet points, creating the article guideline.

Together, those two steps are the seed that makes everything downstream mine. Without them, the pipeline produces generic AI mush.

Before the automated pipeline existed, a 3,000-word article like one of my latest pieces, Agentic Harness Engineering, would eat two to three days of my week running this exact nine-step grind by hand. Now, the same piece takes about a day.

Why this works

Writing prose is a translation step. It turns thoughts into words on a page or boxes in a diagram. Translation is exactly the kind of work LLMs excel at, if you already did the thinking.

If you haven’t, no amount of agent orchestration saves you. AI as a writing tool fails when you put zero thought into your process. It becomes a force multiplier when you use it to distribute your thoughts.

Now, let’s look at how the actual system works.

Understanding The System Architecture

The architecture has five big components plus a memory layer. The contract between them is the artifact each one writes to disk, such as the research markdown, the article guideline, the final article, branded image PNGs, and the final HTML bundle.

Here are the five components at a glance:

Deep Research agent (we call it Nova): Takes a topic and golden sources, returning a ranked, structured research file.
Writing Workflow (Brown): Takes the article guideline and research, returning the full styled article via an evaluator-optimizer loop.
Media style transfer: Because the article contains raw Mermaid diagrams, we apply the Decoding AI brand style.
Title and SEO generator: Runs an expand-and-narrow loop to produce the title, subtitle, SEO title, and SEO description.
HTML exporter: Converts the final markdown into platform-ready HTML for Substack, Medium, X, or LinkedIn to easily copy-paste the piece.

The handoff contract between components is the filesystem. Each stage reads and writes plain files in a working directory. Internal per-component state lives in databases: PostgreSQL for Nova, and an SQLite checkpointer for Brown.

The artifacts make the pipeline debuggable, resumable, and human-in-the-loop friendly across stages. The databases make each stage individually resumable mid-run. For example, if the writing workflow fails after generating the first draft, we can easily resume without having to spend tokens on rerunning from scratch.

Also, because everything is managed through files, I can open any artifact at any key step, inspect it, edit it, and re-run downstream.

We will show you how we used this system to write one of our latest popular pieces: Agentic Harness Engineering.

I’ve also used the same process to research and write professional lessons for other educational projects, such as our latest Agentic AI Engineering course, as the pipeline adapts to any type of educational business.

Image 3: End-to-end system architecture: Human-seeded research on the left, evaluator-optimizer writing in the middle, branded media and SEO on the right, finished HTML at the terminus.

Each component is explained in depth in the sections below. Two are MCP servers (Nova and Brown) and three are skills (media style transfer, title & SEO, HTML export).

In terms of concrete economics, the whole process runs at roughly ~$0.30 to $1 per image, mostly in Gemini credits, with the rest of the pipeline costing cents. This article, with 9 images, landed closer to $6, while a leaner piece with a single diagram sits around $1.

Build This Exact Stack Yourself (Product)

Reading about a pipeline is one thing. Building one is another. In my Agentic AI Engineering Course, built with Towards AI, I walk you through this exact stack from scratch.

Nova’s deep research loop, Brown’s evaluator-optimizer built on LangGraph, both served via FastMCP, plus the style-transfer skill, evaluation with Opik, and deployment on Docker, GCP, and GitHub Actions.

34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.

Rated 5/5 by 300+ students. The first 6 lessons are free:

Start here

Walkthrough: The Artifacts of One Article

Before diving into each component, let’s take a look at the input and output artifacts the pipeline produced while generating the Agentic Harness Engineering article. Here are some trimmed versions of each, as they get pretty large.

outline.md: the hand-written seed, Nova’s input (88 lines)

## Outline
1. Introduction - Why Do We Need a Harness?
	1. Personal story: To be researched
	2. Problem + Agitation: ...
	3. Transformation + Solution: ...
	4. Intuitively, Mitchell Hashimoto has the best definition of a harness: "the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
	5. 200 words
2. What the Hell Is A Harness?
    ...
3. How does a Harness Look?
	1. Key components: LLM, tools, planning loop, context engineering, sandbox, memory, orchestration layer, serving layer, interfaces
	2. The agent loop: Powered by planning techniques like ReAct
    ...
4. Planning & Orchestration
    ...
	4. 200 words
5. Key Tools
    ...
6. Sandbox Environment
    ...
7. Memory
    ...
8. Conclusion - The Future of Harness
    ...

# Resources

1. [My AI Adoption Journey](https://mitchellh.com/writing/my-ai-adoption-journey)
2. ...

The seed is deliberately rough. It contains placeholders like “To be researched”, section blocks that will later be restructured, and hand-picked golden sources that anchor Nova’s first round of research. The idea is to dump ideas without thinking too much about structure while you are in your creative mindset.

research.md: Nova’s output (1377 lines)

Image 4: A trimmed view of Nova’s collapsible-HTML research.md output.

article_guideline.md: Expanded outline, Brown’s input (201 lines, 8 sections).

## What We Are Planning to Share

...

## Why We Think It's Valuable

...

## Point of View

I write the article, Paul Iusztin. I am part of a bigger team known as Decoding AI....

----

## Article Outline

1. Why Do We Need a Harness?
2. What the Hell Is a Harness?
3. The Anatomy of a Harness
4. How the Agent Decides What to Do Next
5. The Tools That Let Agents Act
6. Where Agents Run
7. Memory Is Just the Filesystem
8. What's Next

## Section 1 - Why Do We Need a Harness?

...

## Section 2 - What the Hell Is a Harness?

...

- **Hook:** Start with the horse analogy. A horse is powerful on its own, but useless for farming without a harness — the straps, reins, and attachments that let you direct its strength toward useful work (inspiration from Jonathan Gimick from Manning). Same with LLMs: the model has the intelligence, but without tools, memory, state, guardrails, and orchestration, you can't put it to work reliably.
- **The clean definition:** LangChain's formulation is the clearest — **Agent = Model + Harness**. The harness is "every piece of code, configuration, and execution logic that isn't the model itself." The model provides intelligence. The harness makes that intelligence useful.
...

[GENERATE_DIAGRAM] Three levels of engineering: prompt, context, and harness engineering.
...

- **Transition:** Now that you know what a harness is, let's look at all its components and how they fit together at a high level — before diving deeper into each one.

- **Section length:** 300 words

## Section 3 - The Anatomy of a Harness

...

## Section 8 - What's Next

...

The article guideline is deliberately as structured and detailed as possible. The idea is to have enough or even more detail about each section to fill in the requested word budget to ensure the LLM doesn’t fill in any gaps with generalities or, worse, with hallucinations.

article.md: Brown’s final prose (~3,000 published words, 8 sections)

See the Agentic Harness Engineering article we posted a few weeks ago on Substack.

Now let’s see how Nova, the deep research agent, turns the outline and its golden sources into a structured research file.

Deep Research: How Nova Builds the Knowledge Base

Nova is an MCP server exposing ten specialized tools, orchestrated by the client, which is often a harness such as Claude Code or Cursor.

Here is how the overall deep research architecture works:

Query generation loop: Nova takes the topic and golden sources, runs gap analysis between the outline and the provided sources with Gemini Pro, and generates the next round of research queries based on what is missing. Three rounds hits the cost-versus-coverage sweet spot.
Concurrent retrieval: Each round fans out concurrent Perplexity calls that return only metadata and a summary of each new source.
Two-stage filtering: We full-scrape only the top five by a four-dimensional rubric evaluating trustworthiness, authority, relevance, and quality. For the rest of the sources we keep only the summary, which is enough for providing examples such as Anthropic is implementing compaction in Claude Code.

Image 5: Nova’s deep research loop. Three rounds of gap-driven Perplexity queries, a two-stage filter, and source-specific ingestion produce the structured research file.

Nova ships one purpose-built tool per source family. We scrape web URLs using Firecrawl, while we ingest GitHub repos through gitingest. We ingest YouTube videos using Gemini Pro directly on the URL without a local download.

For example, this is how I used Nova when writing my harness article. I started with a vague topic about what an agent harness is and why it matters. I handed Nova a seed set of golden-source URLs inside the guideline, including the LangChain harness post, the Anthropic long-running agents piece, and Mitchell Hashimoto’s AI adoption journey. Nova extracted these, scraped each one, and wrote the cleaned content into its working memory.

Nova then ran the three-round gap-analysis loop, fanning out concurrent Perplexity queries aimed at topics the seed sources had not covered. Every raw result was appended to the log. Ultimately, each source is filtered using a set of heuristics and LLMs to ensure we keep only high-quality results.

Image 6: Full Nova system architecture — MCP tools, Postgres state, two-stage filter, and source-specific ingestion.

Finally, Nova compiles everything into the collapsible HTML research file.

The client knows how to leverage all of Nova’s MCP tools through a skill that glues together the ingestion, search, and all the other utility tools into the unified deep research algorithm that takes as input the outline.md file and outputs research.md.

Writing Workflow: How Brown Turns an Idea Into an Article

Brown picks up where Nova left off. Brown is a workflow, not an agent, implemented with LangGraph. We chose a workflow over an agent deliberately: prose generation rewards predictability over exploration.

First, we generate all the required Mermaid diagrams for the article using the orchestrator-worker pattern that looks around the article and spins up a specialized Mermaid-diagram agent based on all the user requests found within the article. These are usually flagged within the article guideline explicitly by stating “generate diagram”, “create a diagram”, or [GENERATE_DIAGRAM]. Next, these diagrams are passed downwards through the generation process. We’ll come back to how they get styled in the Branded Images section. The orchestrator-worker pattern can easily be extended to generate other types of media such as images, videos, or audio.

Next, we control Brown’s voice via the system prompt through three large tricks.

The first one is based on defining a set of six profile classes, each targeting a different family of rules. There are four generic profiles, which are static and agnostic to who is using the tool and what they are doing:

Structure Profile: How the prose is physically laid out on the page such as sentence, paragraph, list, and subheading shape.
Mechanics Profile: The grammatical scaffolding the writing must respect such as active voice, point of view, and punctuation rules.
Terminology Profile: What vocabulary is allowed and what filler is banned such as word choice, sentence phrasing, and descriptive language.
Tonality Profile: How the article should feel to the reader such as formality level, voice characteristics, and emotional register.

And two customizable:

Character Profile: Who is writing. For example, I added here my biography. This should be adapted per user.
Article Profile: Special article characteristics such as the structure, referencing, and citations. This can be swapped to a LinkedIn, Reddit, or X profile to adapt the system to different formats.

The second trick is to force the LLM to respect the article guideline and research over anything else, to ensure the user gets what they expect and that Brown adheres only to the research to avoid hallucinations.

The third trick is to add a set of few-shot examples, which beats anything else because showing works better than telling. For the best quality this should be changed when switching article formats and especially when switching content formats.

After we compile our system prompt, we call Gemini at a 0.7 temperature to produce a first draft with more randomness.

Image 7: Brown’s writing loop. Six profiles compose the system prompt; a Generator-Reviewer-Editor loop iterates until the draft passes review.

After the first generation pass, we start an evaluator-optimizer loop running a Reviewer node with 0.0 temperature against the guideline, research, and profiles to ensure the draft respects all the expected requirements. The Reviewer node returns a list of structured review objects via Pydantic. If issues are found, we run an Editor node at a 0.1 temperature that applies all these fixes.

The evaluator-optimizer loop runs for a fixed iteration count, not until a quality score is good enough. Because writing, like any creative work, is highly subjective, a single quality score becomes noisy and unpredictable. Empirically, running the loop for a fixed number of iterations yields better results and gives us more control over cost and latency.

Because the article might not be polished enough, we expose editing tools through the MCP server so the user can kick off another review-edit iteration on demand.

Now let’s see how we transform raw Mermaid output into branded diagrams.

Generating Branded Images

Brown produces Mermaid source for every diagram in the article. Mermaid is fast and predictable to generate with LLMs but visually generic. In theory you can customize them. But let’s be honest. They are ugly. Thus, the job of this stage is to keep the structure of the Mermaid diagrams Brown produced while applying a styling layer on top of them.

We use a skill that leverages Gemini’s Nano Banana for the style transfer. The skill takes a file as input and detects all the Mermaid diagrams in it. Then, for each diagram, it runs parallel subagents. Each invokes the Gemini script on the raw Mermaid text and outputs a styled PNG.

Here is the prompt engineering behind the styling:

The branding is referenced both through a written file with color codes, fonts, and general guidelines, plus a representative image.
2 positive examples containing both the Mermaid inputs and positive styled outputs.

2 negative examples also containing the Mermaid inputs and faulty styled outputs.

When using images as few-shot examples, you should be really careful not to go overboard with them, as they add up in tokens quickly. Also, adding the positive and negative examples on top of just random style images and files was the special sauce for us that made everything work, as it clearly shows Nano Banana how to make the mapping between the two.

Now, let’s see how we generate punchy titles and relevant SEO.

Generating Title & SEO

Title and SEO are the most important components. They decide whether the article gets read at all. Doing it by gut on a Friday night is the worst possible workflow.

The pipeline replaces gut with an expand-and-narrow loop. We generate nine versions from many angles, score ruthlessly, and keep only the top four. Then we repeat this process three times.

The generator creates nine candidate title, subtitle, SEO title, and SEO description packages per round, each from a different angle like personal transformation, curiosity, making bold claims, showing proof of work, and more. The idea is to have a lot of diversity during the expansion round.

The validator scores every candidate on six rubric-anchored dimensions: title, subtitle, SEO title, and SEO description quality, article alignment, and cohesion across the four pieces. It uses hybrid scoring, combining an LLM-judge for the qualitative rubrics and heuristic penalties for the hard constraints such as character count. For example, shorter titles score higher.

Then, based on the scores generated by the validator, we pick the top four winners and use them as seeds for the next round of generation.

The key here is to make the validator a subagent that doesn’t share the same context window as the generator to avoid any type of bias. Fresh eyes prevent self-confirmation bias. This is the same principle Brown uses for its evaluator-optimizer split and Nova uses for its filter step.

Image 10: The expand-and-narrow loop. 9 angles × 3 rounds, scored by an isolated validator on 6 dimensions, narrowing to a top-4 for A/B testing.

So why pick the top four, and not three? When scheduling on Substack, we pick the top four for A/B testing rather than committing to the single highest-scored one. The validator is good but not omniscient, so we let real readers settle close calls.

Exporting to HTML

The last step is to compile the Markdown article into HTML so we can easily copy-paste everything into Substack. Boring, but necessary.

For this step, we created a skill that wraps ’s nb2wb CLI tool that does all the heavy lifting. The tool supports most popular formats such as Substack, Medium, X, and LinkedIn.

Initially, it was built to map Jupyter Notebooks to these formats, but it works amazingly for Markdown files too.

What Stays Irreplaceable

It is not 100% automated. I still follow the original research direction. I still write the outline brain dump. I still validate every artifact. I still write the code that runs the pipeline.

The 90% automation is real, but the 10% is the part that matters most. It’s the part that makes this article stand out as human: the seed, the taste, and the validation are irreplaceable.

What’s Next

You might wonder how well this works. Well... You just read an article created by this exact workflow. In other words, this is an article that talks about itself. It’s not yet perfect, but it will get there.

End-to-end, it took about a day of my time. Without the pipeline, this same article would have taken three days of mostly translation work.

💡 Want to build this exact stack yourself? Nova and Brown built with FastMCP & LangGraph, the style-transfer skill, human-in-the-loop orchestration, evaluation with Opik, and deployment on Docker, GCP, and GitHub Actions. Every line walked through with me and the Towards AI team. That’s exactly what we teach in our Agentic AI Engineering Course.

Otherwise, here is what I’m wondering:

Which step of your own writing workflow do you think is the most dangerous to automate, and which one have you been avoiding automating because you weren’t sure how?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

LangChain. The Anatomy of an Agent Harness
Anthropic. Effective Harnesses for Long-Running Agents
Mitchell Hashimoto. My AI Adoption Journey
The Agent Stack. OpenClaw Architecture Part 1
cefboud. How Coding Agents Actually Work: Inside OpenCode

Images

If not otherwise stated, all images are created by the author.

Your RAG Pipeline Is Overkill

Paul Iusztin — Tue, 07 Apr 2026 11:03:14 GMT

We constantly fight a battle against the context window limit. You either compress your data until it loses meaning, or you build a massive infrastructure project just to read a few documents. Today, we look at a third option. We explore a pattern that allows models to read millions of tokens by treating data as an environment rather than an input.

In most AI projects, such as the financial assistant I am working on, there is a constant battle between Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Should you implement a heavy RAG architecture up front that might not even work, or does CAG get the job done? For example, in our financial assistant system, we ultimately decided to use RAG only when we really HAVE to, because it introduces zigzag retrieval patterns that require dozens of queries per operation, increasing latency.

Also, while building Brown, my writing agent, I hit another wall. Brown needs to ingest massive amounts of research to anchor its writing process. At 180,000 input tokens, the Gemini API became entirely unreliable.

I faced constant timeouts, disconnections, and infrastructure breakdowns. Huge context windows suffer from API reliability and infrastructure stability issues, as well as performance degradation. But the thing is, I didn’t want to overcomplicate my solution with a RAG layer, so I started looking around for other solutions.

Most engineers face this painful tradeoff when working with large documents. You can stuff everything into the context window, but performance degrades quickly. This causes context rot, which happens when attention degrades over long contexts and earlier information loses its influence [1], [2].

Alternatively, you can build a RAG pipeline. But that requires maintaining vector databases, chunking strategies, and retrieval evaluation infrastructure.

Even the tools we use daily, like Claude Code or Cursor, rely on summarization-based context compression that loses critical information. I just wanted to dump my research into one file and get good answers without the infrastructure breaking. Recursive Language Models (RLMs) solve this exact problem [3].

RLMs use an inference-time pattern that treats your input as an external environment the model interacts with programmatically. You do not need chunking infrastructure or embedding pipelines. The model writes code to explore, filter, and recursively process your data on demand.

Image 1: The three approaches to processing large documents. RAG adds infrastructure complexity. Context stuffing causes degradation. RLMs treat the input as an external environment the model programs against.

This approach scales the effective input and output lengths of LLMs. Researchers tested RLMs up to 10 million tokens across GPT-5 and Qwen3-Coder, showing they easily outperform base models [3]. Base model performance degrades as a function of input length and task complexity, while RLM performance scales with less degradation.

RLMs are also a model-agnostic inference strategy, meaning they work with any model you choose.

However, this architecture has honest downsides you must consider. The inference cost has high variance due to differences in trajectory lengths. The system suffers from code fragility, meaning that if the model writes buggy code, the entire reasoning chain fails.

Errors in sub-calls can compound through the recursive tree, propagating hallucinations. Sequential sub-calls also create latency bottlenecks. This makes RLMs best suited for deep thinking applications rather than real-time chat.

To understand how we bypass these infrastructure limits, we need to examine the specific programming trick that keeps the model’s memory clean.

Here is what you will learn about this pattern:

The mechanism that keeps massive documents outside the context window.
The orchestration loop that drives programmatic data exploration.
The specific use cases where this pattern outperforms retrieval systems.
A practical method to approximate this behavior using Claude Code.

If You Want To Go Deeper Into Production AI (Product)

Patterns like RLMs show that the real challenge isn’t the model, but the infrastructure and systems around it, called the harness. If you want to master that harness, check out my Agentic AI Engineering course, built with Towards AI.

34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.

Rated 5/5 by 300+ students. The first 6 lessons are free:

Start here

The REPL Trick That Keeps Your Context Window Clean

RLMs introduce a simple core idea. Do not feed the document into the model’s context window. Instead, load it as a variable in a persistent programming environment and let the model write code to interact with it [4].

The model never sees your 10-million-token document directly. In a traditional agent, the prompt goes into the model, completely blowing up your context window. In an RLM, the context stays outside as an external variable, and the model receives only a symbolic handle to it.

The system initializes a Read-Eval-Print Loop (REPL), which is a persistent interactive programming environment where variables and state persist across iterations [3].

The root model receives only metadata, such as the total character count and data structure. It also receives instructions on how to access the REPL. The model then writes code to peek into, filter with regex, chunk, or summarize the data.

When the model identifies a sub-task, it uses a specific primitive such as llm_query(prompt, chunk) to spawn a fresh, isolated worker sub-model [3]. The system pauses, executes this sub-call, and returns the result to the root model’s REPL.

Variables persist across these REPL turns. The model aggregates findings into a buffer, building the response progressively across iterations. Once confident, it calls FINAL(answer) to stop the recursive loop and return the response [5].

Image 2: The RLM mechanism. The document stays outside the context window as a REPL variable. The model writes code to explore, decompose, and recursively process it.

RLMs essentially perform context engineering on autopilot. Traditional context engineering requires you to carefully curate what goes into the context window through retrieval and compression [1]. RLMs automate this by letting the model itself decide what to extract, filter, and process.

Costs and performance stay intact because the model filters the input context without explicitly seeing it. By writing Python scripts, the model processes only the relevant portions through sub-calls. Only constant-size metadata about execution results is appended to the root model’s history, keeping its context window small and clean.

Understanding this mechanical loop allows us to map the pattern directly to production harness engineering.

Turn Any Agent Into a Plan-Execute-Validate Machine

RLMs are an inference-time orchestration pattern that maps directly to production harness engineering. If you have built agent systems, you already know the components: a planning loop, tool execution and validation [7]. RLMs formalize this into a programmable, recursive architecture.

A robust RLM harness uses a multi-tiered architecture. The root controller is a frontier model that acts as the project manager. It plans the reasoning process, writes code, and coordinates execution, but never directly interacts with tools or the full document [8].

Worker sub-models are cheaper, faster models spawned via an operation such as llm_query() to handle specific, localized sub-tasks. This reduces overall costs while maintaining high quality. The aggregation layer is the REPL environment that combines recursive step results into a final structured response via persistent variables.

This setup naturally follows the plan-execute-validate mapping. In the plan phase, the root controller reviews the query, creates a reasoning plan, and decides how to decompose the problem. It might plan to regex-filter a codebase, chunk a document, or batch sub-calls for parallel analysis.

In the execute phase, the model translates the plan into code. It writes Python scripts, issues llm_query() calls, and spawns worker sub-models for parallel execution in isolated REPL environments. External tools, like web search, are provided ONLY to worker sub-models, keeping the root model’s context perfectly clean.

Image 3: The plan-execute-validate loop. The root controller plans, worker sub-models execute, the system validates, and the cycle repeats until FINAL().

After execution, the system enters the validation phase, where results feed back as observations. The root model assesses accuracy, launches verification sub-calls, and handles errors by dynamically adjusting its plan. If the Python code fails, the error traceback is yielded back to the model as an event.

This allows the model to adapt and fix its code on the next turn. The cycle repeats until the model calls FINAL(answer).

Deploying this in the real world requires strict production guardrails. You must configure maxIterations to cap the number of REPL turns, typically between 10 and 50. You need maxDepth to limit the recursive stack depth, where a depth of 1 is usually sufficient.

You also need maxStdoutLength to truncate REPL output returned to the model to prevent context overflow. Finally, permission gating is required to provide sandboxed execution with explicit approval for sensitive operations.

Neither Claude Code nor OpenAI Codex uses true RLM patterns. They rely on summarization-based context compression, file-system state tracking and progressive disclosure techniques [9]. This creates a succession of agents connected by prompts and file state, rather than maintaining a persistent REPL environment with programmatic sub-calls.

With this architecture in place, we can identify the specific real-world scenarios where this pattern outperforms traditional data processing.

Four Scenarios Where RLMs Beat Traditional Approaches

RLMs are best suited for deep thinking applications that require accuracy, multi-step reasoning, and reliability over massive contexts. They are not suited for real-time, low-latency chat applications.

The first scenario is parsing large files without building retrieval infrastructure. Instead of building a hybrid index with vector and graph search, you keep everything in one file or directory and use an RLM agent to extract information on demand.

We can view the relationship between RAG and RLMs as a spectrum. For simple cases, RLMs replace RAG entirely, removing the need for chunking and embeddings. For advanced scenarios, RLMs complement retrieval beautifully.

You use semantic search to find your first pool of candidates, write the results to disk as cached short-term memory, and use an RLM to query that refined dataset on demand.

The retrieval narrows the haystack, and the RLM reasons deeply over what is left. I use this exact workflow for my research, dumping everything into a massive text file and using an RLM to extract relevant information.

Image 4: RLM replaces the entire RAG pipeline for large file parsing. One file, one agent, no retrieval infrastructure.

The second scenario is complex software engineering and codebase comprehension. RLMs ingest massive codebases containing millions of tokens to answer questions about architecture, map dependencies, and perform reviews.

The RLM paper tested this on LongBench-v2 CodeQA using Qwen3-Coder with a Python REPL. The model writes code to break down the codebase, launches sub-queries to smaller language models, and aggregates findings [3].

Image 5: An RLM decomposes a codebase question into parallel sub-queries, each handled by a worker sub-model, then aggregates the results.

The third scenario is enterprise legal and financial analysis. RLMs provide consistent interpretation across thousands of contracts, case files, and policies that would overwhelm a standard context window. They also excel at financial audits and due diligence by tracing, validating, and reasoning through massive financial datasets.

The fourth scenario is deep research and information synthesis. RLMs synthesize research across thousands of files by programmatically filtering, chunking, and summarizing. They enable knowledge graph exploration and multi-hop reasoning over large document dumps.

At scale, RLMs become both more accurate and cheaper than standard long-context approaches. They avoid paying for n-squared attention over massive contexts by having the model process only relevant slices via sub-calls. In all these scenarios, the RLM pattern succeeds because it treats the LLM as a project manager that decides what to look at and delegates sub-tasks to workers.

Knowing these optimal use cases helps us approximate the pattern using tools you likely already have installed.

Build a Naive RLM SKILL in Claude Code

Claude Code does not natively use the RLM pattern. It relies on summarization-based context compression, file-system state tracking, and progressive disclosure. However, you can approximate RLM behavior using Claude Code’s existing harness features to build a naive RLM SKILL.

First, you set up the environment by having the SKILL load the target file or directory as a reference. Instead of feeding it into the context window, it writes the file path and metadata to a prompt for the root agent.

Second, the root Claude Code agent receives only this metadata and a set of instructions for how to interact with it. It uses its Explore subagent type
to examine the data structure, identify relevant sections, and plan its approach.

Third, the SKILL uses Claude Code’s Agent tool to spawn subagents. Each subagent receives a focused prompt to read specific lines and extract mentions, returning a condensed summary of a few thousand tokens. This mirrors the RLM pattern of spawning isolated sub-calls that process slices of the input.

Finally, the root agent collects these subagent results. It aggregates them into a coherent answer and decides whether more exploration is needed or whether to finalize the output.

Here is what this naive RLM SKILL looks like as a SKILL.md file:

---
name: rlm-research-analyzer
description: "Analyze large research files by treating
  them as an external environment. Instead of stuffing
  content into context, the model explores, decomposes,
  and recursively processes the data through subagents."
---

# Analyze Large Research Files Using the RLM Pattern

## Step 1 — Initialize the environment

Accept the target file path as an argument. Do NOT read
the file into context. Instead, run a Bash command to
collect metadata:

wc -l    # total lines
wc -c    # total bytes
head -5   # short prefix

Write the metadata and file path to a temporary prompt
file at /rlm_prompt.md. The root agent
receives ONLY this metadata, never the full content.

## Step 2 — Plan the exploration

Read rlm_prompt.md. Based on the metadata and prefix,
decide how to decompose the file. Use an Explore
subagent to scan the file structure:

- Identify section boundaries, headings, or delimiters
- Estimate which regions are relevant to the query
- Produce a ranked list of target ranges to process

## Step 3 — Delegate to worker subagents

For each target range, spawn an Agent subagent with a
focused prompt:

"Read lines {start}-{end} of {file_path}. Extract all
findings related to {query}. Return a summary under
2000 tokens."

Launch multiple subagents in parallel when ranges are
independent. Write each subagent's output to
/slice_{n}.md.

## Step 4 — Aggregate and finalize

Read all slice files. Synthesize the findings into a
single coherent answer. If gaps remain, return to
Step 3 with new target ranges. Otherwise, write the
final output to /answer.md and present
it to the user.

Notice how the four steps map directly to RLM primitives. Step 1 mirrors REPL initialization, where the data becomes an external variable rather than context input. Step 3 replaces the theoretical llm_query() operation with Claude Code’s Agent tool. Step 4 mirrors the FINAL() call that terminates the recursive loop.

This naive approximation lacks several critical features. It has no true REPL persistence, as Claude Code subagents do not share a persistent variable space. The filesystem serves as a proxy for REPL state, but it is slower and less elegant.

It also lacks sandboxing, as Claude Code runs directly in your environment. Then you miss out on configurable guardrails like max_iterations and max_output_chars, requiring manual limits instead. You get the idea.

Still, I’ve been using a similar technique in all my current projects: instead of stuffing the research into a file, I dump everything into a dir and link everything together in an index.yaml file that contains URIs to all the files, plus metadata such as the title and a 1-2 sentence summary of each source. Like this, through the index.yaml file, Claude Code can efficiently navigate the whole research dump token through progressive disclosure.

My structure looks something like this:

research/
├── index.yaml
├── file_1.md
├── file_2.md
├── ...
└── file_N.md

Also, the only out-of-the-box implementation I found is within the DSPy framework.

The naive SKILL is a useful thought exercise and a practical first step. For production use, you should reference the DSPy framework’s dspy.RLM module.

What’s Next

RLMs represent a fundamental shift in how we process large inputs. We are moving from asking how to fit data in the context window to asking how we let the model interact with it programmatically. This is a great thought exercise on integrating specialized inference-time functionality into your harness.

As models get better at writing code and REPL environments become more sophisticated, the boundary between the model and its infrastructure will blur. The model does not just use tools, it writes the tools on the fly to solve the specific problem in front of it.

Your next practical step is to experiment with our SKILL or with the DSPy framework’s dspy.RLM module on a real problem. Point it at a large codebase you need to understand or a research corpus you need to synthesize. Start with something you have been using RAG or context stuffing on, and see whether the RLM approach is more effective.

But here is what I’m wondering:

How have you been passing large files, such as deep research results or books, to your agents so far? RAG, CAG or other creative techniques?

Click the button below and tell me. I read every response.

Leave a comment

Enjoyed the article? The most sincere compliment is to restack this for your readers.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

(n.d.). Effective Context Engineering for AI Agents. Anthropic. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
(n.d.). MIT’s new ‘recursive’ framework lets LLMs process 10 million tokens without context rot. VentureBeat. https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/
Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv. https://arxiv.org/abs/2512.24601
(n.d.). Recursive Language Models: the paradigm of 2026. Prime Intellect. https://www.primeintellect.ai/blog/rlm
(n.d.). Why Recursive Language Models (RLMs) Beat Long-Context LLMs. Dextra Labs. https://dextralabs.com/blog/recursive-language-models-rlm/
Mansurova, M. (2026, March 30). Going Beyond the Context Window: Recursive Language Models in Action. Towards Data Science. https://towardsdatascience.com/going-beyond-the-context-window-recursive-language-models-in-action/
(2026, March 21). The Anatomy of an Agent Harness. LangChain Blog. https://blog.langchain.com/the-anatomy-of-an-agent-harness/
(2025, December 24). Building Effective AI Agents. Anthropic. https://www.anthropic.com/engineering/building-effective-agents
(2026, March 25). Effective Harnesses for Long-Running Agents. Anthropic. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

Images

If not otherwise stated, all images are created by the author.

Agentic Harness Engineering

Paul Iusztin — Tue, 31 Mar 2026 11:03:40 GMT

At the AI start-up I’ve been working at, building a financial personal assistant, we implemented LlamaIndex, added the Model Context Protocol (MCP), and built complex Retrieval-Augmented Generation (RAG) pipelines. Each piece added complexity without adding direct business value.

When we stripped everything back to plain Python, simple API calls, and a custom ReAct engine, things finally worked. What we accidentally built was a harness featuring specialized financial tools, domain-specific guardrails, and purpose-built context engineering.

We did not know the term yet, but the lesson was clear. The model was never the problem. The system and infrastructure around it were.

Most engineering teams obsess over which model to use. They debate GPT-4o versus Claude Opus versus Gemini. They chase LLM benchmark scores and swap models, hoping for better results.

But the model is only half the equation. The system and infrastructure around it determine whether your agent actually works in production.

TerminalBench 2.0 proved this. Changing only the harness moved the DeepAgent from LangChain from outside the top 30 to the top 5 [1].

Image 1: Agent = Model + Harness. The harness is everything that isn’t the model.

This is what usually happens. You have a powerful model. You gave it tools and a prompt. It works in demos.

But shipping it to production means solving problems the model cannot solve alone. You must bridge context windows, recover from failures, serve multiple interfaces, and manage state across sessions.

The solution is harness engineering. This is the discipline of building the infrastructure around the model so it can do useful work reliably. As Mitchell Hashimoto noted, harness engineering is the practice of engineering a solution every time an agent makes a mistake, ensuring it never makes that specific mistake again [2].

By the end of this article, you will learn:

What an agent harness actually is.
The core components powering production AI systems.
How the planning loop dictates agent actions.
The design principles behind an effective toolset.
How to manage memory using the filesystem.

Before we look at all its components and how they fit together, we must first define what a harness actually is.

Your Path to Agentic AI Engineering for Production (Product)

Most engineers know the theory behind agents, context engineering, and RAG. What they lack is the confidence to architect, evaluate, and deploy these systems in production. The Agentic AI Engineering course, built in partnership with Towards AI, closes that gap across 34 lessons (articles, videos, and a lot of code).

By the end, you will have gone from “I built a demo” to “I shipped a production-grade multi-agent system with evals, observability, and CI/CD.” Three portfolio projects, a certificate to back them up in interviews, and a Discord community with direct access to industry experts.

Rated 5/5 ⭐️ by 300+ early students saying “Every AI Engineer needs a course like this” and that is “An excellent bridge from experimental LLM projects to real-world AI engineering.”

Start learning today. The first 6 lessons are free:

Enroll here

So... What the Heck Is a Harness?

While talking with Jonathan Gennick from Manning, he said that the first time he heard about the term “harness” was in the context of horses. Let me explain. A horse is powerful on its own, but useless for farming without a harness. The straps and reins let you direct its strength toward useful work. The same applies to LLMs.

The model has intelligence. But without tools, memory, state, guardrails, and orchestration, you cannot put it to work reliably.

LangChain offers the clearest definition. An agent equals a model plus a harness. The harness is every piece of code, configuration, and execution logic that is not the model itself [1].

A basic agent, as we know it so far, is just a model, a prompt, tools, and a planning loop. A harness extends this by adding memory systems, guardrails, advanced orchestration, context engineering, and multi-agent coordination.

Usually, it also includes a serving layer that connects the agent to various user interfaces, such as terminal apps, web dashboards, IDE plugins, and messaging apps like Telegram.

Ultimately, a harness is a term for building real software applications using LLMs or other models as the operating system. Applications like Claude Code, OpenCode, OpenClaw, and Codex are all harnesses. You could swap the model inside them, but the real engineering value lives in the harness itself.

Image 2: The three levels of engineering: Prompt engineering is crafting instructions, context engineering is managing what the model sees, and harness engineering is the full infrastructure.

This introduces three distinct levels of engineering. Prompt engineering crafts the instructions. Context engineering dictates what goes into the context window and when.

Harness engineering is the full application and infrastructure. It controls when context loads, which tools are available, which actions are allowed, and how failures are handled. Each level encompasses the previous one [3].

Now that you understand what a harness is, the next step is to explore the internal architecture and see how these pieces connect.

The Anatomy of a Harness

A complete harness consists of the LLM, tools, a planning loop, context engineering, a sandbox, memory, an orchestration layer, and a serving layer. In other words, everything that has been hovering within the AI space is finally falling into one beautiful system.

Image 3: The full harness architecture — from the model at the center to the serving layer at the edge.

One of the most distinctive features of modern harnesses is the multi-surface architecture. OpenClaw serves the same agent across a command-line interface (known as TUI), a web UI, desktop apps, Slack and Telegram/WhatsApp through a centralized Gateway using a typed WebSocket protocol.

Codex evolved from a simple terminal tool to an App Server using JSON-RPC over standard input and output. OpenCode uses a Bun JS HTTP server where any client connects via HTTP, utilizing an Event Bus to broadcast results in real-time [4], [5], [6].

This architecture introduces challenges. Multiple messages arrive in parallel from different clients. Users ask questions while the model is still processing.

To solve this, systems use priority queues and message buses. OpenClaw uses a lane-aware FIFO queue to ensure only one active run per session while allowing parallelism across different sessions.

At the core of all this infrastructure, the filesystem is king. As the most foundational harness primitive, it enables durable storage, workspace management, multi-agent collaboration, and versioning.

You heard me right, there is no fancy vector database in place. With AI, we are going back to basics, and nothing is purer than the filesystem itself.

Every production harness uses the filesystem as its primary state mechanism [1].

You might wonder if this is just traditional orchestration like Airflow. It is different in three key ways. The agent loop is non-deterministic, context management is a first-class concern, and the programmer inside the loop is the LLM itself. It is common to add durability to the harness using tools such as Prefect, Temporal or DBOS that natively support dynamic pipelines rather than predefined, rigid DAGs.

Let us zoom in on the first and most fundamental component: the planning loop.

How the Agent Decides What to Do Next

The most common pattern for the planning loop is ReAct, which stands for Reasoning and Acting. The model receives the current state, reasons about what to do next, takes an action via a tool call, and observes the result. This cycle repeats continuously until a strict stopping condition is met [5].

Consider a concrete example. A user asks the agent to fix a failing test. First, the model reads the test output, reasons that the import path is wrong, and edits the file through a tool.

Second, it re-runs the tests, sees a new type mismatch error, and fixes it. Third, it runs the tests again.

They pass, the model reasons the job is done, and it stops. The harness orchestrates this loop, while the model reasons and picks actions.

Image 4: The ReAct loop drives every agent action. For complex tasks, an orchestrator delegates to specialized workers, each with its own context window.

When tasks are too complex for a single agent, harnesses use orchestrator-worker patterns. The orchestrator decomposes a task, delegates subtasks to specialized workers, and aggregates the results.

In OpenCode, a dedicated task tool spawns subagents. Each subagent gets its own session, context window, and restricted tool set [7].

For tasks that span multiple context windows, Claude Code implements Ralph Loops. This harness mechanism intercepts the model’s attempt to exit via a hook. It reinjects the original prompt in a clean context window, forcing the agent to continue against a completion goal using the state persisted on the filesystem [1].

While automating my business with agents, I learned a hard lesson about orchestration. I initially built five specialized agents, each handling one step.

I eventually found that a single agent with memory and smart context engineering outperformed the whole swarm. Always start with one well-harnessed agent before reaching for multi-agent complexity.

Here is a deep dive into planning:

While the planning loop decides the next step, the agent still needs a way to interact with its environment.

The Tools That Let Agents Act

This interaction happens through a specific toolkit designed for autonomous execution.

First, Bash is a general-purpose tool. The agent can run any shell command to execute tests, linters, or builds. This gives the model code execution capabilities so it can design its own tools on the fly rather than being constrained by fixed options.

For example, the agent runs Python code and executes it through python -c "...", generates a script and runs it through python main.py or runs your code as python -m my_module.main.

Second, specialized filesystem tools handle common operations like reading, writing, editing, and searching. Doing file operations via Bash is slow and error-prone.

Specialized tools include safety checks. For instance, a read tool enforces absolute paths and line limits, while an edit tool validates the uniqueness of replacement strings.

Third, state management tools track session-scoped tasks. These give the agent working memory within a single session. For example, OpenCode has ToDoAdd and ToDoRead tools that add/read tasks from a queue to keep track of the plan it has to execute.

Finally, orchestration tools launch subagents with their own isolated prompts and context windows, such as OpenCode’s task tool or Claude Code’s agent tool.

Image 5: The standard harness toolkit organized by design principle — from general-purpose bash to specialized filesystem tools to orchestration.

Feedback loops are the most important principle around tooling. Boris Cherny, the creator of Claude Code, noted that giving the model a way to verify its work improves quality by two to three times. For example, OpenCode integrates the Language Server Protocol (LSP) to get real-time code definitions and diagnostics.

Undefined variables and type errors are fed back to the LLM for immediate correction. These tools do not act on the world. They feed vital information back to the planning loop.

Harnesses also enforce tool access control. In OpenCode, the planning agent cannot call edit tools. This prevents exploratory agents from accidentally modifying your code [5].

Here is a deep dive into tool calling:

Once the agent has its tools, it needs a secure place to use them. In production, this requires strict isolation.

Where Agents Run

Agents execute code, and that code can fail, crash, or delete all your files. I know I want my precious notes protected. Sandboxes isolate agent execution so failures do not affect the host system or other agents. The cherry on top is that they also enable horizontal scaling across parallel environments.

There is a strict tradeoff between security and capability. Not every harness uses the same approach. Codex uses a hard sandbox.

Each task runs in an isolated cloud container preloaded with the repository. This provides maximum safety, but the agent cannot access the host filesystem [6].

Conversely, OpenClaw uses a soft sandbox. The workspace is the default working directory. This grants maximum capability but introduces more risk.

OpenClaw deliberately avoids hard sandboxing to preserve full filesystem access. Most production harnesses sit somewhere between these extremes, depending on the trust model.

When you submit a task to Codex, the harness spins up a fresh cloud container. The agent works inside this container to read files, run tests, and install packages.

It cannot touch your local machine. When the job finishes, the results are extracted, and the container is destroyed.

Along with security, a major benefit of cloud sandbox environments is that they give the agent access to powerful computing resources. For example, if you want to train a model using a GPU, you can ask the agent to implement and run a training pipeline hosted in a sandbox powered by a GPU.

This is similar to manually SSHing to different VMs and running the code manually there. Based on the same principles, you can easily spin up multiple cloud sandboxes and run your agents in parallel.

On the other side of the spectrum, you can also run sandbox environments locally through Docker containers or isolated processes, similar to what Cursor does. Super useful when you want to try something out and give the agent full permissions to avoid having to supervise it.

While sandboxes provide a safe space for execution, they are ephemeral by design.

Memory Is Just the Filesystem

To survive across sessions and context windows, every harness manages state across three distinct memory layers. The first layer is the filesystem. This is the long-term memory.

It is durable and persistent, surviving across sessions. This is where progress files, git history, and session transcripts live.

The second layer is the RAM. This is the short-term memory, also known as the working memory. It holds the conversation history and tool results during an active session. It is fast but volatile.

The third layer is the context window. This is what the model actually sees. It is the strictest constraint, as everything the model knows about the current task must fit here.

Image 6: The three-layer memory dynamics — filesystem as long-term state, RAM as working memory, context window as what the model sees. The cycle repeats: load → process → flush.

The harness orchestrates the dynamics between these layers. On the read path, the harness selectively loads relevant state from the disk into the RAM.

It then assembles the context window using context engineering techniques such as compaction, progressive disclosure, and just-in-time retrieval. On the write path, the harness persists important state back to the disk after processing.

OpenClaw enforces a strict invariant that memory is always flushed to disk before being discarded from context. Rehydration is treated as a tool-shaped action, where the agent searches and then retrieves specific data, rather than dumping everything into the context window [8].

Context engineering makes this possible. When token counts exceed ninety percent of the limit, OpenCode automatically summarizes the conversation. Codex assembles prompts from multiple sources and exploits prompt caching.

Anthropic recommends using structured note-taking files and sub-agent architectures to isolate context [5], [6], [9].

In Anthropic’s long-running agent pattern, an initializer agent creates a script, a progress file, and a feature list. The coding agent reads the git logs and progress files at the start of each session and updates the progress file as it progresses.

The beauty? There is no database or vector store. It is just the filesystem [10].

Here is a deep dive into memory:

Now that you have seen all the pieces, from planning and tools to sandboxes and memory, the question is what this means for how you build software.

What’s Next

We are witnessing a new way of building software. Instead of software engineers building traditional frontend and backend applications, the next generation of production software will be harnesses. Harness engineering is merging software engineering with AI, moving it one level up [3].

Popular tools like Claude Code are just the beginning. In the long run, no company will want to depend entirely on proprietary harnesses. Even open-source solutions like OpenCode will not cover every specific use case.

Companies will inevitably build their own. As we experienced at ZTRON, custom systems and infrastructure are what finally make an agent work in production.

However, we must be honest about current limitations. Memory still breaks across long sessions. Validation loops still miss edge cases. Furthermore, orchestrating hundreds of parallel agents on shared codebases remains an open research problem.

Harness engineering is real engineering. Your harness becomes its own product with its own bugs, its own drift, and its own maintenance burden.

What’s your opinion? Do you agree, disagree, or is there something I missed?

Leave a comment

Enjoyed the article? The most sincere compliment is to share our work.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

References

LangChain. (2026, March 21). The Anatomy of an Agent Harness. LangChain Blog. https://blog.langchain.com/the-anatomy-of-an-agent-harness/
Hashimoto, M. (2026, March 25). My AI Adoption Journey. Mitchell Hashimoto. https://mitchellh.com/writing/my-ai-adoption-journey
Bouchard, L. (2026, March 25). What Harness Engineering Actually Means. What’s AI by Louis-François Bouchard. https://youtube.com/watch?v=zYerCzIexCg
Govindarajan, V. (2026, March 21). OpenClaw Architecture Part 1 - The Agent Stack. The Agent Stack. https://theagentstack.substack.com/p/openclaw-architecture-part-1-control
Abboud, M. (2026, March 17). How Coding Agents Actually Work: Inside OpenCode. Moncef Abboud. https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/
ByteByteGo. (2026, March 26). How OpenAI Codex Works. ByteByteGo. https://blog.bytebytego.com/p/how-openai-codex-works
Anthropic. (2025, December 24). Building Effective AI Agents. Anthropic. https://www.anthropic.com/research/building-effective-agents
Govindarajan, V. (2026, March 24). OpenClaw Architecture Part 3: Memory and State Ownership. The Agent Stack. https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory
Anthropic. (2025, October 22). Effective Context Engineering for AI Agents. Anthropic. https://www.anthropic.com/engineering/effective-context-engineering
Anthropic. (2026, March 25). Effective Harnesses for Long-Running Agents. Anthropic. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

Images

If not otherwise stated, all images are created by the author.

From 12 Agents to 1

Paul Iusztin — Thu, 26 Mar 2026 12:01:52 GMT

It is 2026. People across the industry still mix up words like workflows, agents, tools, and multi-agent systems. Beyond terminology, this confusion has led to massively overengineered solutions.

Teams jump to multi-agent architectures because it sounds impressive and helps raise money. In reality, a simple workflow would have been faster to build, cheaper to run, and easier to debug. The result is bloated systems, wasted tokens, and debugging nightmares.

Our goal is to provide a clear mental model of what architecture to choose for your AI project: workflows vs. single agents vs. multi-agent systems.

from Towards AI has been working on this exact problem with his clients and distilled his decision framework into two YouTube videos: Stop Overengineering: Workflows vs AI Agents Explained and From Workflows to Multi-Agent Systems: How to Choose. He allowed me to take that framework and turn it into this article. Kudos to Louis-François!

This decision framework is a spectrum from simple to complex that tells you exactly what to build based on your actual constraints. The goal is to stay as far left on the complexity spectrum as possible while still solving your problem.

Here is what you will learn:

The fundamental difference between an agent and a workflow.
How to use the complexity spectrum to make architecture decisions.
When to rely on simple workflows for predictable tasks.
Why a single agent with tools is often enough for dynamic problems.
The exact breaking points that justify moving to a multi-agent system.

To apply this spectrum effectively, you must first define the terms. Here are the core misconceptions that lead to bad architecture decisions.

Before we continue, a quick word from the Decoding AI team. ↓

Go Deeper: Your Path to Agentic AI for Production

34 lessons from first principles to production. Learn about context engineering, workflows, agents, evals, and the design of AI systems.

Rated 4.9/5 ⭐️ by 300+ early students saying ”Every AI Engineer needs a course like this” and that is ”an excellent bridge from experimental LLM projects to real-world AI engineering.”

Start learning today

↓ Now, back to the article.

Clarifying the Confusion: Not Everything Is an Agent

The first major misconception is that every LLM application is an agent. The key difference is autonomy. In a workflow, you control the flow.

You decide the steps and their order. In an agent, the model controls the flow. It decides what to do next based on the goal you give it.

If you can write down the exact sequence of steps in advance, you are building a workflow. You are not building an agent.

Image 1: A side-by-side comparison of a predetermined workflow and an autonomous agent, highlighting who controls the flow.

The second misconception is that tools are agents. A tool is a capability. It can be a calculator, a database query, a web browser, a validator, or an API call.

It can even be another LLM. An agent is the decision maker who chooses which tools to use and when.

If someone tells you they built a multi-agent system, but it is actually one model calling ten different APIs, that is not multi-agent. That is a single agent with ten tools.

Image 2: A visual showing the distinction between tools and agents, with a central agent utilizing various tools.

This distinction matters. It defines how you architect, debug, and scale your system. It drives your core architecture choice between a workflow, a single agent with tools, or multiple agents.

The Complexity Spectrum: A Mental Model for Architecture Decisions

To make this architecture choice easier, we use a complexity spectrum. It is a slider going from the most control to the most autonomy. Your goal is to stay as far left as possible while still solving the problem.

Level 1 represents workflows. Here, you chain multiple LLM calls together in a predefined sequence. You control every step.

Level 2 represents a single agent with tools. The model makes decisions about what to do next. You have one decision maker and multiple capabilities.

Level 3 represents multi-agent systems. Here, you have multiple decision makers who need to coordinate with each other.

Image 3: A horizontal spectrum showing three levels of autonomy with increasing cost and complexity.

The core principle is straightforward. Move right on this spectrum only when you absolutely have to. Each step to the right increases costs, latency, and debugging complexity.

More LLM calls mean more tokens, more traces to follow, and more places where things can go wrong.

In practice, start simple and escalate only where things break. Write a prompt first. Test it.

Implement it with minimal complexity. Measure the results. Add what is missing.

If the model lacks information, add retrieval. If it needs calculations, add a tool. Only when you genuinely need autonomous decision-making should you reach for an agent.

Even then, start with one. The best AI systems are the simplest ones that reliably solve the problem. That usually means starting with workflows.

When a Workflow Is the Right Answer

Workflows are the right answer when your steps are known and stable. If the process is largely the same each time, regardless of input, a workflow is almost always the best choice.

Workflows win because they are predictable. They are easy to test because you can write unit tests for each step. They are easy to debug because you can trace exactly what happened when something goes wrong.

They are also cheap because you are not burning tokens on the model, figuring out what to do next.

Consider a support ticket system. A ticket comes in. You classify it.

You route it to the right team. You draft a response from templates and context. You validate it against the policy.

Finally, you send it. Each step might involve an LLM call, but the model does not need to decide whether to classify before routing. That is always the order.

Building this as an agent adds overhead without adding capability.

Image 4: A horizontal flowchart illustrating the support ticket workflow with six sequential steps.

Do not underestimate workflows. They are not limited to simple sequential chains. They can include routing to pick different models based on input.

They can use parallel execution with majority voting to aggregate answers. They can also use generator-evaluator loops where one LLM generates and another validates until quality criteria are met. They can even leverage tools in designs like the orchestrator-worker. These patterns handle complex tasks without any agent overhead.

If you can write down the exact sequence of steps in advance, like a recipe, it is a workflow.

When a Single Agent with Tools Wins

Sometimes the order of work is not fixed. You genuinely cannot write down the steps in advance. This happens when the path changes depending on what you discover along the way.

Maybe the first API call fails, and you need to try an alternative. Maybe the retrieved data is incomplete, and you need clarification. This is what agents handle well.

When is an agent worth the risk? Anthropic offers a useful framework. Agents make sense when the task is complex enough to need autonomous decisions and delivers real value.

Critically, the cost of errors and the cost of discovering those errors must be low. This is why AI coding agents are great. A human reviews the code before production, so mistakes are cheap to fix.

A purchasing agent who accidentally buys the wrong hardware makes an expensive error. You must match your architecture to your error tolerance [3].

The rule is to always start with one agent. A single agent with tools works best when tasks are tightly coupled and mostly sequential. It works well when global context matters, meaning step one affects step five.

It is also ideal when you need fewer than twenty tools and face strict budget or latency constraints.

Take a marketing content platform from Louis-François’s client work at Towards AI. The client wanted AI-assisted content generation for emails, text messages, and promotional materials. Their initial specification called for a multi-agent setup with over a dozen specialized agents.

They wanted an orchestrator, a request analyzer, a content generator, and many others. On paper, it looked clean with specialists doing specialist work.

Image 5: Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.

A single agent was the right call. The tasks were tightly coupled and sequential. The template choice affects the content.

Personalization depends on both content and contact data. Splitting this across multiple decision makers creates information silos and handoff errors. They did not need parallelism.

The flow was to plan, generate, validate, and fix if needed.

The key insight is that tools can be smart. A tool can have its own system prompt and use a different model. The validation tool can use its own LLM with instructions to catch errors.

The text message tool can treat character limits as deterministic engineering constraints instead of prompting problems. You get specialists, but you keep one brain to maintain context and make final decisions.

Image 6: An agentic loop diagram showing how a single agent plans, executes, and reflects.

This results in a system that is faster to build, cheaper to run, and easier to debug. You get the same capabilities without the coordination overhead.

The Tool Count Problem: When One Agent Isn’t Enough

As your tool list grows, tool selection gets harder. This is one of the main ways agent systems quietly break down. It is also one of the clearest signals that splitting into multiple agents might be worth it.

Every tool has a name, description, and schema that the model needs in context to use correctly. The more tools you add, the more of your context budget you burn before the agent even starts thinking about the actual task. You also have to add system instructions, a few-shot examples, retrieved documents, and conversation history on top of that.

A single agent tends to work best with roughly 10 to 20 tools. Past that threshold, tool selection degrades. The agent has to choose among too many options in an already packed context.

This mechanism is known as context rot. LLM performance measurably degrades as context grows, well before hitting the advertised limit. Two forces drive this issue.

First, more context means more noise competing for the model’s attention. Second, models suffer from loss in the middle. They tend to attend more to the beginning and end of their context, underweighting information in the middle.

As your tool schemas and instructions pile up, the model gets worse at picking the right tool.

Image 7: The context window budget problem, more tools mean less room for actual task reasoning.

Managing context can reduce history and retrieved content, but not the tool schema load. Those definitions must always be there. The only approach that actually reduces how many tool definitions the model sees per call is splitting across agents.

If one agent sees only email tools and another only sees validation tools, each call stays smaller. Tool selection gets easier. Once you split tools across agents to keep calls small, you enter multi-agent territory.

When Multi-Agent Is Actually the Right Call

Specific reasons justify multiple agents, not because the architecture sounds impressive. There are four legitimate reasons to go multi-agent. First, you need true parallelism where tasks are genuinely independent and run simultaneously.

Second, you face context overload where instructions and tools degrade performance. Third, you need modularity to connect with third-party agent systems you do not control. Fourth, you have hard separation requirements like security boundaries or sensitive data handling.

Consider the professional article generation system that Louis-François and I built as one of the projects for our Agentic AI Engineering course. We started with a single agent for research and writing but had to pivot because the two phases have fundamentally different needs.

The research phase is exploratory and dynamic. It needs flexibility and broad tool access across web search, video transcription, and document processing. The agent searches, reads, pivots based on what it finds, and iterates based on human feedback.

The writing phase is constrained and deterministic. It needs focused constraints, consistent style enforcement, and iterative refinement against fixed rubrics.

These agents communicate through explicit artifacts. The research agent produces a structured markdown file that the writer agent consumes as context. There is no complex runtime orchestration.

It is just a sequential handoff with a clear contract between them. Each agent has its own optimized context without the bloat of carrying the other’s tools.

Image 8: The article generation multi-agent system with a Research Agent, a Writing Agent, and an artifact handoff.

If you do go multi-agent, we recommend the plan-and-execute combined with the orchestrator-worker pattern. You do not want everyone talking to everyone. One orchestrator maintains the main context and delegates specific tasks to worker agents.

Then, it synthesizes the results. This prevents the information silos that kill multi-agent systems.

Image 9: The Orchestrator-Worker pattern with no direct communication between workers.

Multi-agent systems can simplify individual contexts and enable specialization. However, they increase coordination costs. You will face more token usage, added latency, more failure points, and handoff complexity.

Only accept those costs when you hit a real constraint that simpler architectures cannot solve.

To Wrap Up

To build reliable AI applications, you must stay as far left on the complexity spectrum as possible while still solving your problem.

Keep these key takeaways in mind:

Not every LLM application is an agent, and not every tool is an agent.
Always start with workflows because they are predictable, cheap, and testable.
Use one agent when the path cannot be predetermined, but keep the tool count manageable.
Move to multi-agent architectures only when you hit a real constraint like true parallelism or context overload.

Each step right on the spectrum increases cost, latency, and debugging complexity. The simplest system that reliably solves the problem is always the best system.

💡 If you want a step-by-step framework to help you decide what architecture to pick for your next project, Louis-François and the Towards AI team put together a free cheatsheet that walks you through the decision process from workflows to multi-agent systems.

What’s your opinion? Do you agree, disagree, or is there something I missed?

Leave a comment

Enjoyed the article? The most sincere compliment is to share our work.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Images

If not otherwise stated, all images are created by the author.

The AI Evals Roadmap I Wish I Had

Paul Iusztin — Tue, 24 Mar 2026 12:04:28 GMT

Welcome to the AI Evals & Observability series: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.

AI Evals is the topic most AI engineers know they should invest in, but do not know where to start. I remember struggling with this myself.

I did not know how to properly integrate evals into my app until I understood there are three core layers: optimization during development, regression testing before merging, and production monitoring on live traffic. Once that clicked, everything else fell into place.

I did not know how to build LLM judges and evaluators that I could actually trust and use. Every guide I found either hand-waved the details or dumped a generic “helpfulness” metric and moved on. Instead, I needed evaluators grounded in my actual business requirements.

I did not know how to gather custom datasets without wasting too much time. I tried generating hundreds of synthetic test cases up front, but the real unlock came from learning how to organically grow a high-quality dataset from production data, starting small and letting the error-analysis flywheel do the heavy lifting.

The information was scattered across blog posts, talks, and vendor docs. Most of it focused on isolated techniques without showing how everything connects. I built this series as the structured, end-to-end guide I wish I had.

This 7-lesson series breaks it all down from first principles. By the end, you will know how to integrate AI evaluations that actually track and improve your product's performance. No vibe checking required.

The series follows a natural progression. You start by understanding where evals fit. Then, you build the dataset.

Next, you design and validate the evaluators. Finally, you handle specialized domains like RAG and see how it all works in production.

You can read front-to-back for the full journey. Alternatively, jump to the lesson that matches your current pain point. Each lesson stands on its own but references the others.

Without more yada, yada, here are the 7 lessons of the series:
(Scroll down to find more about each lesson individually.)

Everything is completely free, without any hidden costs, thanks to our sponsor, Opik ↓

Opik: Open-Source LLMOps Platform (Sponsored)

This AI Evals & Observability series is brought to you by Opik, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more.

We use Opik daily across our courses and AI products. Not just for observability, but as our end-to-end evaluation harness, all from the same platform.

Try Opik for free here (25k spans/month free)

This series teaches you how to build evals from scratch (custom datasets, LLM judges, optimization loops, and production monitoring), while Opik gives you the platform to run everything at scale.

Here is how we use it:

Custom LLM judges: Build evaluators by defining your criteria, adding a few-shot examples, and running them across hundreds of traces automatically.
Run experiments, compare results: Test different prompts, models, or parameters from your AI app side by side. Opik scores each variant with your evaluators and shows you which one wins.
Plug evaluators into production: The same LLM judges you design for offline testing run on live traces too. Set up alarms when scores drop below your threshold so you catch regressions before users do.

Opik is fully open-source and works with custom code and with every popular AI framework or tool (including OpenClaw). You can also use the managed version for free (with 25K spans/month on their generous free tier):

Try Opik for free

↓ Now, let’s move back to the article.

Lesson 1: Integrating AI Evals Into Your AI App

To build a reliable system, you first need to know where evaluation fits into the development lifecycle.

Most teams start by “vibe checking” their AI app. They manually test a few inputs and eyeball whether the outputs look right. That works for the first version.

But the moment you start adding features, onboarding real users, or trying to improve existing capabilities, vibe checking collapses. This first article gives you the holistic map of where AI Evals fit, so you never feel lost again.

Here is what you will learn:

The three core scenarios where evals matter: optimization during development, regression testing before merging, and production monitoring on live traffic.
The difference between guardrails and evaluators. Confusing them leads to gaps in your system.
The minimum viable tech stack required to start: a custom annotation tool and an LLMOps platform.

Go to Lesson 1

Lesson 2: Build an AI Evals Dataset from Scratch

Once you understand where evals fit, the next step is gathering the data required to measure performance.

You cannot evaluate what you cannot measure. You cannot measure without data. Most teams either skip this step entirely or fire off a generic prompt to create 100 test cases and call it done.

This article teaches the error analysis framework. It is a practical flywheel that turns 20-50 real production traces into a growing, high-quality evals dataset.

Here is what you will learn:

The error analysis flywheel: sample traces, label manually, build evaluators iteratively, perform error analysis, and create specialized evaluators.
Why one “benevolent dictator” should own labeling consistency across your team.
How to graduate from generic to specialized evaluators as your understanding deepens.

Go to Lesson 2

Lesson 3: Generate Synthetic Datasets for AI Evals

Production traces alone have limits. You need traffic to get data, and that traffic rarely covers every scenario. What about before you have users?

What about rare failure modes you have never seen in production? Yet! Synthetic data solves the cold start problem and fills coverage gaps.

Here is what you will learn:

Why you should generate only inputs, not outputs, and let your real app produce the outputs.
How to think in dimensions like persona, feature, scenario, and input modality to avoid mode collapse.
Tester agents for simulating multi-turn conversations.
The reverse workflow for RAG: generate questions from your knowledge base, not the other way around.

Go to Lesson 3

Lesson 4: How to Design Evaluators

You have the dataset. Now you need evaluators who can actually tell you whether your app is working. This is where most teams make their biggest mistake.

They grab a generic helpfulness metric off the shelf and call it done. This article teaches you how to design evaluators grounded in your actual business requirements.

Here is what you will learn:

The evaluation harness: the infrastructure that automates running evaluators across your dataset.
When to use fast, deterministic code-based evaluators versus flexible, nuanced LLM judges.
Common design mistakes
Advanced designs for multi-turn conversations and agentic workflows.

Go to Lesson 4

Lesson 5: How to Evaluate the Evaluator

You built an evaluator. It says everything is great. But is it?

An evaluator that validates every output is worse than no evaluator at all. It gives you false confidence. This article teaches you how to validate your evaluator against human judgment and close the gap when they disagree.

Here is what you will learn:

The iterative refinement loop: measure alignment, diagnose disagreements, adjust few-shot examples, and re-measure.
Dealing with non-determinism: why LLM judges give different answers on the same input, and how to stabilize them.

Go to Lesson 5

Lesson 6: RAG Evaluation: The Only 6 Metrics You Need

After mastering general evaluators, you can apply these principles to specific architectures like RAG.

RAG evaluation feels overwhelming because everyone proposes different metrics. But it does not have to be complicated. This article proves that there are exactly three variables in any RAG system: Question, Context, and Answer.

There are exactly six possible relationships between them. That is it. Every RAG metric maps to one of these six relationships.

Here is what you will learn:

The three RAG variables and six exhaustive relationships.
Tier 1: Retrieval metrics. If retrieval is broken, nothing else matters.
Tier 2: The three core RAG metrics you always need.
Tier 3: When core metrics cannot explain the failure.

Go to Lesson 6

Lesson 7: Lessons from 6 Months of Evals on a Production AI Companion

Theory and isolated metrics are useful. But the ultimate test is running this entire system on live user traffic.

The first six articles teach you how to build the system. This final article shows you what it looks like after six months of running it in production.

Written as a guest post by , Senior Data Engineer at Workpath, it shares the real lessons. We cover what worked, what failed, and what they wish they had known from the start.

Here is what you will learn:

The three observability problems most teams hit: falling for generic metrics, skipping manual annotation, and not treating AI agents as data products.
How to use Opik’s architecture, including traces, spans, threads, and prompt versioning, for production monitoring and evals.
How to reverse-engineer evaluation criteria from real traces instead of guessing upfront.

Go to Lesson 7

How to Take the Course?

After completing these seven articles, you will have the complete mental model for AI Evals. You will understand everything from strategy to production.

As the course is 100% free, with no hidden costs or registration required, taking it is a no-brainer.

Each lesson is a free article hosted on the Decoding AI Magazine.

Just open each lesson in the order provided by us, and you are good to go:

Each lesson will guide you through the required steps.

Enjoy!

Now What?

After completing these lessons, if you want the information to stick, you have to put everything into practice by building a cool project!

I am sorry to say there is no other way to make learning worthwhile. Pick one problem and get your hands dirty with a project.

💡 Want to share your work on my socials with my 140k+ audience? If you build a project you are excited about, I will be too. Trust me! I love seeing people build cool stuff. To share it, you can contact me here.

See you next Tuesday.

Paul Iusztin

What’s your opinion? Do you agree, disagree, or is there something I missed?

Leave a comment

Enjoyed the article? The most sincere compliment is to share our work.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Thanks again to Opik for sponsoring the series and keeping it free!

Try Opik for free here (25k spans/month free)

If you want to monitor, evaluate and optimize your AI workflows and agents:

Try Opik for free

Images

If not otherwise stated, all images are created by the author.

Agentic AI Engineering Guide

Paul Iusztin — Thu, 19 Mar 2026 12:03:14 GMT

I have spent the past two years building and breaking AI agents in production. Along the way, I have seen the same patterns destroy systems over and over. This happens not because the models are bad, but because the system design is wrong.

Most agents fail silently. They work well in demos but drift unpredictably in production. Costs spike with no clear explanation.

Behavior becomes erratic, and every release feels risky. Ultimately, teams end up stuck in PoC purgatory, unable to ship, debug, or trust their own system.

The root cause is almost never the model. It is subtle system design mistakes that individually look small but compound into production disasters.

To fix this, together with , we created a diagnostic framework for six specific mistakes that cause agentic systems to break in production. Each has a clear problem, a reason why it happens, and a proven fix. Once you know what to look for, you can trace most production failures back to one of these patterns.

The first and most common failure starts right at the input level, where engineers mishandle the context window.

Mistake #1: Treating the Context Window as an Afterthought

When something breaks, the instinct is to add more context. Engineers add more rules, more history, more tools, and more examples. The assumption is that if the model sees everything, it will behave better.

But this turns the context window into a dumping ground instead of a carefully scoped working memory. As the context grows, the model starts to ignore instructions and apply constraints inconsistently. It hallucinates more and drifts across runs.

Latency spikes and costs compound. This is the lost in the middle problem. Many teams respond by splitting one giant prompt into dozens of smaller ones.

But that introduces its own problems, such as more LLM calls, higher latency, and harder debugging.

💡 Treat the context window as a scarce resource.

Every LLM call should have one clearly scoped job. You must curate context aggressively by selecting, compressing, and pruning before every call. Move persistence into a memory layer.

The context window holds only what matters for the next decision, and everything else lives in memory, which you write to and read from continuously.

As a rule of thumb, start with a single prompt. If it works, stop. If it fails, do not jump to agents.

Introduce a small number of specialized steps and tune until you hit the balance. Context engineering is about deliberate selection.

Once the context window is secure, the next trap is overengineering the architecture before the problem demands it.

Mistake #2: Starting with Complicated Solutions

You have a clear problem, so you immediately reach for multi-agent architectures or heavy frameworks. You build RAG pipelines, hybrid retrieval, multiple databases, or adopt new protocols like MCP. You do this not because the problem demands it, but because it feels like the right way to build serious AI.

Every layer adds a hidden tax. You get more dependencies, higher latency, higher costs, and harder debugging. Complexity compounds operational pain.

Teams end up spending months building infrastructure and shipping nothing.

At our startup, ZTRON, we built a multi-index RAG system. We had OCR pipelines, separate embedding pipelines, hybrid retrieval, and agentic RAG loops.

It worked, but simple queries took 10 to 15 seconds. Costs climbed, and debugging was a nightmare.

When we finally asked if we actually needed all this, the answer was no. Our data fit within modern context windows. We replaced agentic RAG with cache-augmented generation (CAG) for most workflows.

This gave us fewer LLM calls, lower latency, fewer errors, and an easier system to debug.

Start with the simplest solution that could work. Prove the core task works first. Only add memory, tools, retrieval, or multiple agents when the problem demands it.

Production-grade AI is built by engineers who ship simple systems first and scale complexity intentionally.

Earning complexity often means realizing that you do not need an agent at all, which brings us to the third mistake.

Mistake #3: Building Agents When a Workflow Will Do

Predictable tasks like data ingestion, summarization, or report generation need predictable execution. That is a workflow. Open-ended tasks like deep research or dynamic decision-making under uncertainty may need autonomy.

Agents handle these open-ended scenarios. Most teams treat predictable problems as if they need agents. When you use an agent for a structured task, you pay for autonomy you do not need.

You get unpredictable behavior, variable latency, higher token usage, and inconsistent outputs. The system works 80% of the time and fails when it matters most.

Workflows and agents are not binary choices. They sit on a spectrum known as the autonomy slider. More autonomy buys flexibility but costs predictability, cost control, and debuggability.

You must set the slider intentionally.

Adopt a workflow-first approach. Start with prompt chaining, routing, parallelization, or an orchestrator-worker pattern. Introduce agents only when the system must autonomously plan, explore unknown paths, or recover from failures dynamically.

For vertical AI agents, use a hybrid approach. Route known patterns to workflows and open-ended requests to agents.

Whether you use a workflow or an agent, you must handle the data they produce, which exposes a flaw in how engineers process outputs.

Mistake #4: Fragile Parsing of LLM Outputs

You ask the model for something structured, and it responds with something that looks structured. You parse it with regex, string splitting, or custom logic. It works in staging.

Then one day, a missing comma or different bullet style crashes production. LLMs are non-deterministic. Even with identical prompts, output can drift due to context changes, model updates, or variations in tool outputs.

Fragile parsing is a time bomb. Many teams respond by prompting the model to output JSON. That is better than free-form text, but it still is not a contract.

You still get missing keys, wrong types, and drifting nested fields.

Stop treating LLM outputs like text and treating them like data. Define a schema, enforce it at generation time, validate at runtime, and fail fast when wrong. Use Pydantic as the bridge between probabilistic generation and deterministic code.

But only use structured outputs when structure is required. If you only need a plain string, accept a string and keep schemas shallow and minimal.

If you have secured your context, simplified your architecture, chosen the right autonomy, and enforced output schemas, you are ready to build an agent. However, many teams still fail by omitting actual planning from their loops.

Mistake #5: Forgetting Agents Need Planning

You give a model tools, let it pick one, feed the tool output back, and repeat. At a glance, it looks agentic, but it is just a workflow with randomness. The system is reacting to the last tool output, not driving toward a goal.

Without embedded planning, the loop cannot decompose tasks into meaningful steps. It cannot evaluate progress or choose next actions intentionally. The result is random behavior, unnecessary tool calls, infinite loops, and shallow reasoning.

Copying ReAct or Plan-and-Execute from blog posts without adapting them to your domain makes it worse.

You must embed planning into the loop. Before calling a tool, require a reasoning step. Ask what the goal is, what the next best action is, and what evidence you need.

Add progress checks and stop conditions like max steps, token budgets, and escalation when stuck. Make planning use-case specific, because generic ReAct is not a product. Tailor planning to your tools, data, constraints, and failure modes.

Even a well-planned agent will degrade over time if you do not measure its performance continuously.

Mistake #6: Not Starting with AI Evals from Day Zero

You build features without tracking how well your AI behaves. You have no tests, no evaluation metrics, and no defined success criteria. Every new feature is a gamble, and teams silently ship regressions.

AI systems do not fail all at once. They decay. A prompt change, a new tool, or a model upgrade causes subtle behavior shifts.

Without evals, nobody can answer whether a change made the system better or worse. Teams get stuck relying on vibe evals, which are manual, gut-feel testing that does not scale. Many teams think they are doing evals, but rely on generic scores like helpfulness or 1-5 star scales.

A score of 3.7 helpfulness tells you nothing about what to fix.

Use evals as your north star. Define task-specific, binary metrics tied to real system behavior and business requirements from day one. Use evals to drive the optimization flywheel.

Integrate evals into your development workflow to catch regressions before users do.

Recognizing these six mistakes is the first step to escaping PoC purgatory.

Conclusion

These six mistakes are not exotic edge cases. They are the exact patterns that repeatedly break real agentic systems. Individually, they look small, but in production, they compound into disasters.

Each of these mistakes deserves a deeper breakdown with real examples and production-tested fixes. That is why we turned them into a free 6-day email course. We cover one mistake per day, with the exact patterns and solutions we use in production.

💡 If you want the complete breakdown, sign up here.

Otherwise, see you next Tuesday.

Paul Iusztin

What’s your opinion? Do you agree, disagree, or is there something I missed?

Leave a comment

Enjoyed the article? The most sincere compliment is to share our work.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Images

If not otherwise stated, all images are created by the author.

Why RAG Has Exactly 6 Failure Modes. No More, No Less.

Paul Iusztin — Tue, 17 Mar 2026 12:03:16 GMT

Welcome to the AI Evals & Observability series: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.

🧐 Everyone says you need AI evals. Few explain how to actually build them and answer questions such as…

How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? ...and most importantly, how do we stop “vibe checking” and leverage evals to actually track and optimize our app?

This 7-article series breaks it all down from first principles:

Integrating AI Evals Into Your AI App
Build an AI Evals Dataset from Scratch
Generate Synthetic Datasets for AI Evals
How to Design Evaluators
How to Evaluate the Evaluator
RAG Evaluation: The Only 6 Metrics You Need ← You are here
Lessons from 6 Months of Evals on a Production AI Companion

By the end, you’ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!

Let’s get started.

RAG Evaluation: The Only 6 Metrics You Need

In our previous article, we covered how to validate your AI judges. We measured agreement with human judgment and iterated until alignment was high. Thus, you can now deploy with confidence.

However, a specialized challenge exists that general-purpose grading tools do not fully address. Evaluating RAG systems introduces a third variable, specifically the retrieved context. With this new element comes a distinct set of failure modes requiring their own metrics.

I am currently building a financial personal assistant at the stealth AI startup I work for. The application runs heavily on RAG. It pulls financial data from Postgres and integrates with external services such as email, Customer Relationship Management (CRM) tools, and cloud drives.

When it came time to evaluate the system, building the dataset proved harder than choosing metrics. Fortunately, we had a domain expert on the team who manually tested the application from the start. Therefore, we translated all of that Quality Assurance (QA) work into our AI evals collection using the error analysis workflow from Article 2.

Evaluating RAG systems introduces a unique difficulty. Each data sample required the correct context to be loaded into the database. We solved this by coupling every test case with a Postgres SQL export.

This file contained documents, chunks, embeddings, and metadata. We injected it directly into the storage system. This effectively created a cache that bypassed the ingestion pipeline during evals.

Once the data was in place, implementing the core RAG metrics became straightforward. We used tools like Opik and foundational models like Gemini Pro as the LLM judge. We had the context, the query, and the answer, which is everything you need.

What surprised me was that not every capability needed this level of dissection. For our report generation feature, we expect an exact format with specific values pulled from the storage. Checking the final document against a ground truth served as a better proxy than tracing every retrieval step.

Sometimes assessing the destination matters more than checking the route.

RAG evaluation feels needlessly complex. Vendors have an incentive to make it difficult. Every framework ships with many metrics and a dashboard, making you feel like you need a PhD to know if your system works.

Underneath all the complexity, RAG systems possess exactly three core components. These are the Question (Q), the retrieved Context (C), and the generated Answer (A). Furthermore, with these elements, there are exactly six possible relationships between them. When your RAG system fails, it breaks along one of these six relationships every single time.

The beauty of this framework is its exhaustive nature. There are no hidden variables.

You do not always need to evaluate all six individually. For core conversational features, you need the primary metrics because there are many silent failure modes. However, for structured output tasks, an end-to-end check against expected results can be sufficient.

Image 1: The six exhaustive relationships between the three RAG variables — Question, Context, and Answer.

Here is what you will learn in this article:

The only six relationships that exist in a RAG system.
How to evaluate your retrieval step before looking at generation.
The three core metrics every RAG application needs.
Advanced metrics for diagnosing subtle hallucinations.
How to match evaluation frequency and strictness to your domain.
How to collect and prepare the data your evaluators need.

Before digging into the article, a quick word from our sponsor, Opik. ↓

Opik: Open-Source LLMOps Platform (Sponsored)

This AI Evals & Observability series is brought to you by Opik, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more.

We use Opik daily across our courses and AI products. Not just for observability, but to design and run the exact RAG evaluators this article teaches. All from the same platform.

Try Opik for free here (25k spans/month free)

This article shows you how to evaluate RAG systems. Opik gives you the harness to run those evaluations at scale. Here is how we use it:

Custom LLM judges with rubrics — Build the evaluators this article describes: define your criteria, add few-shot examples, and run them across hundreds of traces automatically.
Run experiments, compare results — Test different prompts, models, or configurations side by side. Opik scores each variant with your evaluators and shows you which one wins.
Plug evaluators into production — The same LLM judges you design for testing run on live traces too. Set up alarms when scores drop below your threshold so you catch regressions before users do.

Opik is fully open-source and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier):

Try Opik for free

↓ Now, let’s move back to the article.

The Only 6 RAG Evaluation Metrics That Can Exist

Jason Liu properly articulated the framework I am about to walk you through [1]. Since I wrote the LLM Engineer’s Handbook two years ago, I have watched many RAG evaluation tools emerge. They overcomplicate everything with proprietary metric suites.

Through all of that, I already internalized that only three variables matter in any RAG system. Testing the combinations between them is the only thing you should actually do. Jason Liu gave a clean, formal articulation to what I had in mind.

He nailed the structure and deserves the recognition for that.

Every RAG system has three variables. We define Q as the user’s question, C as the retrieved context, and A representing the generated answer. Thus, we use the notation X|Y to mean the quality of X given Y.

There are exactly six relationships between these variables:

C|Q (Context Relevance) asks if the retrieved context addresses the question. This measures your retriever, because if it pulls irrelevant passages, the generator cannot fix the issue.
A|C (Faithfulness) checks if the answer sticks to what is in the context. This measures your generator to see if the model hallucinated or stayed grounded in the documents.
A|Q (Answer Relevance) verifies if the response actually addresses the prompt. This is the end-to-end user experience metric. Even if the context is good and the reply is faithful, it must help the person asking.
C|A (Context Support) ensures the retrieved text contains everything needed to support every claim in the answer. This checks if the provided information was sufficient.
Q|C (Question Answerability) evaluates if the prompt can even be resolved with this context. This determines whether the system should attempt to reply at all.
Q|A (Self-Containment) asks if someone can infer the original question from reading the answer alone. This measures whether the output provides enough background to stand on its own.

This framework is exhaustive. Three components produce exactly six conditional relationships. There are no hidden factors.

Therefore, when your RAG system fails, one of these six metrics is broken.

Image 2: The complete grid of six RAG relationships — each mapped to the component it diagnoses (Retriever, Generator, or End-to-End).

Not all six relationships matter equally in every context. We organize them into three tiers. Let us start with retrieval metrics as the prerequisite foundation.

Tier 1: If Retrieval Is Broken, Nothing Else Matters

RAG is first and foremost a retrieval problem. If the search mechanism does not retrieve the right documents, nothing downstream can save you. The generator will either hallucinate or produce irrelevant answers based on whatever junk it received.

Before evaluating any of the six RAG relationships, you need to know if your retriever even works. You can use classic information retrieval metrics that measure how well you find relevant documents before generation starts. They are fast to compute and do not require LLM judges.

Hence, these measurements give quick feedback for tuning your retriever.

You must establish ground-truth labels to compute these metrics. For each query, you must know which text blocks are actually relevant. You can build this dataset using the reverse workflow presented in depth in Article 3.

As a quick recap, you start from your knowledge base of document chunks. Then, based on a set of closely related chunks, you generate realistic questions that can only be answered using that unique set of chunks.

Because the prompt derives from the source material, you know exactly which segment should be retrieved. This gives you a perfectly aligned ground-truth triplet: (question, answer, context). Thus, it becomes straightforward to check whether your search tool actually surfaces the right information.

There are four main metrics. Precision@K measures the fraction of the top K retrieved chunks that are actually relevant. If your retriever returns 5 chunks but only 2 are useful, your precision is 40%. Recall@K asks: of all the relevant chunks that exist in your entire corpus, how many did your retriever actually find in the top K? If the database has 4 chunks that could answer the question but you only retrieved 2 of them, your recall is 50%.

In addition, Mean Average Precision (MAP@K) averages precision across multiple queries, rewarding retrievers that consistently rank relevant chunks early. It works by computing precision at every position where a relevant item appears, then averaging those values. Here is a step-by-step example where the truly relevant items are A and C:

Average Precision for this query = (1.0 + 0.66) / 2 = 0.83. We only average the precision values at positions where a relevant item appeared (ranks 1 and 3). MAP@K then takes this score and averages it across all your queries.

Finally, Mean Reciprocal Rank (MRR@K) focuses on the position of the first relevant match. If the first relevant chunk appears at position 3, the reciprocal rank is 1/3; if it appears at position 1, it is 1/1. Higher is better.

Use these for daily development. These indicators are great for tuning embeddings and chunk sizes, while also being perfect for A/B testing retrieval strategies. No LLM is needed, making the process cheap and fast.

These numbers tell you if the search phase works, as illustrated in Image 3. The six RAG relationships tell you if the whole system functions properly, meaning you need both.

Image 3: Retrieval metrics applied to a financial assistant query — checking whether the retriever surfaces the right chunks.

With the retrieval confirmed working, you can evaluate the generation step. Let us look at the three core RAG relationships that every system needs.

Tier 2: The Three RAG Metrics You Always Need

These three metrics directly assess how well your RAG system functions. Most evaluation frameworks prioritize these specific measurements. They map to the three most critical of the six relationships.

First, we have Context Relevance (C|Q). This checks if the retrieved text actually addresses the prompt’s information needs. Therefore, it measures your search component similar to the metrics from Tier 1, but only looking at the dynamics between the context and question, without any ground truth.

Suppose we have a query about recent payouts from Q4. A good example is when the retrieved data contains the user’s dividend payment records from Q4, which passes. On the other side, a bad scenario occurs when the system returns general information about how these distributions work and their tax implications.

This represents the most common RAG failure mode. In our financial assistant, this often happened when the search tool pulled educational content instead of actual account data.

Second, we have Faithfulness (A|C). This asks if the reply restricts itself to claims that can be verified from the provided text. Hence, it measures whether your generator hallucinates or not.

In our use case, a good example is when the source contains a CRM record showing a client meeting scheduled for portfolio rebalancing. If the response states exactly that, it passes. A bad example happens when the model adds hallucinated agenda items like tax-loss harvesting strategies, resulting in a failure.

Third, we have Answer Relevance (A|Q). This checks if the output directly addresses the specific query from the prompt. This serves as the end-to-end user experience metric.

A good example is when a person asks how much their investments grew last month. The reply provides the specific percentage change and absolute dollar amount. A bad scenario is when the text discusses general market performance without mentioning the actual account.

We measure all three metrics using LLM judges as designed in Article 4 and validated in Article 5.

Image 4: The three core RAG metrics illustrated with financial assistant examples — each measures a critical relationship between Question, Context, and Answer.

These three metrics cover the most common failure modes. For specific domains and failure cases, we have to dig deeper into the next 3 metrics.

Tier 3: When the Core Metrics Can’t Explain the Failure

The last three metrics provide deeper diagnostic insights usually required in sensitive domains or use cases.

First, we have Context Support (C|A). This checks if the retrieved context contains all the information needed to fully back every claim in the response. While this sounds similar to Faithfulness (A|C), the direction is different. Faithfulness asks: “did the answer deviate from the context?” , where you look at the answer and check if it introduced claims that aren’t there. Context Support asks: “was the context sufficient to support the answer?”, where you look at the context and check if it actually contains everything the answer needs.

Here is a concrete example. Suppose the answer says your total Q4 dividend income was 2,340 across 5 holdings, with the largest payout from MSFT at 890. Now look at the context: it only contains the total dividend amount of $2,340. The per-holding breakdown is nowhere in the retrieved documents. The context was insufficient. It had the total but not the details. The LLM produced a plausible breakdown, but the context could not support it. This is low-context support.

Second, we have Question Answerability (Q|C). This asks if the user's question can even be resolved with the given information.

Suppose the user asks about crypto portfolio performance, but the retrieved documents only contain equity data. This makes the request unanswerable. The system should refuse rather than guess. This metric is important when you want to validate that your agent answers with “I don’t know” instead of confidently hallucinating an answer due to insufficient context.

In our financial assistant, this was important because some queries can only be resolved if the agent has permissions to access the right external tool first.

Third, we have Self-Containment (Q|A). This checks if someone can infer the original prompt from the reply alone.

A response stating your portfolio’s return is 12.4% stands alone. A reply stating just 12.4% does not. Prioritize this metric when outputs are forwarded via email, logged in CRM notes, or read without the original conversation.

Image 5: Faithfulness catches obvious hallucinations where the answer deviates from context. Context Support catches the subtler case where the context was insufficient, and the LLM silently filled the gaps.

You now know what to measure at each tier. Two questions remain. How often should you run each one? Which metrics deserve the most attention for your specific domain?

Matching Frequency and Strictness to Your Domain

Each tier maps to a different running frequency depending on how fast and cheap you can run the evaluations. It also depends on their overall impact on the system.

Start with Tier 1 on a daily basis. Implement fast retrieval metrics for everyday development and to tune your retrieval component. These are the cheapest to execute as they do not require LLM judges.

Furthermore, they provide quick feedback cycles. Use them for the improvement flywheel with synthetic data from day zero, focusing on these basic indicators before moving to more complex approaches.

Move to Tier 2 on a weekly basis. Implement the three primary RAG connections. These core metrics directly assess how well your system functions.

Use LLM-based grading for a more nuanced assessment of these interactions.

Incorporate Tier 3 on a monthly basis. Introduce advanced metrics when you need deeper insights. Run a full evaluation to identify prompts that the application should not be answering.

Image 6: The tiered evaluation cadence — cheapest and fastest at the top, deepest and most expensive at the bottom.

Here, we focused only on RAG-related measurements. However, this actually applies to any type of AI evals layer. You could implement Tier 1 checks in your CI/CD pipeline to execute on each commit.

You can trigger Tier 2 evaluations manually before merging your code from your feature branch. Finally, manually run Tier 3 metrics before major releases and strategic decisions.

There is another dimension to consider when choosing metrics for your use case, which is the good old domain.

Different domains require emphasis on distinct indicators. What matters most depends on the severity of the use case.

High-severity domains include finance, medical, and legal applications. In these fields, Faithfulness (A|C) and Context Support (C|A) are non-negotiable because every claim must be traceable. Answerability (Q|C) is also critical, meaning the application must refuse rather than guess.

Thus, you want precision over recall, which is the exact profile we use for our financial assistant.

Medium severity domains include customer support and technical documents. Answer Relevance (A|Q) leads here, as the output must be helpful and correct. Answerability (Q|C) helps you know when to hand off to a human, and you generally want recall over precision in retrieval.

Low-severity domains include research, writing, and content generation, where synthesis and creative reframing are expected. Context Relevance (C|Q) and Answer Relevance (A|Q) is primary, while Faithfulness (A|C) thresholds remain lower. The generator is supposed to add value beyond the raw text.

Therefore, you want high recall in the search phase to cast a wide net across sources.

You know what to measure, when, and what to prioritize. None of this works without the right data and infrastructure. Let us explore how to build the evaluation harness.

Building the RAG Evaluation Harness

RAG evaluation requires inputs, outputs, and the retrieved context. You need the full triplet.

The most common blind spot involves treating RAG testing like any other LLM assessment. Teams measure the final reply’s quality, but never capture what background data the generator actually worked with. Without that information, half the metrics in this article are impossible to compute.

Next, you should ground your RAG dataset in real human judgment. In our financial assistant, we had a domain specialist on the team who manually QA’d the application from the start. They ran queries, checked whether the right data was retrieved, and verified that the replies made sense.

We translated all of that manual work into our AI evals collection using the error analysis workflow from Article 2.

Also, building RAG datasets introduces a unique difficulty. Each test case needs the right documents, chunks, and embeddings available in the database. Otherwise, the search tool has nothing to work with.

Running the full ingestion pipeline for every evaluation run is slow and introduces variability.

We solved this by coupling each data point with a Postgres SQL export containing the relevant documents, chunks, embeddings, and metadata. We loaded this file directly into the storage system for each test, effectively creating a context cache. This made the process fast and reproducible.

We inject the records, run the query, evaluate the trace, reset the environment, and move to the next item. Image 7 illustrates these data preparation paths.

Image 7: Two paths for building your RAG evaluation dataset — manual expert QA (Article 2) and synthetic reverse workflow (Article 3) — both requiring proper context preparation.

If you do not have enough production data or expert QA samples, you can create synthetic RAG evaluation sets. Use the reverse workflow from Article 3 by starting from your knowledge base. Use an LLM to extract key facts from specific passages.

Then, formulate realistic user questions that can only be answered using that exact text block.

Because the prompt derives directly from the source material, the input, expected retrieval context, and expected reply are perfectly aligned by construction. This gives you a complete ground-truth triplet. Furthermore, this technique is especially powerful for bootstrapping coverage across your entire document corpus.

Include unanswerable queries in your collection. Do not only formulate prompts that the application should resolve correctly. Instead, create scenarios where the context deliberately lacks the information needed, forcing the agent to refuse or say it does not know.

Without these negative examples, your testing suite is one-sided. Your evals will optimize for always attempting a reply, whereas adding them directly exercises the Answerability metric from Tier 3.

Next, if your RAG architecture integrates with external services, the retrieval path is not just a vector database search. Your agent needs to decide which tool to call first. Should it query Postgres, search the CRM, or check the user’s email?

The best retrieval metrics will not help if your model invoked the wrong data source entirely.

In our financial assistant, this was critical. A query about a client meeting should hit the CRM, not the transaction database. Therefore, we added code-based checks for tool selection alongside our RAG metrics.

Another important trick is to run separate graders per RAG dimension. Do not ask one LLM to evaluate context relevance, faithfulness, and answer relevance in a single prompt. Isolated checks with dimension-specific rubrics produce more consistent results than a unified evaluation.

Ultimately, you need to log specific data for every trace using tools such as Opik. Record retrieved chunks to see what the generator had access to. If faithfulness fails, check whether the reply used information that was not provided. Track metadata such as document IDs and scores, because when context relevance fails, you need to know which items ranked highest. This represents the same observability infrastructure from Article 1.

Next Steps

RAG evaluation is not complex. It is just three variables and six relationships. When your RAG system fails, one of these specific links is broken.

Fix that exact issue and ignore the complexity theater.

Start with Tier 1 retrieval checks as daily prerequisites. Add Tier 2 primary indicators weekly. Extend to Tier 3 when specific failure modes demand it.

Ultimately, match your evaluation priorities to your domain’s risk profile.

Next time you see a vendor dashboard with dozens of RAG metrics, map each one back to the six relationships. If an indicator does not clearly measure one of the core links, it is noise. Drop it and focus on what actually diagnoses failures.

Next up is the final piece in the series. We will explore real-world lessons from months of running evals on a production AI companion. We will discuss what worked, what failed, and what the team would do differently.

Also, remember that this article is part of a 7-piece series on AI Evals & Observability. Here’s what’s ahead:

Integrating AI Evals Into Your AI App
Build an AI Evals Dataset from Scratch
Generate Synthetic Datasets for AI Evals
How to Design Evaluators
How to Evaluate the Evaluator
RAG Evaluation: The Only 6 Metrics You Need ← You just finished this one
Lessons from 6 Months of Evals on a Production AI Companion

See you next Tuesday.

Paul Iusztin

What’s your opinion? Do you agree, disagree, or is there something I missed?

Leave a comment

Enjoyed the article? The most sincere compliment is to share our work.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Thanks again to Opik for sponsoring the series and keeping it free!

Try Opik for free here (25k spans/month free)

If you want to monitor, evaluate and optimize your AI workflows and agents:

Try Opik for free

References

Liu, J. (2025, May 19). There Are Only 6 RAG Evals. jxnl.co. https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/
Grace, M., Hadfield, J., Olivares, R., & De Jonghe, J. (2026, January 09). Demystifying Evals for AI Agents. Anthropic Engineering Blog. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents

Images

If not otherwise stated, all images are created by the author.

Why Most RAG Tutorials Fail You

Priya — Thu, 12 Mar 2026 12:02:03 GMT

Paul: Today, the stage belongs to Priya, a Senior Software Architect who’s spent years shipping production-scale systems at Publicis Sapient and Tesco.

She’s deconstructing RAG with a production-first mindset, skipping the theoretical demos to focus on building for architectural reliability.

This one is packed. Let’s get into it 👀 ↓

The “Deer in the Headlights” Moment

I’ve navigated many shifts since the early days of the web, from monoliths to cloud-native microservices and SOAP to REST. But the AI wave felt different. I found myself in a “deer in the headlights” moment, completely unsure of what to learn or even where to start. Should I dive into neural network math, focus on model training, or master context engineering (AI moves quickly)?

Eventually, the path became clear when I realized my real value lay in applying AI to complex business problems. In an enterprise context, that led me straight to RAG. It isn’t just about the model, it’s also about the robust system you build around it. It felt like a return to architecture, a concrete problem to solve where using AI could make a profound difference. However, as I started building, I hit a second roadblock...

Why Most RAG Tutorials Didn’t Help Me Learn RAG

Most RAG tutorials optimize for one outcome: getting an answer out of a model as quickly as possible. That’s fine for demos. It’s a poor way to learn how RAG systems behave in production.

I’m not new to building production software. I’ve spent decades shipping and maintaining systems where debuggability, operability, and failure modes matter. What’s new to me here is RAG, not the discipline of building systems that survive contact with reality. While learning RAG, I wanted to internalize the constraints I’d eventually face anyway: inspectability, idempotent ingestion, debuggable retrieval, and controllable generation. That meant resisting framework-managed chains and owning the control flow early, even if it slowed me down.

This post documents how I’m teaching myself RAG by building a production-grade system in deliberate phases, using frameworks as utilities rather than architecture.

That approach was heavily influenced by, and indeed, inspired by Paul Iusztin’s From 100+ AI Tools to 4: My Production Stack, especially this idea:

AI frameworks are good utilities. They should not dictate the architecture or control flow of your system.

That became my guiding principle.

Before we continue, a quick word from the Decoding AI team. ↓

Go Deeper: Your Path to Agentic AI for Production

By the end, you will have gone from “I built a demo” to “I shipped a production-grade multi-agent system with evals, observability, and CI/CD.” Three portfolio projects, a certificate to back it up in interviews, and a Discord community with direct access to industry experts and Paul Iusztin.

34 lessons from first principles to production — context engineering, workflows, agents, evals, and deployment

Rated 4.9/5 ⭐️ by 290+ early students — ”Every AI Engineer needs a course like this” and ”an excellent bridge from experimental LLM projects to real-world AI engineering.”

Start learning today

↓ Now, back to the article.

The Architecture

Before diving into the details, here is the end-to-end architecture of the RAG system. This diagram serves as a reference model, and we’ll walk through each layer and the production considerations that shaped these choices.

Phase 1. Ingestion: Own the Data

What I built: a pipeline that discovers files → loads documents → normalizes text → chunks → embeds → stores everything in Postgres.

From experience building production systems, ingestion pipelines are where complexity quietly accumulates if they lack idempotence, i.e., the ability to safely re-run without ending up in an inconsistent state, such as duplicate data, partial updates, or stale artifacts. The same applies to traceability, i.e., the ability to trace exactly what happened, to which data, and when. I assumed the same risks would apply here.

What I didn’t account for was how the nature of debugging would differ so vastly from what I was used to in the past. It wasn’t just about emitting log and error information at the right places anymore. A bad chunk doesn’t throw an exception, it just hallucinates an answer three steps later.

Single database, many uses

Instead of introducing a separate vector database, I used Postgres + pgvector. Chunks, embeddings, and metadata live together. That decision buys me a lot:

I can inspect ingestion results with plain SQL
I can join vectors with relational metadata
I can reproduce retrieval behavior outside the application

That inspectability matters when you’re still learning, and having less infrastructure to maintain pays off long after.

Frameworks as utilities, not architecture

I use LangChain’s document loaders (TextLoader, PyMuPDFLoader) for format handling. But the control flow is explicit and mine:

for file_info in discover_files(folder_path):

    raw_docs = load_document(file_info.file_path)

    clean_text = normalize_text(raw_docs)

    chunks = chunk_text(clean_text, chunk_size=512)

    embeddings = await embed_chunks(chunks)

    await save_to_postgres(file_info, chunks, embeddings)

Each step is isolated. Each step can be logged, rerun, or replaced independently. When something breaks, I debug my code, not a framework-managed chain. For instance, during my initial tests, I used PyPDFLoader for the document loading phase. When I inspected the chunking, I realised the chunks had incorrect spaces due to kerning (e.g., ”P r e - C h u n k”). This was easy to address just by swapping PyPDFLoader with PyMuPDFLoader, which handled the complex layouts better.

Idempotence and safe re-runs

I mentioned earlier that pipelines break down when they lack idempotence. Here’s how I addressed it.

Every file’s contents are hashed. If the content hash matches what’s already stored, the file is skipped, no wasted compute, no risk. If the content has changed, its old chunks and embeddings are completely removed before the new ones are written. The database never ends up with a mix of old and new states for the same source.

During development, it makes experimentation safe. For instance, I can tweak chunk sizes, swap embedding models, or change preprocessing logic, then re-run the entire pipeline and trust the result. Without this, every experiment would mean manually cleaning up the database first, or worse, not realizing stale data was still there, silently affecting retrieval quality. More importantly, though, in production, it makes the pipeline resilient to failure. If ingestion crashes halfway through, I can simply restart it. Files already processed are skipped, and the rest pick up where they left off. No manual cleanup, no risk of corruption.

Phase 2. Retrieval: Make Failure Visible

Retrieval is where the quality of your results is determined, which makes debugging discipline more important than clever algorithms.

What I built: query preprocessing → embedding → similarity search → optional reranking.

Most LangChain tutorials show you how to build a RAG pipeline as a “chain,” i.e., a single call where the framework retrieves context, sends it to the LLM, and returns the answer. I chose not to do that. Consistent with the architecture philosophy above, retrieval is an explicit phase, and every step in the retrieval pipeline is an explicit function call I control and invoke directly:

async def retrieve(query: str, top_k: int = 5, rerank: bool = False):

    processed_query = preprocess_query(query)

    query_embedding = embed_query(processed_query)

    results = await search_similar_chunks(query_embedding, top_k)

    if rerank:

        results = rerank_results(query, results, top_k)

    return RetrievalResponse(query=query, results=results)

Keeping retrieval explicit makes failures legible. When an answer is wrong, I can tell whether the issue came from:

query preprocessing
embedding quality
recall
ranking

Because vectors live in Postgres, I can reproduce retrieval behavior with SQL alone.

That inspectability is invaluable when learning.

Retrieval → Generation boundary

This is the boundary where many RAG systems start to blur failure modes. But they are fundamentally different problems.

Retrieval, including reranking, decides what context is allowed to reach the model. It is a search problem. It fails by missing relevant information (poor recall) or burying it in noise (poor precision).

Generation decides what the model does with the provided context. It is a reasoning problem. It fails by misinterpreting the context, hallucinating facts, or ignoring instructions.

Keeping this boundary explicit helps you immediately diagnose which problem you effectively have. If the answer is wrong but the context contains the truth, you fix the prompt. If the context is missing the truth, you fix the search.

Phase 3. Generation: Treat the LLM as an Unreliable Dependency

What I built: context formatting → LLM invocation with retries → response assembly.

LLMs fail in ways traditional dependencies don’t. They are non-deterministic, occasionally unavailable, and can return plausible but wrong outputs. I treated the model as an unreliable dependency from day one, something to isolate, observe, and swap, not something to trust implicitly.

Swappable LLMs via a factory

A simple factory pattern makes experimentation cheap:

def get_llm():

    if provider == “openai”:

        return OpenAIChat(...)

    if provider == “gemini”:

        return GeminiChat(...)

Switching providers requires only configuration changes. Call sites don’t care. This is exactly where frameworks like LangChain shine: as an abstraction layer. They handle the messy API differences between providers so that OpenAIChat and GeminiChat can expose the same interface to your application. Using them here makes swapping models trivial, without letting them dictate your control flow.

Explicit orchestration over chains

Generation is intentionally step-by-step:

async def generate_answer(request):

    retrieval_response = await retrieve(query=request.query, ...)

    context_text = format_docs(retrieval_response)

    messages = get_rag_prompt().format_messages(

        context=context_text,

        question=request.query,

    )

    llm = get_llm()

    ai_message = await _invoke_llm_with_retry(llm, messages)

    return GenerateResponse(answer=ai_message.content, ...)

I avoided using LangChain’s expression language (LCEL) or runnable abstractions to build this flow. While powerful, they can hide what’s happening. Explicit orchestration is easier to debug, instrument, and reason about, especially while learning. This resonated with me even more since I’m used to a hands-on approach where I can write code and truly understand how the logic flows.

Retries are operational, not semantic

LLM calls fail for mundane reasons: transient network issues, provider-side throttling, or brief outages. I treat those as operational failures, not model behavior, and handle them explicitly.

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(

    stop=stop_after_attempt(3),

    wait=wait_exponential(multiplier=1, min=1, max=10),

)

async def _invoke_llm_with_retry(llm, messages):

    return await llm.ainvoke(messages)

Retries don’t make the model correct, they make the system resilient.

Phase 4. Serving: Thin Adapters, Shared Core

What I built: two interfaces over the same RAG core: a REST API and an MCP server.

In many RAG implementations, the retrieval logic is tightly coupled to the web framework (e.g., defined inside a FastAPI route). This makes it hard to test the logic in isolation or reuse it in different contexts (like a CLI or an evaluation script).

Instead, I treated my RAG system as a standalone library. The core function ‘generate_answer’ takes a pure Pydantic object and returns one. It knows nothing about HTTP, headers, or JSON.

This architecture allowed me to treat serving as a thin adapter pattern.

Adapter 1: REST API (FastAPI)

The REST adapter serves traditional software systems that need deterministic access to the retrieval layer. This includes web applications, backend services, internal tooling, evaluation pipelines, and automation jobs. These are environments where the caller decides exactly when and how the capability should be invoked.

The web layer itself does no extra work. It merely deserializes JSON, calls the core, and serializes the result.

@router.post(”“, response_model=GenerateResponse)

async def query(request: GenerateRequest) -> GenerateResponse:

    return await generate_answer(request)

Adapter 2: MCP Server (Capability Interface for Tool-Using LLMs)

Exposing the same core through the Model Context Protocol (MCP) transforms my RAG pipeline from an application-bound feature into a standardized capability.

MCP standardizes how capabilities are exposed to tool-using LLMs, regardless of whether the caller is a chat assistant, a coding copilot, or an autonomous agent.

I’m used to abstraction via careful refactoring, and it didn’t take long to understand that MCP was just another way of achieving this in the context of AI.

MCP-compatible clients such as Claude Desktop, Cowork, or Cursor can connect to the server and invoke the query_rag tool directly. This allows the underlying LLM to ground its responses in private data without requiring custom integrations, plugins, or connector logic.

Direct tool access is useful, but the MCP interface becomes far more valuable as agents adopt skills to carry out knowledge work and other multi-step tasks. For example, a “Market Research Skill” might combine web search, financial data lookup, and document retrieval. By exposing my RAG system as an MCP Tool, it becomes a standardized block that these skills can easily include in their workflows, without needing custom code.

@mcp.tool()

async def query_rag(query: str, top_k: int = 5, rerank: bool = True) -> dict:

    request = GenerateRequest(query=query, top_k=top_k, rerank=rerank)

    response = await generate_answer(request)

    return response.model_dump()

Both interfaces share the same core logic, thus avoiding duplication. Serving is an adapter problem, not a RAG problem.

Data lineage & traceability

Traceability isn’t new. Long before LLMs, production systems relied on lineage and identifiers to make failures debuggable. LLM non-determinism makes that discipline more important, not less.

Debugging RAG systems almost always means reasoning backward, from an answer, to retrieved chunks, to embeddings, and finally to source files.

In practice, this meant persisting identifiers at every step. Retrieved results carry chunk IDs forward. Generation logs include the IDs of the chunks used as context. When an answer looks wrong, I can trace it deterministically back to its source.

Without lineage, every bad answer looks like a model problem. With it, failures become diagnosable and fixable.

Vendor-neutral observability

This isn’t RAG specific. It’s the same observability discipline I’ve applied in other production systems. I deliberately kept it vendor-neutral, following a pattern I’ve used before to keep core logic decoupled from tooling.

Beyond tracing execution paths, tools like Opik let me reason about operational realities: latency per phase, token usage, and cost per request. Being able to see which model was invoked, how many tokens were consumed, and where time was spent turns performance and cost from assumptions into measurable signals.

def track(name: str = None, phase: Phase = None):

    def decorator(func):

        tags = [f”phase:{phase.value}”] if phase else []

        @opik.track(name=name, tags=tags)

        def wrapper(*args, **kwargs):

            return func(*args, **kwargs)

        return wrapper

    return decorator

If I ever switch observability tools, business code doesn’t change.

What I’m Exploring Next

Next steps include:

Adding durable workflow orchestration (DBOS or Prefect)
Implementing systematic evaluation for retrieval quality and faithfulness
Exploring more advanced retrieval patterns

Each will be added deliberately, one constraint at a time.

Closing Thoughts

Moving from keyword search to semantic and multimodal understanding is a massive leap in how we solve problems. While this technology introduces an ambiguity that contrasts with the deterministic systems I’ve built before, the incredible advantages and sheer problem-solving power it offers make the challenge truly exciting.

Building RAG this way slowed me down, deliberately.

What I have now is a system I can inspect, rerun, and reason about when something goes wrong. For me, that’s a better foundation than a faster demo.

I’m still learning RAG. But I’m learning it with the same instincts that shaped the rest of my career: make systems observable, design for failure, and own the control flow before adding abstraction.

Code: https://github.com/CalvHobbes/rag-101

Inspired by: From 100+ AI Tools to 4: My Production Stack by Paul Iusztin

See you next time.

Priya

What’s your opinion? Do you agree, disagree, or is there something I missed?

Leave a comment

Enjoyed the article? The most sincere compliment is to share our work.

Whenever you’re ready, here is how I can help you

35 lessons. Pure foundations from scratch. 4 mini-projects. 2 production systems. A certificate and direct access to me & industry experts in our Discord.

Built for software and data professionals transitioning into AI engineering. Rated 5/5 with 300+ students. The first 7 lessons are free:

Start here

Not ready to commit? Start with our free Agent AI Engineering Guide, a 6-day email course on the mistakes that silently break AI agents in production.

Images

If not otherwise stated, all images are created by the author.

Decoding AI Magazine

Your Second Brain Is a Graveyard. Make It Agent Memory.

Want your AI work featured across three platforms?

Why a Bigger Context Window Won’t Save You

The Deep-Research Loop, Version 1: Mining the Public Web

Version 2: Point the Loop at Your Second Brain

Version 3: From a Static Pile to a Living Wiki

A Memory Layer Built From Plain Files and No Database

Querying the Wiki, and Why It Never Freezes

Scope It to a Project With PARA

See It Run: Three Demos

Where This Is Going

What’s Next

Whenever you’re ready, here is how I can help you

Explore Next

Images

From Harness Lock-In to Portable Context Layer

The Architecture of Your Context Layer

Building a Unified Memory for Continual Learning

Build It Three Ways

Is MongoDB Enough?

Using Your Context Layer With Any Agent

What’s Next

Whenever you’re ready, here is how I can help you

Explore Next

Images

How Evaluation-Driven Development (EDD) Works

The Develop-a-Feature Workflow

From an architectural perspective, we have:

Two Modes: Manual Quick Check vs. Automated Experiments

Scope the Change and Simulate Its Traces

Context Population: Mocking Production State

On-Demand Datasets

Define the Judge

Run and Compare Experiments

Don’t Run Online Evals

🎥 Watch the full conversation between Alejandro Aboy and me

Final Thoughts

Whenever you’re ready, here is how I can help you

Images

Build, Configure, or Use As-Is: The Agentic Harness

The 80% Every Harness Shares

The Tools

The Agent Catalog Is Just a Config File

A Subagent Is a New Loop

Skills

Memory Is the Layer You Actually Build

The Sandbox: One Jail, Many Remote Workers

The Permission Layer Has Almost No AI in It

What’s Next

Whenever you’re ready, here is how I can help you

Images

How to Keep Your AI Agent's Knowledge Graph Clean

One Pipeline, Five Steps

Entity Resolution: “What Should We Call This?”

Deduplication: “Is This the Same Entity?”

When Confidence Lands in the Gray Zone

Cleaning the Graph While It Sleeps

What’s Next

Whenever you’re ready, here is how I can help you

References

Images

Stop Chasing the Perfect Ontology

What Is an Ontology?

The Overkill Trap: Why My Knowledge Graphs Never Shipped

The POLE+O Data Model

Preferences: The Things a Noun Likes

Facts: The Trick You Haven’t Thought Of

What’s Next

Whenever you’re ready, here is how I can help you

References

Images

Inside Neo4j's Agent Memory

What’s Inside neo4j-labs/agent-memory

Short-Term, Long-Term, Reasoning Memory

The Ontology

Extraction: From Raw Text to Typed Entities

When Two Mentions Are the Same Entity (And When They Aren’t)

Zooming into the Retrieval Algorithm

What’s Next

What’s Inside `neo4j-labs/agent-memory`

How `/research_create` Works

How `/research_search` Works

How `/research_distill` Works