<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Decoding AI Magazine]]></title><description><![CDATA[Join for content on designing, building, and shipping AI software. Learn AI engineering, end-to-end, from idea to production. Every Tuesday.]]></description><link>https://www.decodingai.com</link><image><url>https://substackcdn.com/image/fetch/$s_!k2ig!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png</url><title>Decoding AI Magazine</title><link>https://www.decodingai.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 27 May 2026 07:58:49 GMT</lastBuildDate><atom:link href="https://www.decodingai.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Paul Iusztin]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[decodingai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[decodingai@substack.com]]></itunes:email><itunes:name><![CDATA[Paul Iusztin]]></itunes:name></itunes:owner><itunes:author><![CDATA[Paul Iusztin]]></itunes:author><googleplay:owner><![CDATA[decodingai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[decodingai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Paul Iusztin]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Stop Chasing the Perfect Ontology]]></title><description><![CDATA[Start with a fixed, generic base and extend only when your data demands it.]]></description><link>https://www.decodingai.com/p/ship-a-knowledge-graph-ontology-in-5-minutes</link><guid 
isPermaLink="false">https://www.decodingai.com/p/ship-a-knowledge-graph-ontology-in-5-minutes</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 26 May 2026 05:00:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kNpy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kNpy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kNpy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!kNpy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!kNpy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!kNpy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!kNpy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;From files to a living graph: start with a fixed generic base, extend only when your data demands it.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="From files to a living graph: start with a fixed generic base, extend only when your data demands it." title="From files to a living graph: start with a fixed generic base, extend only when your data demands it." 
srcset="https://substackcdn.com/image/fetch/$s_!kNpy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!kNpy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!kNpy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!kNpy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84fca35c-39fc-48e6-86d9-fb50fcee9e7b_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>For a while now I&#8217;ve been trying to build a proper memory layer on top of my research, writing, and content creation. Today it all lives in my Second Brain in Obsidian, where the primitives are files like notes, videos, and articles.</p><p>What I actually want is to shift those primitives from files to entities and relationships, such as people, locations, objects, topics, preferences, and facts. I want the memory to get closer to reality so I can watch how things evolve over time. I want a knowledge graph.</p><p>Everyone agrees knowledge graphs and GraphRAG provide a more performant substrate for a unified agent memory layer than plain RAG. But kicking one off is far harder. The resistance always collapses to the same wall: how you model your data. Your ontology is the hardest part of the system.</p><p>If you can&#8217;t define your ontology properly for your domain, the graph won&#8217;t represent the reality you want. The right entities and relationships simply aren&#8217;t there. As a result, GraphRAG ends up performing worse than the simple RAG you were trying to beat.</p><p>This translates straight to a memory layer. There&#8217;s no dodging it. Even if you stay file-only (a &#8220;virtual knowledge graph,&#8221; like an LLM knowledge base over your notes), you still hit the same data-modelling question: which primitives, and which entities, do you even extract?</p><p>The instinctive reaction is to design the perfect, complete ontology upfront. That&#8217;s exactly the trap that freezes the project.</p><p>The strategy is a not-overkill ontology. 
You need something flexible enough to kick off with almost no friction before you really know your domain, extending it with domain-specific detail as you explore your data.</p><p>Concretely, you use a small, fixed, generic, but extendable noun data model, known as POLE+O. Plus two core primitives, Preferences and Facts, for everything that doesn&#8217;t fit into the nouns.</p><p>You ship something that works, then add subtypes as a lightweight data-exploration step shows you where the generic types clash with your real data.</p><p>This approach lets you stand up a knowledge-graph memory layer for your own assistant without burning weeks on schema design. To build this, we first need to understand what an ontology actually is and why targeted models beat exhaustive ones.</p><div class="callout-block" data-callout="true"><h2><a href="https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Start Your Transition Into AI Engineering (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XTiA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XTiA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 424w, https://substackcdn.com/image/fetch/$s_!XTiA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 848w, 
https://substackcdn.com/image/fetch/$s_!XTiA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 1272w, https://substackcdn.com/image/fetch/$s_!XTiA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XTiA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png" width="463" height="488.3527827648115" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1175,&quot;width&quot;:1114,&quot;resizeWidth&quot;:463,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XTiA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 424w, https://substackcdn.com/image/fetch/$s_!XTiA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 848w, 
https://substackcdn.com/image/fetch/$s_!XTiA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 1272w, https://substackcdn.com/image/fetch/$s_!XTiA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977ee5b6-01a9-4bf9-a923-d092a8f5ac28_1114x1175.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This article showed how to design the ontology your knowledge-graph memory needs. 
My <a href="https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a> shows the harness around it. I just released a free preview to build and run a working agent in 5 minutes.</p><p>You build a multi-agent system with two MCP servers (Research Agent + Writing Workflow), a deep research algorithm, an evaluator-optimizer loop, observability, and LLM-as-judge evals. These are the patterns required to ship AI.</p><p>Built for software engineers, data engineers, and data scientists transitioning into AI engineering.</p><p>7 free lessons, 2 MCP agents ready for your GitHub portfolio. Part of our 35-lesson course. Rated 5/5 by 300+ students.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start the free preview &#8594;&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start the free preview &#8594;</span></a></p></div><h2>What Is an Ontology?</h2><p>An ontology is the formal answer to one question: when you read the world, what do you write down as nodes, and what do you draw as edges? It specifies the kinds of things that exist in your domain, their properties, and how they relate to each other.</p><p>The ontology&#8217;s job is to map a targeted slice of the real world into the digital world. A good ontology is highly targeted to the problem you actually want to solve. If you over-model, you drown in noise and never ship, and the knowledge graph gets extremely expensive to extract and maintain. 
If you under-target, the graph doesn&#8217;t reflect the reality you care about.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!shkp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!shkp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 424w, https://substackcdn.com/image/fetch/$s_!shkp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 848w, https://substackcdn.com/image/fetch/$s_!shkp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 1272w, https://substackcdn.com/image/fetch/$s_!shkp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!shkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png" width="1400" height="1202" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1202,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;An ontology is a deliberately narrow funnel from the real world into a queryable graph.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An ontology is a deliberately narrow funnel from the real world into a queryable graph." title="An ontology is a deliberately narrow funnel from the real world into a queryable graph." srcset="https://substackcdn.com/image/fetch/$s_!shkp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 424w, https://substackcdn.com/image/fetch/$s_!shkp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 848w, https://substackcdn.com/image/fetch/$s_!shkp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 1272w, https://substackcdn.com/image/fetch/$s_!shkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da9e68c-1cec-4f99-afe9-ada0003fd270_1400x1202.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>An ontology is a deliberately narrow funnel from the real world into a queryable graph.</em></figcaption></figure></div><p>Look at concrete, shipped ontologies for real-world proof. The <a href="https://create-context-graph.dev/docs/reference/domain-catalog">create-context-graph</a> domain catalog made by Neo4j publishes 22 ready-made domain ontologies. Every single one lands at between 10 and 12 entity types: a shared 5-noun base plus only 5 to 7 domain-specific nouns.</p><p>For example, the Personal Knowledge domain models the world as Note, Contact, Project, Topic, Bookmark, and JournalEntry. The Agent Memory domain uses Agent, Conversation, Memory, ToolCall, and Session. 
The lesson here is that real ontologies are small on purpose. They capture only the entities required to answer the questions the system is designed for.</p><p>So if targeted and small is the goal, why does everyone (me included) reach for big and perfect first? That&#8217;s the trap.</p><h2>The Overkill Trap: Why My Knowledge Graphs Never Shipped</h2><p>When I first encountered the ontology concept, I assumed I had to study my domain in depth. I thought I needed to model all of finance, for example, and design the ideal ontology before working with any real data. You can&#8217;t actually do that before you have a system running and data to look at. You just pile up assumptions that mostly turn out wrong.</p><p>I froze. Every knowledge-graph solution I started stayed on my laptop and never got used, because I was waiting on an ideal ontology I could never reach. Without understanding the ontology, I couldn&#8217;t even write a decent extraction step to populate it. I was deadlocked, delivering zero value.</p><p>The breakthrough was realizing I needed a couple of models that let me start generic and extend over time. As I get more data, analyze it, and actually understand my problem, the schema evolves. Let&#8217;s meet the base model that lets you start in 5 minutes instead of 5 weeks.</p><h2>The POLE+O Data Model</h2><p>POLE+O is a tiny, fixed, top-level vocabulary that can classify almost anything you pull out of text. It stands for Person, Object, Location, Event, and Organization <a href="https://neo4j.com/labs/agent-memory/explanation/poleo-model/">[2]</a>. The original POLE model comes from law-enforcement and intelligence analysis; the Organization type (the +O) was added for general-purpose entity extraction. The point of a fixed base is queryability. 
There are always exactly 5 base nouns to filter on, so the graph stays answerable no matter how it grows underneath.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wK0q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wK0q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 424w, https://substackcdn.com/image/fetch/$s_!wK0q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 848w, https://substackcdn.com/image/fetch/$s_!wK0q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!wK0q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wK0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png" width="1400" height="1159" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1159,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;5 fixed base nouns, each extensible with optional subtypes &#8212; the base never changes, so every refinement is additive.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="5 fixed base nouns, each extensible with optional subtypes &#8212; the base never changes, so every refinement is additive." title="5 fixed base nouns, each extensible with optional subtypes &#8212; the base never changes, so every refinement is additive." 
srcset="https://substackcdn.com/image/fetch/$s_!wK0q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 424w, https://substackcdn.com/image/fetch/$s_!wK0q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 848w, https://substackcdn.com/image/fetch/$s_!wK0q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 1272w, https://substackcdn.com/image/fetch/$s_!wK0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34af01f1-f989-43ca-a1ca-f0e76cfa57fd_1400x1159.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>5 fixed base nouns, each extensible with optional subtypes</em></figcaption></figure></div><p>Person covers people, aliases, and personas. Object covers physical or digital things. Location covers places, addresses, and regions. Event covers meetings, transactions, and incidents. Organization covers companies, teams, and institutions. Two or three of these catch the overwhelming majority of what a personal assistant needs.</p><p>Here are POLE+O&#8217;s five base types and the default subtypes each one ships with:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yyhn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yyhn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 424w, https://substackcdn.com/image/fetch/$s_!yyhn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 848w, https://substackcdn.com/image/fetch/$s_!yyhn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yyhn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yyhn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png" width="1456" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;table&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="table" title="table" srcset="https://substackcdn.com/image/fetch/$s_!yyhn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 424w, https://substackcdn.com/image/fetch/$s_!yyhn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 848w, https://substackcdn.com/image/fetch/$s_!yyhn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yyhn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc57663-f5d0-4f97-b81a-60c1bf2c34b9_1920x883.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here&#8217;s the beauty of this approach. You extend the base nouns with your own subtypes, and that&#8217;s how you tailor a generic ontology to your specific domain. It works exactly like object-oriented programming. You start from base classes you adopt without thinking. 
Then you subclass into specifics as your use case clarifies.</p><p>You can kick off with nothing extended and add concrete types only as you understand your data better. Neo4j&#8217;s <a href="https://github.com/neo4j-labs/agent-memory">agent-memory</a> library uses precisely this approach. POLE+O is its default, swappable ontology.</p><p>The data-exploration workflow runs in a simple loop. First, kick off with generic POLE+O. Second, run an exploration extraction over your real data. Forget production reliability. You only care about understanding what&#8217;s there. Third, inspect the graph for clashes where the generic model lies about your data. Fourth, add or rename subtypes to fix each clash. Finally, repeat the process. You won&#8217;t get it perfect, and that&#8217;s the point. You iterate like any other AI app instead of freezing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pksj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pksj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 424w, https://substackcdn.com/image/fetch/$s_!pksj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 848w, https://substackcdn.com/image/fetch/$s_!pksj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 1272w, 
https://substackcdn.com/image/fetch/$s_!pksj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pksj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png" width="1400" height="1351" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1351,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;You don't theorize subtypes &#8212; you discover them by watching where generic POLE+O mislabels your real data, then patch the clash and loop.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="You don't theorize subtypes &#8212; you discover them by watching where generic POLE+O mislabels your real data, then patch the clash and loop." title="You don't theorize subtypes &#8212; you discover them by watching where generic POLE+O mislabels your real data, then patch the clash and loop." 
srcset="https://substackcdn.com/image/fetch/$s_!pksj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 424w, https://substackcdn.com/image/fetch/$s_!pksj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 848w, https://substackcdn.com/image/fetch/$s_!pksj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 1272w, https://substackcdn.com/image/fetch/$s_!pksj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cbabfc9-cba1-433b-8ed2-b5d593ef3c81_1400x1351.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>You discover subtypes by watching where generic POLE+O mislabels your real data, then patch the clash and loop.</em></figcaption></figure></div><p>Look at named examples from real extraction runs. Claude Code comes back tagged as a Person when it&#8217;s clearly an Object. The &#8220;AI Engineer&#8221; conference lands as an Event when you wanted an Organization. DeepSeek gets tagged as a Person instead of an Object.</p><p>Portugal and New York both get a flat Location label even though one&#8217;s a country and one&#8217;s a city. An agentic harness shows up as a generic Object when, for knowledge work, you&#8217;d rather have a Topic type. Each clash is a signal to add one subtype, not to redesign the whole schema.</p><p>POLE+O nouns and their subtypes cover the things in your world. But to fill in the gaps, there are two special tricks we have to go over.</p><h2>Preferences: The Things a Noun Likes</h2><p>Preferences are the second family of entities you attach to the graph. A Preference captures what a noun likes or dislikes. It is a characteristic of an entity that represents a stance. 
The canonical case is a person who likes, prefers, or dislikes something.</p><p>Concretely, a Preference entity looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3KUZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3KUZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 424w, https://substackcdn.com/image/fetch/$s_!3KUZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 848w, https://substackcdn.com/image/fetch/$s_!3KUZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 1272w, https://substackcdn.com/image/fetch/$s_!3KUZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3KUZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png" width="1456" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!3KUZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 424w, https://substackcdn.com/image/fetch/$s_!3KUZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 848w, https://substackcdn.com/image/fetch/$s_!3KUZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 1272w, https://substackcdn.com/image/fetch/$s_!3KUZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9cbc22-bb97-4bd4-99fa-a86ad59e6c60_2120x819.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><code>category</code> groups the preference, <code>preference</code> is the statement itself, and <code>context</code> optionally records when or where it applies. <code>confidence</code> runs from 0 to 1. The <code>embedding</code> makes it semantically searchable.</p><p>Make it concrete. &#8220;Loves Italian food&#8221;, &#8220;prefers dark mode&#8221;, and &#8220;dislikes long meetings&#8221; are clear examples. Each is a stable stance the assistant should remember and adapt to.</p><p>By default, a Preference hangs off the Person. That&#8217;s the most common and useful case. You can extend Preferences to other entities, like an Organization&#8217;s policies, a car&#8217;s settings, or an Event&#8217;s dress code.</p><p>Because I&#8217;m building a personal assistant, I start by attaching Preferences only to the Person. This keeps the graph clean, low-noise, and small. 
I&#8217;ll extend it later only when a concrete use case demands it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hSPb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hSPb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 424w, https://substackcdn.com/image/fetch/$s_!hSPb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 848w, https://substackcdn.com/image/fetch/$s_!hSPb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 1272w, https://substackcdn.com/image/fetch/$s_!hSPb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hSPb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png" width="1400" height="879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:879,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Start simple &#8212; preferences attached only to the user; the dotted edges are extensions you add only when you need them.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Start simple &#8212; preferences attached only to the user; the dotted edges are extensions you add only when you need them." title="Start simple &#8212; preferences attached only to the user; the dotted edges are extensions you add only when you need them." 
srcset="https://substackcdn.com/image/fetch/$s_!hSPb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 424w, https://substackcdn.com/image/fetch/$s_!hSPb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 848w, https://substackcdn.com/image/fetch/$s_!hSPb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 1272w, https://substackcdn.com/image/fetch/$s_!hSPb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76d5574d-28fd-47a0-94e4-4a735f05dbd4_1400x879.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Preferences attached only to the user. The dotted edges are extensions you add only when you need them.</em></figcaption></figure></div><p>Preferences are the personalization layer. They act as the memory of the user&#8217;s stances. They are the &#8220;sweet sauce&#8221; that makes every future response feel tailored.</p><p>There is one issue. Plenty of useful knowledge is just an atomic fact. Forcing all of that into the ontology is how graphs explode in complexity. The fix is a deliberately generic primitive.</p><h2>Facts: The Trick You Haven&#8217;t Thought Of</h2><p>The Fact entity is the fallback for everything that doesn&#8217;t cleanly fit a noun or a Preference. You drop the claim into a generic Fact. This is the move that keeps the ontology small and stops you from over-thinking the schema.</p><p>A Fact is the closest thing to a classic-RAG chunk. An LLM produces each Fact during extraction. Each Fact holds a single, atomic concept, which makes it a natural fit for semantic search.</p><p>The beauty is that, with Facts, you avoid the usual chunking errors, such as mid-thought splits, mixed concepts, and arbitrary boundaries. Structurally, a Fact is a triplet. 
A subject, predicate, and object like &#8220;Eiffel Tower / is / 330m tall&#8221; gets embedded and stored as 1 granular unit.</p><p>Here is the shape of a Fact entity:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g51f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g51f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 424w, https://substackcdn.com/image/fetch/$s_!g51f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 848w, https://substackcdn.com/image/fetch/$s_!g51f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 1272w, https://substackcdn.com/image/fetch/$s_!g51f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g51f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png" width="1456" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!g51f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 424w, https://substackcdn.com/image/fetch/$s_!g51f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 848w, https://substackcdn.com/image/fetch/$s_!g51f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 1272w, https://substackcdn.com/image/fetch/$s_!g51f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa162972e-7ad0-4275-94ec-4a9e654d287a_2120x819.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The triplet (<code>subject</code>, <code>predicate</code>, <code>object</code>) is the whole fact. <code>valid_from</code> and <code>valid_until</code> give it optional bi-temporal validity. The <code>embedding</code>, computed over the concatenated triplet, is what makes the fact retrievable by semantic search.</p><p>Storing a triplet as a node can look confusing at first, but this is what makes it flexible. We don&#8217;t have to model these one-off triplets in the ontology. The LLM extracts them as-is from the text.</p><p>Facts are usually wired to nothing. They have no relationships to other entities. A Fact lives in the graph store but stays independent of the graph&#8217;s structure. This works because a graph store runs vector search and graph traversal in the same query engine <a href="https://neo4j.com/labs/agent-memory/explanation/graph-architecture/">[4]</a>. 
Which means facts are retrieved only via semantic/text search.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Wtw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Wtw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 424w, https://substackcdn.com/image/fetch/$s_!-Wtw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 848w, https://substackcdn.com/image/fetch/$s_!-Wtw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!-Wtw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Wtw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png" width="1400" height="1138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1138,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Facts are atomic triplets retrieved by similarity and wired to nothing; POLE+O entities are reached by walking the graph. Same store, two retrieval modes.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Facts are atomic triplets retrieved by similarity and wired to nothing; POLE+O entities are reached by walking the graph. Same store, two retrieval modes." title="Facts are atomic triplets retrieved by similarity and wired to nothing; POLE+O entities are reached by walking the graph. Same store, two retrieval modes." 
srcset="https://substackcdn.com/image/fetch/$s_!-Wtw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 424w, https://substackcdn.com/image/fetch/$s_!-Wtw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 848w, https://substackcdn.com/image/fetch/$s_!-Wtw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!-Wtw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc37ed613-1a16-40f7-9607-2ed492a787cb_1400x1138.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Facts are atomic triplets retrieved by similarity and wired to nothing; POLE+O entities are reached by walking the graph. Same store, two retrieval modes.</em></figcaption></figure></div><p>Facts let you ship a memory layer before you have the perfect ontology. Anything you can&#8217;t yet model degrades gracefully into a searchable atomic node instead of blocking the build. Early on, you lean on Facts. As the graph matures, claims migrate toward typed entities and edges. Because a Fact has no edges, it adds nothing to the schema and needs no maintenance when entities merge or get deleted.</p><h2>What&#8217;s Next</h2><p>The takeaway is the posture. An ontology is a living artifact you bootstrap from a fixed generic base and grow through a data-exploration loop, exactly like any other AI application.</p><p>If you want to see the whole strategy implemented, the fastest path is to play with Neo4j&#8217;s <a href="https://github.com/neo4j-labs/agent-memory">agent-memory</a> SDK or its MCP server. It uses POLE+O as a swappable default, subtypes as cheap extensions, and Preferences and Facts as first-class primitives. Studying it is what made all of this finally click for me.</p><p>I&#8217;m actively migrating my own Obsidian Second Brain toward the POLE+O, Preferences, and Facts primitives. This turns thousands of files into a graph I can actually traverse, visualize, and watch evolve over time.</p><p><em>But here is what I&#8217;m wondering:</em></p><blockquote><p><em><strong>If you&#8217;ve worked with knowledge graphs, what was your process for discovering your own ontology?</strong></em></p></blockquote><p><em>Click the button below and tell me. 
I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/ship-a-knowledge-graph-ontology-in-5-minutes/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/ship-a-knowledge-graph-ontology-in-5-minutes/comments"><span>Leave a comment</span></a></p><div><hr></div><div class="callout-block" data-callout="true"><h4>Whenever you&#8217;re ready, here is how I can help you</h4><p>If you want to go from zero to shipping production-grade AI agents, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>35 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p>Built for software, data engineers or scientists transitioning into AI engineering.</p><p><em>Rated 5/5 by 300+ students. 
The first 7 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><p><em>Not ready to commit?</em> Start with our <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free Agentic AI Engineering Guide</a></strong>, a 6-day email course on the mistakes that silently break AI agents in production.</p></div><div><hr></div><h2>References</h2><ol><li><p>Create Context Graph. (n.d.). Domain Catalog. create-context-graph. https://create-context-graph.dev/docs/reference/domain-catalog</p></li><li><p>Neo4j Labs. (n.d.). POLE+O Data Model. Neo4j Agent Memory. https://neo4j.com/labs/agent-memory/explanation/poleo-model/</p></li><li><p>Neo4j Labs. (n.d.). Neo4j Agent Memory. GitHub. https://github.com/neo4j-labs/agent-memory</p></li><li><p>Neo4j Labs. (n.d.). Why Neo4j? Graph-Native Memory Architecture. Neo4j Agent Memory. 
https://neo4j.com/labs/agent-memory/explanation/graph-architecture/</p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Inside Neo4j's Agent Memory]]></title><description><![CDATA[The knowledge-graph patterns that turn one-shot conversations into compounding intelligence.]]></description><link>https://www.decodingai.com/p/understanding-neo4j-graph-agent-memory-system</link><guid isPermaLink="false">https://www.decodingai.com/p/understanding-neo4j-graph-agent-memory-system</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 19 May 2026 08:55:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dE_c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dE_c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dE_c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!dE_c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 848w, 
https://substackcdn.com/image/fetch/$s_!dE_c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!dE_c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dE_c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Knowledge-graph memory: an agent that doesn't start every conversation from zero.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Knowledge-graph memory: an agent that doesn't start every conversation from zero." title="Knowledge-graph memory: an agent that doesn't start every conversation from zero." 
srcset="https://substackcdn.com/image/fetch/$s_!dE_c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!dE_c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!dE_c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!dE_c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10ce2a6-84a5-44ba-b5b5-61af99c83e87_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I already have a second brain setup based on Obsidian, Readwise, NotebookLM, and Claude Code. I dump all my notes, research, and highlights there. Whenever I want to create content, I create a scoped wiki targeted toward the topic. I gather information from my second brain using a deep research algorithm on top of my private data and external resources via NotebookLM. The wiki is structured like the LLM Knowledge Base presented by Andrej Karpathy.</p><p>This setup fails to extract and maintain shared entities, preferences, and facts across the wiki as the knowledge base grows. For example, if the topic &#8220;Claude Code&#8221; is mentioned in 10 documents, I want to extract all the metadata about it into its dedicated folder. I want to see what other entities it relates to, such as Anthropic, San Francisco, Codex, or Gemini CLI. I also want to see how many documents mention it to rank frequency. You can do that with a pure file-based system and Obsidian, but performance degrades when your data scales past 50 documents.</p><p>The same concept applies to any unstructured knowledge base. You need a way to extract and connect knowledge from your conversations, documents, and images. This becomes essential across conversations: instead of forgetting you, your agent provides a personalized experience. It&#8217;s also critical for context engineering, where you inject the right context at the right time and keep the LLM focused on relevant facts.</p><p>Most teams default to one of two memory approaches. Both collapse under real use. 
A file system gives you append-only logs that the agent re-reads from scratch, which fragments and rots context.</p><p>A vector index gives you fuzzy semantic recall but no merge, no identity, and no way to know if this is the same Karpathy you knew yesterday. Durable AI memory requires a structured graph to track identity and relationships <a href="https://www.linkedin.com/posts/tonyseale_this-week-anthropic-dropped-claude-sonnet-activity-7379787334398926848-iVOE/">[1]</a>. Without this structure, the assistant forgets past interactions and fails to build compounding intelligence.</p><p>Knowledge-graph memory is the next step on the arc from Retrieval-Augmented Generation (RAG) to agentic RAG to agent memory <a href="https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html">[2]</a>. Building a unified knowledge-graph memory system is hard, so most teams skip it.</p><p>During my research, I stumbled upon <code>neo4j-labs/agent-memory</code>. It&#8217;s a masterpiece. Who knows more about knowledge graphs (KGs) than Neo4j?</p><p>After I spent 2 days playing with it and understanding the codebase, I realized it was the perfect mental model for any agent memory system powered by KGs.</p><p>In this article, I&#8217;ll walk through the core architectural patterns of <code>neo4j-labs/agent-memory</code>. 
It features 1 graph, 3 memory tiers, the POLE+O ontology, a 3-stage extraction pipeline, a composite resolver, and the SAME_AS pattern.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cxdW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cxdW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 424w, https://substackcdn.com/image/fetch/$s_!cxdW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 848w, https://substackcdn.com/image/fetch/$s_!cxdW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!cxdW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cxdW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png" width="1456" height="1167" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1167,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:390376,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/197969180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cxdW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 424w, https://substackcdn.com/image/fetch/$s_!cxdW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 848w, https://substackcdn.com/image/fetch/$s_!cxdW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!cxdW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44a2dc7c-3947-44f5-88fd-1a5aef196b8d_1744x1398.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>By the end, you&#8217;ll have a concrete mental model. You can ship on top of their Software Development Kit (SDK) or hook it into your agent via their Model Context Protocol (MCP) server. 
Alternatively, you can steal the patterns and ship the same architecture on Postgres or MongoDB if a full graph database in production doesn&#8217;t make sense for your use case.</p><h2>What&#8217;s Inside <code>neo4j-labs/agent-memory</code></h2><p>The SDK takes natural-language interactions on the write side and returns a fused memory context on the read side. Everything anchors to a single Neo4j graph. For our scoped wiki, notes and Readwise highlights about Claude Code flow in. A structured pull of what the agent knows about Claude Code, how it relates to Anthropic, and its frequency across 50 documents comes out.</p><p>At its core, there is 1 graph and 3 memory tiers joined by typed edges: short-term conversations, long-term typed entities, and reasoning traces. 
They&#8217;re stitched together by <code>:MENTIONS</code>, <code>:TOUCHED</code>, and <code>:INITIATED_BY</code> relationships <a href="https://neo4j.com/labs/agent-memory/explanation/memory-types/">[3]</a>.</p><p>The architecture contains 8 small, single-responsibility modules. The <code>models/</code> module holds Pydantic schemas. The <code>schema/</code> module handles Cypher migrations. The <code>extraction/</code> module runs the Named Entity Recognition (NER) pipeline. The <code>resolution/</code> module holds the composite resolver. The <code>dedup/</code> module manages the SAME_AS pattern. The <code>core/</code> module provides <code>MemoryClient.get_context()</code>. The <code>mcp/</code> module runs the FastMCP server with 15 tools. The <code>integrations/</code> module holds 9 framework adapters for tools like LangChain and LlamaIndex.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z_M-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z_M-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 424w, https://substackcdn.com/image/fetch/$s_!Z_M-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 848w, https://substackcdn.com/image/fetch/$s_!Z_M-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Z_M-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z_M-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png" width="1400" height="1311" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1311,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The 8 modules sit between an MCP / framework interface and a single Neo4j graph that holds all three memory tiers.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The 8 modules sit between an MCP / framework interface and a single Neo4j graph that holds all three memory tiers." title="The 8 modules sit between an MCP / framework interface and a single Neo4j graph that holds all three memory tiers." 
srcset="https://substackcdn.com/image/fetch/$s_!Z_M-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 424w, https://substackcdn.com/image/fetch/$s_!Z_M-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 848w, https://substackcdn.com/image/fetch/$s_!Z_M-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 1272w, https://substackcdn.com/image/fetch/$s_!Z_M-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76b9b07-fb46-40c4-81a1-61e5aff42cc0_1400x1311.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>The 8 modules sit between an MCP / framework interface and a single Neo4j graph that holds all three memory tiers.</em></figcaption></figure></div><p>Consider an end-to-end scenario. You drop a Readwise highlight about Claude Code into your scoped wiki. The <code>extraction/</code> module pulls Claude Code as an Object, Anthropic as an Organization, and Codex as an Object. The <code>resolution/</code> module canonicalizes each against existing nodes. The <code>dedup/</code> module checks vector similarity and either auto-merges or flags a pending <code>:SAME_AS</code> edge. The <code>schema/</code> module commits <code>:MENTIONS</code> edges from the note to each entity.</p><p>Later, <code>MemoryClient.get_context()</code> pulls fused context across the same graph in one call. This matters concretely for the scoped-wiki agent. You can ask what you discussed last session, what you know about Claude Code, and why the agent surfaced a Codex comparison last Tuesday. The SDK answers all three against the same graph. It uses the same Cypher dialect with no cross-store join logic.</p><h2>Short-Term, Long-Term, Reasoning Memory</h2><p>The SDK splits memory into three layers that all live on the same Neo4j graph <a href="https://neo4j.com/labs/agent-memory/explanation/memory-types/">[3]</a>. Short-term memory is the linear message sequence. It uses ordered <code>:Message</code> nodes chained by <code>:NEXT</code> edges, scoped to a <code>:Conversation</code>. Long-term memory is the typed entity graph. 
It uses deduplicated <code>:Entity</code> nodes with vector embeddings and arbitrary domain relationships.</p><p>Reasoning memory is a tree per agent run. It uses a <code>:ReasoningTrace</code> root with child <code>:ReasoningStep</code> nodes capturing thoughts and tool calls. For the scoped-wiki agent, short-term memory holds your current chat. Long-term memory holds the canonical Claude Code entity plus its relations to Anthropic, San Francisco, Codex, and Gemini CLI. Reasoning memory holds the trace of how the agent picked those specific notes to answer you.</p><p>Three relationships do the entire stitching. The <code>:MENTIONS</code> edge joins short-term to long-term memory. The <code>:INITIATED_BY</code> edge joins reasoning to short-term memory. The <code>:TOUCHED</code> edge joins reasoning to long-term memory. These three edges make provenance a one-hop query rather than a log-reconstruction project.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_7Cv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_7Cv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 424w, https://substackcdn.com/image/fetch/$s_!_7Cv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 848w, 
https://substackcdn.com/image/fetch/$s_!_7Cv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 1272w, https://substackcdn.com/image/fetch/$s_!_7Cv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_7Cv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png" width="1400" height="1370" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1370,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Three tiers, one graph &#8212; the typed edges make every cross-tier question a one-hop query.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Three tiers, one graph &#8212; the typed edges make every cross-tier question a one-hop query." title="Three tiers, one graph &#8212; the typed edges make every cross-tier question a one-hop query." 
srcset="https://substackcdn.com/image/fetch/$s_!_7Cv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 424w, https://substackcdn.com/image/fetch/$s_!_7Cv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 848w, https://substackcdn.com/image/fetch/$s_!_7Cv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 1272w, https://substackcdn.com/image/fetch/$s_!_7Cv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbbcbfc0-e8bd-484a-8ec6-eb4105fcc620_1400x1370.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Three tiers, one graph &#8212; the typed edges (</em><code>:MENTIONS</code><em>, </em><code>:INITIATED_BY</code><em>, </em><code>:TOUCHED</code><em>) make every cross-tier question a one-hop query.</em></figcaption></figure></div><p>Reasoning memory is the novel part of this architecture. By storing past successful or failed reasoning patterns in memory, the agent can one-shot similar future requests, or at least avoid repeating the same mistakes. Intuitively, it&#8217;s similar to Reinforcement Learning (RL), but instead of baking the optimizations into the weights, you store them at the database level.</p><p>The most important part of this architecture is the ontology.</p><h2>The Ontology</h2><p>Long-term memory uses a closed five-type vocabulary for its ontology, known as POLE+O: Person, Object, Location, Event, and Organization, borrowed from intelligence-analysis taxonomies <a href="https://neo4j.com/labs/agent-memory/explanation/poleo-model/">[5]</a>. Every entity is exactly one of these five types. Subtypes are open, but the top-level vocabulary is fixed.</p><p>In the personal assistant, Karpathy is a Person. Claude Code is an Object. Anthropic is an Organization. Your Tuesday deep-research run is an Event. San Francisco is a Location.</p><p>Type and subtype materialize as multi-tier Neo4j labels. The query builder sanitizes and PascalCases them into labels like <code>:Entity:Person:Individual</code>. Because type and subtype are both labels, you can filter on either one with a plain label match, which keeps these lookups cheap.</p><p>Using this strategy, you can extend each core type from POLE+O with subtypes from your own domain. 
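</p><p>To make the label mechanics concrete, here is a minimal Python sketch of how a type and an optional subtype could be turned into a multi-tier label. The helper names <code>to_pascal_case</code> and <code>build_labels</code>, and the sanitization rule itself, are assumptions made for illustration, not the SDK&#8217;s actual query builder.</p>

```python
import re
from typing import Optional

# The five fixed top-level types from the article.
POLE_O_TYPES = {"Person", "Object", "Location", "Event", "Organization"}


def to_pascal_case(raw: str) -> str:
    """Keep only alphanumeric runs and join them as PascalCase (assumed rule)."""
    words = re.findall(r"[A-Za-z0-9]+", raw)
    return "".join(word[0].upper() + word[1:] for word in words)


def build_labels(entity_type: str, subtype: Optional[str] = None) -> str:
    """Return a label string such as ':Entity:Person:Individual'."""
    top = to_pascal_case(entity_type)
    if top not in POLE_O_TYPES:
        # The top-level vocabulary is closed, so anything else is rejected.
        raise ValueError(f"{entity_type!r} is not a POLE+O type")
    labels = ["Entity", top]
    if subtype:
        sub = to_pascal_case(subtype)
        if sub:  # subtypes are open, but must survive sanitization
            labels.append(sub)
    return ":" + ":".join(labels)


print(build_labels("person", "individual"))
print(build_labels("organization", "tech company"))
```

<p>In this sketch the top-level type is validated against the closed set, while the subtype is only sanitized, which mirrors the &#8220;fixed base, open subtypes&#8221; idea.</p><p>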
Other default labels include <code>:Entity:Location:City</code>, <code>:Entity:Event:Concert</code>, and <code>:Entity:Organization:Company</code>. <a href="https://create-context-graph.dev/docs/reference/domain-catalog">Here</a> is a catalog of over 20 domains such as Data Journalism, Gaming, Personal Knowledge, and Product Management.</p><p>Entities modeled via POLE+O are nouns. The SDK adds 2 other node types beyond entities.</p><p><code>:Fact</code> nodes hold every claim mentioned in the text. They&#8217;re intentionally generic so the ontology doesn&#8217;t get over-specified, and they serve as a fallback when nothing else fits. You can think of them as chunks of text that contain only 1 concept.</p><p>Then there are <code>:Preference</code> nodes, which store user preferences and track how they change over time via a <code>:SUPERSEDED_BY</code> relationship. As agent memory is user-centric, this provides the WOW effect where the agent remembers past preferences and learns from them over time.</p><p>For the scoped wiki, &#8220;Anthropic developed Claude Code&#8221; is an edge. &#8220;Claude Code 1.0 shipped in 2025&#8221; is a <code>:Fact</code>. 
&#8220;I prefer agent-harness comparisons over pure benchmarks&#8221; is a <code>:Preference</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m-oO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m-oO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 424w, https://substackcdn.com/image/fetch/$s_!m-oO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 848w, https://substackcdn.com/image/fetch/$s_!m-oO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 1272w, https://substackcdn.com/image/fetch/$s_!m-oO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m-oO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png" width="1400" height="1357" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1357,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A scoped-wiki graph built from the five POLE+O types.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A scoped-wiki graph built from the five POLE+O types." title="A scoped-wiki graph built from the five POLE+O types." srcset="https://substackcdn.com/image/fetch/$s_!m-oO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 424w, https://substackcdn.com/image/fetch/$s_!m-oO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 848w, https://substackcdn.com/image/fetch/$s_!m-oO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 1272w, https://substackcdn.com/image/fetch/$s_!m-oO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96ae330f-3585-4cc7-9dc5-831f2c778da4_1400x1357.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>A scoped-wiki graph built from the five POLE+O types &#8212; every node is exactly one of Person, Object, Location, Event, Organization, and every typed relationship is a </em><code>:RELATED_TO</code><em> edge with the semantic name carried as a property.</em></figcaption></figure></div><h2>Extraction: From Raw Text to Typed Entities</h2><p>The SDK runs entity extraction as a speed-versus-accuracy ladder. It uses spaCy for fast statistical NER, GLiNER and GLiREL for zero-shot extraction, and an LLM stage both for cases that need real semantics and for extracting the relationships between entities <a href="https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/">[6]</a>.</p><p>Each stage maps its outputs back to POLE+O types. The SDK applies explicit merge strategies when 2 extractors disagree. 
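</p><p>One way to picture a merge strategy is the Python sketch below, where each stage emits mentions carrying a POLE+O type and a confidence, and the most confident stage wins on disagreement. The <code>Mention</code> dataclass, the stage names, and the &#8220;highest confidence wins&#8221; rule are all hypothetical; the SDK&#8217;s real strategies may differ.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str          # surface form found in the note
    pole_type: str     # one of the five POLE+O types
    confidence: float  # the extractor's own score in [0, 1]
    source: str        # which stage produced it (hypothetical names)


def merge_mentions(*stage_outputs: list[Mention]) -> dict[str, Mention]:
    """Keep one mention per surface form; the most confident stage wins."""
    merged: dict[str, Mention] = {}
    for mentions in stage_outputs:
        for mention in mentions:
            key = mention.text.casefold()
            current = merged.get(key)
            # Ties keep the earlier (cheaper) stage's answer.
            if current is None or mention.confidence > current.confidence:
                merged[key] = mention
    return merged


spacy_out = [
    Mention("Anthropic", "Organization", 0.90, "spacy"),
    Mention("Claude Code", "Person", 0.40, "spacy"),
]
gliner_out = [Mention("Claude Code", "Object", 0.80, "gliner")]

merged = merge_mentions(spacy_out, gliner_out)
print(merged["claude code"].pole_type, merged["claude code"].source)
```

<p>Here the two stages disagree on the type of Claude Code, and the more confident zero-shot result overrides the statistical one.</p><p>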
When you drop a Readwise highlight about Claude Code into your scoped wiki, spaCy lifts proper nouns like Anthropic and San Francisco. GLiNER catches domain entities like Claude Code and Gemini CLI. The LLM stage only fires when the previous 2 stages leave ambiguity, or when the model needs to extract relationships.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sw5N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sw5N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 424w, https://substackcdn.com/image/fetch/$s_!sw5N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 848w, https://substackcdn.com/image/fetch/$s_!sw5N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 1272w, https://substackcdn.com/image/fetch/$s_!sw5N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sw5N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png" width="1400" height="1369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1369,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;From raw text to a clean graph &#8212; the three-zone SAME_AS pattern is what stops the same entity from becoming three nodes.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="From raw text to a clean graph &#8212; the three-zone SAME_AS pattern is what stops the same entity from becoming three nodes." title="From raw text to a clean graph &#8212; the three-zone SAME_AS pattern is what stops the same entity from becoming three nodes." 
srcset="https://substackcdn.com/image/fetch/$s_!sw5N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 424w, https://substackcdn.com/image/fetch/$s_!sw5N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 848w, https://substackcdn.com/image/fetch/$s_!sw5N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 1272w, https://substackcdn.com/image/fetch/$s_!sw5N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad32c39-16d8-4231-b576-69087bff7511_1400x1369.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>From raw text to a clean graph &#8212; the three-zone SAME_AS pattern is what stops the same entity from becoming three nodes.</em></figcaption></figure></div><p>Routing every mention through an LLM would multiply extraction cost for only a marginal recall gain on rare entities. The ladder pushes high-confidence cases to cheap models. It escalates only ambiguous mentions to the zero-shot models and reserves the LLM stage for when real semantics matter.</p><p>The harder problem is the normalization step.</p><h2>When Two Mentions Are the Same Entity (And When They Aren&#8217;t)</h2><p>Resolution and deduplication are 2 different problems. Resolution sets a canonical string property on an existing reference. Deduplication decides whether a new node gets created at all. Conflating them is how graphs end up with 3 Anthropic nodes that none of your queries find together <a href="https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/">[7]</a>.</p><p>Resolution runs 3 strategies on the name field in cost order. Exact matches existing canonical strings. Fuzzy uses RapidFuzz string similarity for surface variants like &#8220;A. Karpathy&#8221; and &#8220;Karpathy, Andrej&#8221;. Semantic falls back to embedding similarity for cases like &#8220;the founder of Eureka Labs&#8221;. Every strategy only compares candidates of the same type, so a Person only resolves against Person candidates.</p><p>After resolution runs, two mentions like &#8220;Apple&#8221; and &#8220;Apple Inc.&#8221; end up with different surface names but the same canonical name. A shared canonical name doesn&#8217;t prove a shared referent, though, which is why a second step is needed. 
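</p><p>The resolution cascade can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the <code>KNOWN</code> registry, the <code>resolve</code> function, and both thresholds are made up for the example, and the standard library&#8217;s <code>difflib</code> stands in for RapidFuzz so the snippet runs without extra dependencies. The semantic stage is left as a pluggable scorer because it would need an embedding model.</p>

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

# canonical name -> POLE+O type; a stand-in for entities already in the graph
KNOWN = {"Andrej Karpathy": "Person", "Anthropic": "Organization"}


def token_sort_ratio(a: str, b: str) -> float:
    """Order-insensitive similarity in [0, 1] (stdlib stand-in for RapidFuzz)."""
    def norm(s: str) -> str:
        cleaned = "".join(ch if ch.isalnum() else " " for ch in s.casefold())
        return " ".join(sorted(cleaned.split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def resolve(mention: str, pole_type: str,
            semantic: Optional[Callable[[str, str], float]] = None,
            fuzzy_min: float = 0.85, semantic_min: float = 0.80) -> Optional[str]:
    """Return a canonical name, trying exact, then fuzzy, then semantic."""
    # Only compare against candidates of the same POLE+O type.
    candidates = [name for name, kind in KNOWN.items() if kind == pole_type]
    for name in candidates:                       # 1. exact (cheapest)
        if name.casefold() == mention.casefold():
            return name
    scorers = [(token_sort_ratio, fuzzy_min)]     # 2. fuzzy
    if semantic is not None:
        scorers.append((semantic, semantic_min))  # 3. semantic (costliest)
    for scorer, minimum in scorers:
        best = max(candidates, key=lambda name: scorer(mention, name), default=None)
        if best is not None and scorer(mention, best) >= minimum:
            return best
    return None  # unresolved: the mention keeps its own surface name


print(resolve("Karpathy, Andrej", "Person"))
print(resolve("Anthropic", "Person"))
```

<p>The second call returns nothing because &#8220;Anthropic&#8221; is only ever compared against Person candidates, which is the same-type rule in action.</p><p>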
Deduplication looks at the semantics, not just the name.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!joRt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!joRt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 424w, https://substackcdn.com/image/fetch/$s_!joRt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 848w, https://substackcdn.com/image/fetch/$s_!joRt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 1272w, https://substackcdn.com/image/fetch/$s_!joRt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!joRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png" width="1400" height="1363" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1363,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Same name, three outcomes &#8212; high similarity auto-merges, the middle band defers to a human, and low similarity creates two nodes that share a canonical name but live as separate referents.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Same name, three outcomes &#8212; high similarity auto-merges, the middle band defers to a human, and low similarity creates two nodes that share a canonical name but live as separate referents." title="Same name, three outcomes &#8212; high similarity auto-merges, the middle band defers to a human, and low similarity creates two nodes that share a canonical name but live as separate referents." 
srcset="https://substackcdn.com/image/fetch/$s_!joRt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 424w, https://substackcdn.com/image/fetch/$s_!joRt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 848w, https://substackcdn.com/image/fetch/$s_!joRt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 1272w, https://substackcdn.com/image/fetch/$s_!joRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4cbb67-cf5c-481b-b0e9-f7fa3f45eb1a_1400x1363.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Same name, three outcomes: High similarity auto-merges, the middle band defers to a human, and low similarity creates 2 nodes that share a canonical name but live as separate referents.</em></figcaption></figure></div><p>For deduplication, the SDK combines vector and fuzzy similarity over the entire node content into a single score. Comparing full content checks that two nodes really are the same referent rather than a name coincidence, which cuts down on false-positive merges.</p><p>Scores at or above 0.95 trigger an auto-merge. Scores below 0.85 create a new node. Scores between 0.85 and 0.95 don&#8217;t silently merge. Instead, they create a <code>:SAME_AS</code> edge with a pending status. This flags the edge for a human or downstream agent to resolve later. This pattern stops &#8220;Jensen Huang the NVIDIA CEO&#8221; from merging with &#8220;Jensen Huang the Taipei dermatologist&#8221; just because their embeddings scored a 0.91 similarity <a href="https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/">[7]</a>.</p><p>A false merge is silent and costly to reverse. A false split is noisy but recoverable. You can&#8217;t undo a false merge without re-ingesting from the raw source data. 
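</p><p>The three-zone rule is small enough to write down directly. The thresholds below are the ones quoted above; the <code>Decision</code> enum and the <code>decide</code> function are hypothetical names for this sketch, not the SDK&#8217;s API.</p>

```python
from enum import Enum

AUTO_MERGE_AT = 0.95   # at or above: merge into the existing node
NEW_NODE_BELOW = 0.85  # below: create a separate node


class Decision(Enum):
    AUTO_MERGE = "auto_merge"
    PENDING_SAME_AS = "pending_same_as"  # :SAME_AS edge with a pending status
    NEW_NODE = "new_node"


def decide(score: float) -> Decision:
    """Map a combined similarity score to one of the three zones."""
    if score >= AUTO_MERGE_AT:
        return Decision.AUTO_MERGE
    if score >= NEW_NODE_BELOW:
        return Decision.PENDING_SAME_AS
    return Decision.NEW_NODE


for score in (0.97, 0.91, 0.60):
    print(score, decide(score).value)
```

<p>With these thresholds, the 0.91 score from the Jensen Huang example lands in the pending zone, so an uncertain pair is never merged silently.</p><p>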
That&#8217;s why you should leave uncertainty to a human.</p><h2>Zooming into the Retrieval Algorithm</h2><p>Because all three tiers live on one graph, a single retrieval can compose vector similarity over <code>:Entity</code> embeddings, multi-hop expansion over typed relationships, time-ordered <code>:NEXT</code> conversation walks, and reasoning-trace lookups via <code>:INITIATED_BY</code> and <code>:TOUCHED</code> joins. All of these run as steps in the same Cypher query. Neo4j 5.20 introduces <code>db.index.vector.queryNodes</code>, making vector similarity a first-class graph operation <a href="https://neo4j.com/labs/agent-memory/explanation/graph-architecture/">[4]</a>.</p><p>When you ask what you know about Claude Code, how it relates to Codex and Gemini CLI, and why you looked at it last week, the agent fuses three things in one pull. It uses vector similarity over your Readwise highlights to surface relevant passages. It uses a multi-hop traversal of <code>:DEVELOPED_BY</code> and <code>:COMPETES_WITH</code> edges to bring in Anthropic and Codex neighbors. Finally, it uses an <code>:INITIATED_BY</code> jump back to the prior conversation that discussed agent harnesses. There&#8217;s no cross-store join logic and no orchestrator.</p><p>From our tests, the library leaves the context construction to the user of the SDK. In other words, you get the whole output from the graph, and it&#8217;s your responsibility to further compress it before passing it to the LLM.</p><h2>What&#8217;s Next</h2><p>The <a href="https://github.com/neo4j-labs/agent-memory">neo4j-labs/agent-memory</a> architecture is more complex than what this article covers, but this is the core idea behind it. 
I&#8217;ll cover other components in more depth in future articles, including designing the ontology and keeping your knowledge graph clean over time.</p><p>I think this open-source repository is a perfect blueprint you can take to build your own agent memory solution, even with Postgres or MongoDB, to avoid keeping multiple databases in production. Still, Neo4j is probably the best choice for data mining and exploration.</p><p>For small to medium-scale projects with thousands of nodes and short hop traversals, I&#8217;d probably build my own agent memory solution from scratch on top of Postgres or MongoDB. I&#8217;d reach for Neo4j as an internal tool within my organization, or when the scale or complexity becomes too large for Postgres or MongoDB.</p><p><em>But here is what I&#8217;m wondering:</em></p><blockquote><p><em><strong>How are you handling agent memory today? Flat files, a vector index, a knowledge graph, or something stranger?</strong></em></p></blockquote><div><hr></div><h2>References</h2><ol><li><p>Seale, T. (n.d.). This week Anthropic dropped Claude Sonnet 4.5. LinkedIn. <a href="https://www.linkedin.com/posts/tonyseale_this-week-anthropic-dropped-claude-sonnet-activity-7379787334398926848-iVOE/">https://www.linkedin.com/posts/tonyseale_this-week-anthropic-dropped-claude-sonnet-activity-7379787334398926848-iVOE/</a></p></li><li><p>Monigatti, L. (n.d.). The Evolution From RAG to Agentic RAG to Agent Memory. Leonie Monigatti. <a href="https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html">https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html</a></p></li><li><p>Neo4j Labs. (n.d.). Understanding the Three Memory Types. Neo4j Agent Memory. <a href="https://neo4j.com/labs/agent-memory/explanation/memory-types/">https://neo4j.com/labs/agent-memory/explanation/memory-types/</a></p></li><li><p>Neo4j Labs. (n.d.). Why Neo4j? Graph-Native Memory Architecture. Neo4j Agent Memory. 
<a href="https://neo4j.com/labs/agent-memory/explanation/graph-architecture/">https://neo4j.com/labs/agent-memory/explanation/graph-architecture/</a></p></li><li><p>Neo4j Labs. (n.d.). POLE+O Data Model. Neo4j Agent Memory. <a href="https://neo4j.com/labs/agent-memory/explanation/poleo-model/">https://neo4j.com/labs/agent-memory/explanation/poleo-model/</a></p></li><li><p>Neo4j Labs. (n.d.). How Entity Extraction Works. Neo4j Agent Memory. <a href="https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/">https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/</a></p></li><li><p>Neo4j Labs. (n.d.). Entity Resolution and Deduplication. Neo4j Agent Memory. <a href="https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/">https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[From Vibe Coding to a Real Engineering Team]]></title><description><![CDATA[My Claude Code agentic coding setup that ships features end-to-end]]></description><link>https://www.decodingai.com/p/squid-my-agentic-coding-setup-may-2026</link><guid isPermaLink="false">https://www.decodingai.com/p/squid-my-agentic-coding-setup-may-2026</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 12 May 2026 11:04:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e0Qp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!e0Qp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e0Qp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!e0Qp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!e0Qp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!e0Qp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e0Qp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Squid &#8212; a six-agent team where no single agent both writes the code and decides if it's 
correct.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Squid &#8212; a six-agent team where no single agent both writes the code and decides if it's correct." title="Squid &#8212; a six-agent team where no single agent both writes the code and decides if it's correct." srcset="https://substackcdn.com/image/fetch/$s_!e0Qp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!e0Qp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!e0Qp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!e0Qp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f80fe78-e6b2-4fc9-b2e1-e4484cb9d42b_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 
8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I needed a TypeScript harness for my latest book code. It required a Terminal User Interface (TUI), an agent loop, tools, Model Context Protocol (MCP) support, skills, and slash commands. I will be honest with you. I first tried to vibe code this project.</p><p>As I knew what I was looking for, it worked. Until it didn&#8217;t. The code was working until you started looking more closely at the details. Only the first 20 characters were rendering inside the TUI, and the skills weren&#8217;t invoked by the agent loop.</p><p>So I deleted the whole code base and started over with a new strategy.</p><p>The cost of vibe coding isn&#8217;t abstract. It&#8217;s the next feature you can&#8217;t ship because you&#8217;re debugging a slash-command renderer that looked finished. This is what most people get wrong. Output that compiles and looks done breaks the moment you reach for the rough edges.</p><p>I divided the harness into tasks. I one-shotted the barebones version, which was just a TUI plus an agent loop with <code>bash</code>, <code>grep</code>, and a todo tool. 
Then I layered MCP, skills, and slash commands as separate features.</p><p>You can&#8217;t one-shot whole applications. You can one-shot big features if you scope them right and run them through a real engineering process.</p><p>This is known as agentic coding. Not vibe coding. You&#8217;re using agents to write the whole codebase, but you are still the mastermind behind everything.</p><p>But I wanted more. I wanted to automate this process, with a single constraint in mind: &#8220;the code HAS to be good&#8221;.</p><p>That&#8217;s why I built Squid. It&#8217;s an opinionated six-agent Claude Code setup available at <a href="https://github.com/iusztinpaul/squid">iusztinpaul/squid</a>. It ships features the way a real software team ships them.</p><p>Squid has already shipped our content-automation tool, expanding it from articles to posts, notes, threads, and messages. It shipped the book&#8217;s code data pipelines and TypeScript harness.</p><p>In this article I will show you how it works.</p><p>The concrete blueprint relies on a specialized team and an e2e lifecycle.</p><div class="callout-block" data-callout="true"><h2><a href="https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Start Your Transition Into AI Engineering (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2YEV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!2YEV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 424w, https://substackcdn.com/image/fetch/$s_!2YEV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 848w, https://substackcdn.com/image/fetch/$s_!2YEV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 1272w, https://substackcdn.com/image/fetch/$s_!2YEV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2YEV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png" width="1400" height="1380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1380,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-agent free lesson architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-agent free lesson architecture" title="Multi-agent free lesson architecture" 
srcset="https://substackcdn.com/image/fetch/$s_!2YEV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 424w, https://substackcdn.com/image/fetch/$s_!2YEV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 848w, https://substackcdn.com/image/fetch/$s_!2YEV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 1272w, https://substackcdn.com/image/fetch/$s_!2YEV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef20e0ae-d28f-4cbf-bb86-675b4a0ab84a_1400x1380.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Squid applies the multi-agent pattern to coding. My <a href="https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a> applies it to writing, and I just released a free hands-on lesson that distills the whole system.</p><p>You build a multi-agent system composed of two FastMCP servers (Deep Research + LinkedIn Writer) orchestrated by a harness, plus an observability and evals layer on top. The shift from classic backend/frontend stacks to MCP servers and harnesses is the pattern shaping modern agentic AI.</p><p>Built for software and data engineers moving into agentic AI engineering.</p><p><em>Part of the 35-lesson course. Rated 5/5 by 300+ students. First 7 lessons free.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start the free lesson &#8594;&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/pages/free-lesson-offer-2?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start the free lesson &#8594;</span></a></p></div><h2>The Six Agents Engineering Team</h2><p>The system contains six agents. 
No agent both writes code and decides whether the code is correct.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9yS9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9yS9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 424w, https://substackcdn.com/image/fetch/$s_!9yS9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 848w, https://substackcdn.com/image/fetch/$s_!9yS9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!9yS9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9yS9!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png" width="1200" height="1031.1258278145694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1038,&quot;width&quot;:1208,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The six-agent team&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="The six-agent team" title="The six-agent team" srcset="https://substackcdn.com/image/fetch/$s_!9yS9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 424w, https://substackcdn.com/image/fetch/$s_!9yS9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 848w, https://substackcdn.com/image/fetch/$s_!9yS9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!9yS9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecde9a6b-b62d-4f38-a010-6c8af2e9728e_1208x1038.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">My Agentic Engineering Team</figcaption></figure></div><p>The <strong>product manager agent</strong> manages the tasks and ensures the feature adheres to the software architect&#8217;s specifications. It takes a raw feature specification, writes or updates an Architecture Decision Record (ADR) for non-obvious choices, and splits the feature into ordered tasks. It also maintains the Domain-Driven Design (DDD) glossary so vocabulary stays consistent between the business and engineering.</p><p>Because Claude can easily handle both PM and software architecture work, we merged the two roles into a single agent. This avoids fragmenting the context just to mirror a standard human process. Ultimately, planning should be closely aligned with the software architect&#8217;s vision. 
In human processes, dividing these two responsibilities often created more issues than solutions.</p><p>The <strong>software engineer agent</strong> uses red-green Test-Driven Development (TDD). It writes the failing test, writes the minimal code to pass it, and then refactors. The software engineer uses direct command-line interfaces (CLIs) like <code>git</code>, <code>mongosh</code>, and <code>gh</code>. It never uses MCP wrappers. CLIs are more flexible because they tap directly into the power of bash. Plus, LLMs have seen considerably more bash code than MCP wrappers during training.</p><p>The <strong>tester agent</strong> specializes in the adversarial end-to-end edge-case pass. It catches false-confidence claims where the software engineer says the tests pass. It does this by reading every acceptance criterion against concrete evidence, like the test name, file lines, and command output.</p><p>The <strong>pull request reviewer agent</strong> performs a diff-only review. It looks for dead code, duplication, missing test coverage, and documentation adherence. It does a narrow performance review on hot paths only. It&#8217;s explicitly told not to micro-optimize one-off scripts.</p><p>The <strong>on-call agent</strong> loops on the Continuous Integration (CI) pipeline until it passes. In an earlier iteration, the CI check lived inside the software engineer and tester loop, and it got skipped constantly. Promoting it to a dedicated agent invoked by the orchestrator increased the probability the step runs.</p><p>The <strong>self-improve agent</strong> is an optional meta agent. After the feature is done, while looking over the results, the human can run the self-improve agent to scan the run for high-signal lessons and propose updates to the agentic coding layer that consists of <code>CLAUDE.md</code>, skills and subagents. This is a double-edged sword. It can constantly improve your workflow or quickly degrade it if you are not careful. 
That&#8217;s why it&#8217;s incredibly important that this step is gated by a human.</p><p>The secret sauce is anchoring the agents in your own documentation.</p><h2>Keeping Up With Documentation: ADRs &amp; DDD Glossary</h2><p>The ADR directory acts as compressed architectural memory across runs. Every non-obvious choice regarding the datastore, synchronization defaults, authentication boundaries, or dependency lock-in ships with an ADR. These records live at <code>docs/adr/&lt;NNNN_title&gt;.md</code> and include the status, context, decision, and consequences. The product manager reads the directory before grooming a new feature, so decisions stay consistent across feature branches.</p><p>The DDD glossary, kept at <code>docs/glossary.md</code>, gives the business and engineering a shared vocabulary. It enforces one canonical name per concept. Code identifiers, OpenAPI schemas, database columns, and customer-facing interfaces all use the term exactly as it appears there. This gives Claude Code business context, not just code context, properly anchoring your code in your domain. The software engineer, tester, and pull request reviewer all reason about the same domain.</p><p>I have an honest caveat. The agents still under-use both the ADRs and the glossary. The spine exists, but I am still working on getting the agents to lean on it consistently.</p><p>Now the agents have the context they need to execute a feature from a raw specification all the way to a merged pull request.</p><h2>The Night Skill. The End-To-End Workflow.</h2><p>The <code>/night</code> skill takes one input, which is a feature specification written by the human, and produces one output, which is a merged pull request with green CI. Everything in this section sits between those two endpoints.</p><p>The <code>/night</code> pipeline is a long-running lifecycle. That&#8217;s why it&#8217;s called the &#8220;night&#8221; skill. 
It&#8217;s scoped to run for hours at a time, often with multiple pipelines in parallel.</p><p>It has two human checkpoints and five retry caps, while everything else is automated. The orchestrator acts as a manager. It never writes code itself, never runs tests itself, and never reviews the diff itself. It launches agents and enforces human validation.</p><p>After a human carefully writes a detailed feature specification, they call the <code>/night</code> skill, which creates a new branch and worktree. The product manager reads the glossary and ADR directory, updates or writes a new ADR if needed, and splits the feature into a task plan.</p><p>Then we hit the first human gate. The user approves the plan, optionally sharpened by the <code>/grill-me</code> skill. The <code>/grill-me</code> skill, inspired by Matt Pocock&#8217;s work, forces the agent to ask sharp questions back about anything fuzzy in the plan, such as interfaces, modularization, or new tools. This conversation is the line between vibe coding and agentic coding.</p><p>Next is the inner loop per task. The software engineer implements the code, the tester verifies it, and failures route back to the software engineer. This loop is capped at 5 attempts. Convergence is mostly mechanical: run, fail, fix, and run again.</p><p>The product manager then performs an acceptance review on the whole feature from the user&#8217;s perspective. Rejections are packed into a single task and sent back through the inner loop. This is capped at 3 attempts, because judgment-call loops are where Claude Code spirals.</p><p>Next, we repeat a similar loop using the PR reviewer agent, which looks at the diff, with a maximum of 3 attempts to avoid perfectionism. 
Adding a maximum number of attempts here is critical, because during review an LLM almost always has something else to say.</p><p>After the push, the on-call agent watches CI with a maximum of 5 attempts, routing failures back to the software engineer.</p><p>When the CI is green, we notify the user (e.g., via Slack) that the PR is ready for review. Optionally, based on any potential issues found while running the <code>/night</code> skill, we run <code>self-improve</code> to propagate that into your memory.</p><p> The /night lifecycle. Two human gates, five retry caps, everything else automated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8VWO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8VWO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 424w, https://substackcdn.com/image/fetch/$s_!8VWO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 848w, https://substackcdn.com/image/fetch/$s_!8VWO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 1272w, https://substackcdn.com/image/fetch/$s_!8VWO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8VWO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png" width="1200" height="965" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:965,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The /night lifecycle. Two human gates, five retry caps, everything else automated.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The /night lifecycle. Two human gates, five retry caps, everything else automated." title="The /night lifecycle. Two human gates, five retry caps, everything else automated." 
srcset="https://substackcdn.com/image/fetch/$s_!8VWO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 424w, https://substackcdn.com/image/fetch/$s_!8VWO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 848w, https://substackcdn.com/image/fetch/$s_!8VWO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 1272w, https://substackcdn.com/image/fetch/$s_!8VWO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5775bbe9-c91b-46fe-94ef-1a819be0bce8_1200x965.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">My Agentic Coding Setup</figcaption></figure></div><p>Beautiful! With this process I one-shot most of the features I am working on. And when it&#8217;s not a one-shot, I&#8217;m typically 95&#8211;99% there by the time I review the PR.</p><h2>How the Tester Stopped Re-Running What the SWE Already Ran</h2><p>The biggest problem with the e2e workflow above is that it&#8217;s slow and redundant. I preferred that over generating AI slop that I have to manually review and fix.</p><p>Still, there are a few tweaks that we can make to the workflow to improve speed and efficiency.</p><p>For example, when the tester re-ran the linter, type checker, formatter, and the happy-path suite that the software engineer had already run, we paid for everything twice. This was the number-one reason the system worked but was too slow to use.</p><p>To fix this, the tester now accepts the software engineer&#8217;s reports for formatting and happy-path tests. It only runs the adversarial end-to-end edge-case pass itself. This covers the part the software engineer can&#8217;t credibly self-verify. Trust is bounded. In hindsight, I realized I&#8217;d started shifting the <code>Tester</code> toward QA-style practices, rather than just running simple tests.</p><p>I am still iterating on optimizations. For example, I want to route some subagents to Claude Sonnet models instead of Claude Opus. 
I also plan to narrow toolsets per role to reduce reasoning failures.</p><p>Also, depending on what you are working on, you might want to use the system as a fast, snappy assistant rather than as a long-running workflow that prioritizes correctness above all.</p><h2>Day vs. Night: Two Orchestrators, One Team</h2><p>That&#8217;s why we have two pipelines running the same agents. The <code>/night</code> skill is the full lifecycle. It&#8217;s long-running and set-and-forget, has two human gates, and runs while you are away from the keyboard or working on something else in parallel.</p><p>The <code>/day</code> skill is the lean inner loop for surgical edits. It runs only the software engineer and the tester, and the human makes the commits. It skips product manager grooming, the pull request reviewer, and the on-call agent.</p><p>There is a concrete use case for the <code>/day</code> skill. When I read a merged pull request and find code I don&#8217;t like, the <code>/day</code> skill runs the stripped-down software engineer and tester loop to apply targeted edits. Then the on-call agent, which runs outside this loop, cleans up any CI fallout. 
This is the surgery that keeps the system from becoming a black box.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2bZF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2bZF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 424w, https://substackcdn.com/image/fetch/$s_!2bZF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 848w, https://substackcdn.com/image/fetch/$s_!2bZF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 1272w, https://substackcdn.com/image/fetch/$s_!2bZF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2bZF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png" width="1200" height="965" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:965,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Day vs. Night &#8212; same agent team, two orchestrators tuned for different workloads.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Day vs. Night &#8212; same agent team, two orchestrators tuned for different workloads." title="Day vs. Night &#8212; same agent team, two orchestrators tuned for different workloads." srcset="https://substackcdn.com/image/fetch/$s_!2bZF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 424w, https://substackcdn.com/image/fetch/$s_!2bZF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 848w, https://substackcdn.com/image/fetch/$s_!2bZF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 1272w, https://substackcdn.com/image/fetch/$s_!2bZF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1143ede6-93a8-4c72-ac85-3f12071bd61b_1200x965.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Day vs. Night: Same agent team, two orchestrators tuned for different workloads.</figcaption></figure></div><p>Both pipelines have one thing in common. The human is in the loop on purpose, not as a fallback.</p><h2>Why Code Templates Are a Waste of Time in 2026</h2><p>Most teams are still scaffolding from cookiecutter templates that were outdated the day they were committed. This is a maintenance tax disguised as productivity. Squid stops paying that tax. Technology moves fast enough that any frozen template&#8217;s frameworks, tooling, interfaces, and opinions all need their own maintenance pipeline. That&#8217;s only worth it if one template fans out across dozens of projects.</p><p>A Copier or cookiecutter template isn&#8217;t free. 
I tried scaling one across Python, TypeScript, and Go. I watched the project balloon into a maintenance burden where most files would never be used. Maintaining a template engine to support multiple stacks is a full-time job.</p><p>Asking Claude Code to copy from the last project fails too. It propagates the technical debt baked into the source codebase. You inherit the mess, not the ideal state.</p><p>The real shift is to write templates in Markdown, not Jinja. I call these <strong>agentic templates</strong>.</p><p>You encode good practices as skills and <code>CLAUDE.md</code> files. Fundamentals like clean architecture, CI/CD discipline, testing patterns, and development cycles rarely change. When they do change, you edit prose instead of regenerating from a template engine that quickly slides into dependency hell.</p><p>Tooling stays dynamic. You don&#8217;t pin framework versions inside a template. You keep a decision tree of allowed choices and let the agent pull the latest interfaces on demand via Context7 at scaffold time.</p><p>Project structure can&#8217;t be templatized. The anti-pattern organizes by type, putting files into <code>agents/</code>, <code>nodes/</code>, <code>schemas/</code>, and <code>tools/</code> directories. One business module&#8217;s logic ends up scattered across four folders, forcing both humans and the agent&#8217;s context window to thrash.</p><p>The correct pattern organizes by actionability, keeping one bounded context per directory. Each domain owns its own types, store, API, and prompts. 
That&#8217;s locally readable, easier to maintain, and easier for the agent to reason about.</p><p>Because we describe the structure in Markdown files instead of cookiecutter templates, we can define it like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bhZj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bhZj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 424w, https://substackcdn.com/image/fetch/$s_!bhZj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 848w, https://substackcdn.com/image/fetch/$s_!bhZj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 1272w, https://substackcdn.com/image/fetch/$s_!bhZj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bhZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png" width="1456" height="946" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:946,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!bhZj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 424w, https://substackcdn.com/image/fetch/$s_!bhZj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 848w, https://substackcdn.com/image/fetch/$s_!bhZj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 1272w, https://substackcdn.com/image/fetch/$s_!bhZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83fcb967-260a-49e3-af0c-b1a7f0181e01_2737x1779.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Avoid global dumping grounds like <code>utils/</code> or <code>helpers/</code>. Avoid a root-level <code>types.py</code> grab bag. Avoid grouping tests by type.</p><p>The <code>/scaffold</code> skill acts as an interactive bootstrap. An <code>AskUserQuestion</code> prompt drives a tight decision tree covering project identity, layout, components, backend, frontend framework, infrastructure, agent team, tracker, ADR and glossary opt-ins, and external services. A deterministic table picks only the matching specifications from the specification library. Unused categories never enter the context. The skill writes a tailored <code>CLAUDE.md</code> brief, lays down an empty folder skeleton, and hands off.</p><p>Then, based on the agentically generated template, you can use <code>/night</code> or <code>/day</code> to start writing real code.</p><h2>Open-Sourcing Squid</h2><p>I don&#8217;t want to keep Squid for myself. 
I want to share it with the community so that others can learn from it and contribute to it.</p><p>Thus, <strong>I am open-sourcing <a href="https://github.com/iusztinpaul/squid">Squid</a>.</strong></p><p>You can install it as a Claude Code plugin:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">/plugin marketplace add iusztinpaul/squid
/plugin install squid@squid</code></pre></div><p>I want you to try it, build something awesome with it, and if you like it, contribute back:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/iusztinpaul/squid&quot;,&quot;text&quot;:&quot;Check the full codebase&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/iusztinpaul/squid"><span>Check the full codebase</span></a></p><p><em>Still, here is what I&#8217;m wondering:</em></p><blockquote><p><em><strong>What is your agentic coding setup? How is Squid different from your own approach?</strong></em></p></blockquote><p><em>Click the button below and tell me. I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/squid-my-agentic-coding-setup-may-2026/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/squid-my-agentic-coding-setup-may-2026/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to restack this for your readers.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/squid-my-agentic-coding-setup-may-2026?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/squid-my-agentic-coding-setup-may-2026?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><div class="callout-block" data-callout="true"><h4>Whenever you&#8217;re ready, here is how I can help you</h4><p>If you want to go from zero to shipping production-grade AI agents, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>35 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p>Built for software engineers, data engineers, and data scientists transitioning into AI engineering.</p><p><em>Rated 5/5 by 300+ students. 
The first 7 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><p><em>Not ready to commit?</em> Start with our <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free Agentic AI Engineering Guide</a></strong>, a 6-day email course on the mistakes that silently break AI agents in production.</p></div><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Building Agentic GraphRAG Systems]]></title><description><![CDATA[From knowledge graphs and ontologies to a unified memory as an MCP server for your AI agent.]]></description><link>https://www.decodingai.com/p/agentic-graphrag</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-graphrag</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 05 May 2026 05:01:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XYED!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!XYED!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XYED!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!XYED!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!XYED!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!XYED!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XYED!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GraphRAG at a glance. 
Fragmented sources unified into a single knowledge graph the agent reads from and writes back to via two tools.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GraphRAG at a glance. Fragmented sources unified into a single knowledge graph the agent reads from and writes back to via two tools." title="GraphRAG at a glance. Fragmented sources unified into a single knowledge graph the agent reads from and writes back to via two tools." srcset="https://substackcdn.com/image/fetch/$s_!XYED!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!XYED!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!XYED!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!XYED!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff21aa3d9-5221-4fef-a924-aea1f1eb9076_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 1: GraphRAG at a glance.</em></figcaption></figure></div><p>I gave this talk twice in one month: at O&#8217;Reilly&#8217;s Context Engineering Event and at Abi Aryan&#8217;s Maven course on LLM inference at scale. After being blasted with questions, I realized something: GraphRAG isn&#8217;t a retrieval algorithm, it&#8217;s a data modeling problem.</p><p>Powering agents with knowledge graphs (KGs) and ontologies is still an unsolved problem. All the engineers I spoke to want GraphRAG, but don&#8217;t know how to implement it.</p><p>But at its core, we should ask a different question. Why do we even need GraphRAG in the first place? Why complicate our solution over a simple RAG system?</p><p>There are three core reasons.</p><p>First, you face context rot. As the context window fills, the signal-to-noise ratio collapses. 
The LLM degrades.</p><p>You pay for this degradation in quality, cost, and latency <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">[1]</a>.</p><p>Second, you face data fragmentation. In the agent era, your data lives in the same silos most builders have: documents, notes, research, emails, and text messages. We are no longer lucky enough to have all the data neatly stored in a single database.</p><p>Third, the agent&#8217;s unified memory naturally maps to a knowledge graph (KG). People have preferences and experiences. They visit specific locations, meet other people, and keep lists of things to do. Things get trickier when <em>&#8220;Arthur told Felix that his favorite coffee shop is in the center of Timisoara&#8221;</em>, but two months later <em>&#8220;it moved to Lisbon&#8221;</em>. You need to track relationships between people and locations and, above all, how those relationships change over time.</p><p>GraphRAG solves all three.</p><p>This is a data modeling problem, not a retrieval algorithm. It took a painful LangChain detour and a hard MongoDB RAM conversation to settle that for me. 
You need an ontology.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K9Ds!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K9Ds!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic GraphRAG Architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic GraphRAG Architecture" title="Agentic GraphRAG Architecture" srcset="https://substackcdn.com/image/fetch/$s_!K9Ds!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 2: The full GraphRAG system architecture.</figcaption></figure></div><p>By the end of this article, you will learn about ontology-first design, the three extraction modes, append-only data models, and hybrid retrieval joined by Reciprocal Rank Fusion (RRF). Finally, you will see how to expose the GraphRAG engine as a unified memory layer via an MCP server to power your agents. 
In other words, <strong>how to do agentic GraphRAG</strong>.</p><p>Before walking through the architecture, let&#8217;s understand why the story has to start from the ontology.</p><div class="callout-block" data-callout="true"><h2><a href="https://github.com/iusztinpaul/designing-real-world-ai-agents-workshop">Build Your Own Multi-Agent System Free Workshop (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://github.com/iusztinpaul/designing-real-world-ai-agents-workshop" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!erqJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 424w, https://substackcdn.com/image/fetch/$s_!erqJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 848w, https://substackcdn.com/image/fetch/$s_!erqJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 1272w, https://substackcdn.com/image/fetch/$s_!erqJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!erqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png" width="1400" height="1380" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1380,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-agent workshop architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://github.com/iusztinpaul/designing-real-world-ai-agents-workshop&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-agent workshop architecture" title="Multi-agent workshop architecture" srcset="https://substackcdn.com/image/fetch/$s_!erqJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 424w, https://substackcdn.com/image/fetch/$s_!erqJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 848w, https://substackcdn.com/image/fetch/$s_!erqJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 1272w, https://substackcdn.com/image/fetch/$s_!erqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c792e04-46a4-4edf-8c7f-bf7d370cabde_1400x1380.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This article shows what an MCP-served unified memory looks like end to end. If you want to actually build agentic systems with MCP servers like this, I open-sourced a hands-on workshop for that.</p><p>Two MCP servers from scratch: a Deep Research Agent (Gemini + Google Search grounding) and a Writing Workflow with an evaluator-optimizer loop.</p><p>Packaged with slides, a ~2-hour video, runnable reference code, and an &#8220;implement-it-yourself&#8221; skeleton via agentic coding best practices (25 tickets, one orchestrator skill, and two agents: SWE and tester).</p><p>Originally presented at the AI Engineering Conference Europe. 200+ stars on GitHub. 
Free.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/iusztinpaul/designing-real-world-ai-agents-workshop&quot;,&quot;text&quot;:&quot;Go to workshop &#8594;&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/iusztinpaul/designing-real-world-ai-agents-workshop"><span>Go to workshop &#8594;</span></a></p></div><h2>Why the Story Starts From the Ontology</h2><p>Whenever you need to connect dots across a corpus of multiple documents rather than find the most relevant paragraph, you go for GraphRAG. Knowledge is stored as entities and edges.</p><p>You traverse connections rather than find similar text.</p><p>An ontology is a collection of classes and the relationships allowed between them. If you come from object-oriented programming, you already have the right intuition.</p><p>Throughout this article, we will build a digital twin. My favorite example. We will define a Global Ontology of six entity types organized into two sub-ontologies.</p><p>The data pipeline deterministically constructs the Document Ontology. It contains <code>DOCUMENT</code> and <code>CHUNK</code> nodes. It uses <code>PART_OF</code>, <code>NEXT</code>, <code>REFERENCED</code>, and <code>MENTIONS</code> edges.</p><p>The LLM extracts the Person Ontology. It contains <code>PERSON</code>, <code>TASK</code>, <code>EPISODE</code>, and <code>PREFERENCE</code> nodes. It uses <code>RELATED_TO</code>, <code>TODO</code>, <code>EXPERIENCED</code>, and <code>HAS</code> edges.</p><p>The schema is flexible. You define it for your business case. 
Every section after this one assumes these exact node and edge labels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-9A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 424w, https://substackcdn.com/image/fetch/$s_!6-9A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 848w, https://substackcdn.com/image/fetch/$s_!6-9A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 1272w, https://substackcdn.com/image/fetch/$s_!6-9A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6-9A!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png" width="1200" height="416.2087912087912" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ontology_example&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="ontology_example" title="ontology_example" srcset="https://substackcdn.com/image/fetch/$s_!6-9A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 424w, https://substackcdn.com/image/fetch/$s_!6-9A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 848w, https://substackcdn.com/image/fetch/$s_!6-9A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 1272w, https://substackcdn.com/image/fetch/$s_!6-9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69db60bc-3ff8-438a-a432-cfedf2ad3622_2586x897.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 3: Left shows the Global Ontology split into a Document Ontology and a Person Ontology. Right shows an instantiated KG with nodes wired together via the eight typed edges.</figcaption></figure></div><p>Skipping the ontology carries a heavy cost. I tried LangChain&#8217;s <code>MongoDBGraphStore</code>, which lets the LLM extract entity and relationship types freely. Five documents produced 17 node types and 34 relationship types.</p><p>This included <code>part_of</code>, <code>Part Of</code>, and <code>part of</code> as three separate types. The underlying data model does not enforce a schema at the storage layer.</p><p>With an ontology, the LLM can only extract what you defined. The constrained scope also allows you to use cheaper extractor models.</p><p>That&#8217;s why GraphRAG is the right tool when you have a clearly defined schema. 
It works when you need to identify relationships.</p><p>It reduces hallucination on complex queries that span interconnected facts. Domains where knowledge graphs naturally fit are legal, medical, financial, business operations, productivity tools, and, in my opinion, the crown jewel: personal assistants. With a KG, you can naturally build the unified memory of your personal assistant so it properly remembers what you like, what you did, and what you have to do, all anchored in time.</p><p>For example, Palantir built its empire using ontologies. Google uses a KG to power its search, and Microsoft uses one in its internal ops tools.</p><p>With the ontology defined, the next architectural choice is the shape of the graph itself and how to extract those entities from raw text.</p><h2>RDF vs. Property Graphs, and the Three Extraction Modes</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GtoY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GtoY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 424w, https://substackcdn.com/image/fetch/$s_!GtoY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 848w, https://substackcdn.com/image/fetch/$s_!GtoY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GtoY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GtoY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png" width="1400" height="1342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1342,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RDF vs. Labeled Property Graph on the same Arthur fact. RDF explodes every property into its own triplet. Property Graphs attach properties to the node. Agent stacks use property graphs in practice.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RDF vs. Labeled Property Graph on the same Arthur fact. RDF explodes every property into its own triplet. Property Graphs attach properties to the node. Agent stacks use property graphs in practice." title="RDF vs. Labeled Property Graph on the same Arthur fact. RDF explodes every property into its own triplet. Property Graphs attach properties to the node. Agent stacks use property graphs in practice." 
srcset="https://substackcdn.com/image/fetch/$s_!GtoY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 424w, https://substackcdn.com/image/fetch/$s_!GtoY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 848w, https://substackcdn.com/image/fetch/$s_!GtoY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!GtoY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05040c7c-d947-454a-ae93-a39e52ef86fb_1400x1342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: RDF vs. Labeled Property Graph on the same Arthur fact. RDF explodes every property into its own triplet. Property Graphs attach properties to the node. Agent stacks use property graphs in practice.</em></figcaption></figure></div><p>Every graph is structured as a collection of (entity, relationship, entity) triplets. But there are two ways to attach data to each entity or relationship instance, known as Resource Description Framework (RDF) and labeled property graphs.</p><p>RDF attaches each piece of metadata as another triplet. The graph explodes in size. Property graphs attach metadata as JSON on the entity or relationship.</p><p>In practice, GraphRAG and agents use property graphs <a href="https://www.manning.com/books/knowledge-graphs-and-llms-in-action">[3]</a>.</p><p>Now, during <strong>extraction</strong>, where we actually map data into our (entity, relationship, entity) triplets, plus their corresponding data, we have three core methods.</p><p><strong>Structured</strong> extraction is schema-guided. The LLM outputs entities per the Person Ontology.</p><p><strong>Semi-structured</strong> extraction uses metadata and lineage without an LLM. You parse the email&#8217;s links and attachments.</p><p><strong>Unstructured</strong> extraction uses an LLM without a schema. The LLM invents its own labels. This is useful for discovery, not for grounded retrieval. In other words, we use the LLM to extract triplets without an ontology. 
Exactly what we said to avoid in the previous section.</p><p>Here is the data-source mapping for the Person Ontology of the digital twin:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B24g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B24g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 424w, https://substackcdn.com/image/fetch/$s_!B24g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 848w, https://substackcdn.com/image/fetch/$s_!B24g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 1272w, https://substackcdn.com/image/fetch/$s_!B24g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B24g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png" width="1456" height="629" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;table&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="table" title="table" srcset="https://substackcdn.com/image/fetch/$s_!B24g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 424w, https://substackcdn.com/image/fetch/$s_!B24g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 848w, https://substackcdn.com/image/fetch/$s_!B24g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 1272w, https://substackcdn.com/image/fetch/$s_!B24g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d668c23-2d2e-4e89-a677-7b5110ce6dd9_1920x830.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Table 1: Data-source mapping for the digital twin.</figcaption></figure></div><p>The Document Ontology can be built entirely through semi-structured mechanics, since we already know which document each chunk comes from, the author of each document, and the references between them.</p><blockquote><p> &#128161; A student asked about open-domain extraction. Exploratory extraction is great early on, when you are figuring out what ontology makes sense for your data. For that exploratory phase, you can use zero-shot Named Entity Recognition (NER) models like GLiNER <a href="https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/">[4]</a>, which you can easily run locally without powerful inference hardware. Once the ontology is settled, constrain extraction to it. Without that discipline, the output becomes unusable noise within tens of documents. 
A constrained scope lets you swap the frontier model for a small extractor such as Gemini Flash Lite or Claude Haiku, or, even better, a Liquid open-source model fine-tuned on your ontology.</p></blockquote><p>These extraction modes feed directly into a five-component system that turns raw documents into queryable memory.</p><h2>The Five-Component Architecture</h2><p>The input consists of heterogeneous documents scattered across multiple silos. The output is a single queryable knowledge graph. The agent can search and write back to it via two tools.</p><p>Everything in between is plumbing built to serve that one job.</p><p>The data pipeline gathers documents from URIs, notes, emails, and Google Drive. It normalizes everything into a document collection written to a warehouse.</p><p>The memory pipeline turns those documents into knowledge-graph objects and writes them into the unified memory, which is modeled as a KG.</p><p>The KG is the queryable artifact. The agent communicates with the knowledge graph via an MCP server that exposes search and write tools. 
If you are building in Python, choose FastMCP over the native MCP SDK, as it&#8217;s much easier to use and offers a better developer experience.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YdlO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YdlO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!YdlO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!YdlO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!YdlO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YdlO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;the-five-component-architecture.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="the-five-component-architecture.png" title="the-five-component-architecture.png" srcset="https://substackcdn.com/image/fetch/$s_!YdlO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!YdlO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!YdlO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!YdlO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F796f1448-9394-405d-9c74-2a20c2c8b782_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 5: The five-component architecture. Sources flow through the data and memory pipelines into the materialized knowledge graph. The agent talks to it through two MCP-exposed tools.</figcaption></figure></div><p>The <code>search_memory</code> family of tools brings only the slice the agent needs into the context window. 
The <code>write_memory</code> tools run the same data and memory pipelines on demand, on a single conversation or URI, instead of in batch mode <a href="https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html">[5]</a>.</p><p>Ultimately, we connect the MCP server to a harness such as Claude Code or Codex, where a family of <code>assistant-memory</code> and <code>assistant-learn</code> skills injects the custom business logic for how the tools should be used.</p><p>For 2-3 hop traversals, Postgres or MongoDB can handle documents, vectors, and graph lookups in a single piece of infrastructure <a href="https://www.youtube.com/watch?v=JhfClrHIwG0">[7]</a>.</p><p>Reach for Neo4j only when deep traversals or specialized graph algorithms are core to the product <a href="https://neo4j.com/labs/agent-memory/explanation/graph-architecture/">[8]</a>. A good trade-off is to use it internally, just for data exploration. Do not design for Google scale when you are processing thousands of documents.</p><p>The memory pipeline sits at the core of this architecture, transforming raw documents into the exact triplets the rest of the system queries.</p><h2>The Memory Pipeline</h2><p>First, the memory pipeline cleans the incoming document.</p><p>Next comes optional chunking. If you can avoid chunking, avoid it. It introduces problems and is more a RAG-era reflex than a necessity. Always tailor the solution to your data and introduce as little complexity as possible.</p><p>Next, the graph extractor emits triplets. Use Pydantic-style schema descriptors so the LLM knows what each field should look like.</p><p>Normalization is the most important step. You track the evolution of a single entity over time, so do not allow multiple versions of the same person to exist. The system re-uses the same canonical ID across extractions. 
New metadata and new relationships layer on top <a href="https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/">[9]</a>.</p><p>Finally, you embed the relevant fields for semantic search.</p><p>Now, let&#8217;s look at the main data models you can use to store your KG.</p><h2>Single Mutable Collection vs. Append-Only Log Data Models</h2><p>There are two main ways to model your collections: as an append-only log or as a single mutable collection. Both have their pros and cons.</p><p>The append-only approach consists of two collections: the append-only log itself and a queryable materialized view.</p><p>The system appends every event to an immutable log. A periodic materialization step squashes all events for the same ID into one canonical record.</p><p>You get versioning, temporality, and reversibility for free. You pay in RAM and operational complexity. As RAM is the scarcest and most costly piece of hardware for hosting databases, this quickly translates into larger compute costs.</p><p>The single mutable collection approach drops the log. Each extraction directly upserts into the queryable collection.</p><p>You get simpler ops and real-time visibility, but the temporal audit trail is gone. Pick the single collection if operational simplicity and reduced costs beat time-travel.</p><p>Pick the two-collection append-only approach if you genuinely need an audit trail. Append-only collections never delete and never update. The same ID can appear multiple times across extractions, reflecting updates to an entity or relationship instance across the KG.</p><p>You can replay history up to a point in time, soft-delete, and revert a bad extraction. Materialization squashes all log events sharing an ID into one canonical entity.</p><p>An intuitive way of comparing the two methods is that the single mutable collection option is the same as the materialized view of the append-only option. 
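</p><p>To make the squashing step concrete, here is a minimal Python sketch of materialization over an in-memory log. The event fields (<code>id</code>, <code>ts</code>, <code>props</code>, <code>source</code>, <code>deleted</code>) and the <code>materialize</code> helper are hypothetical names chosen for illustration, not the schema used in the Digital Twin:</p><pre><code class="language-python">from collections import defaultdict


def materialize(log_events):
    """Squash append-only log events into one canonical record per ID."""
    grouped = defaultdict(list)
    for event in log_events:
        grouped[event["id"]].append(event)

    view = {}
    for entity_id, events in grouped.items():
        # Replay events oldest-first so later extractions win on conflicts.
        events.sort(key=lambda e: e["ts"])
        if events[-1].get("deleted"):
            continue  # soft-deleted: it stays in the log but is hidden from the view
        record = {"id": entity_id, "props": {}, "sources": []}
        for event in events:
            record["props"].update(event.get("props", {}))
            source = event["source"]
            if source not in record["sources"]:
                record["sources"].append(source)
        view[entity_id] = record
    return view


# Hypothetical log: the same ID appears several times across extractions.
log = [
    {"id": "person:ada", "ts": 1, "source": "doc-1", "props": {"role": "engineer"}},
    {"id": "person:ada", "ts": 2, "source": "doc-2", "props": {"role": "CTO"}},
    {"id": "tool:neo4j", "ts": 1, "source": "doc-1", "props": {}},
    {"id": "tool:neo4j", "ts": 3, "source": "doc-3", "deleted": True},
]

view = materialize(log)
print(view["person:ada"])</code></pre><p>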
Thus, one option comes with an append-only log, which comes with versioning and temporality, while the other doesn&#8217;t.</p><h3>How Would This Look Within the Digital Twin?</h3><p>Each log event lands with an auto-generated <code>ObjectId</code> plus a single <code>chunk_id</code> and <code>source_document_id</code> pinning it to one origin, with no embedding because nothing has been merged yet into the final instance. Materialization groups events by <code>(name, type)</code> for nodes and by the <code>(source, kind, target)</code> triplet for edges, swapping the <code>ObjectId</code> for a deterministic composite ID that <em>is</em> the merge key, unioning every contributing document into a <code>sources</code> array, and embedding each canonical entity once.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zNQp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zNQp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 424w, https://substackcdn.com/image/fetch/$s_!zNQp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 848w, https://substackcdn.com/image/fetch/$s_!zNQp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zNQp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zNQp!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png" width="1200" height="752.4725274725274" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:913,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;data_model_append_only_log&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="data_model_append_only_log" title="data_model_append_only_log" srcset="https://substackcdn.com/image/fetch/$s_!zNQp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 424w, https://substackcdn.com/image/fetch/$s_!zNQp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 848w, https://substackcdn.com/image/fetch/$s_!zNQp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zNQp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4d1233-69a5-4efc-96c1-9e853d3b78a6_2087x1309.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 6: The two-collection MongoDB shape. Left column shows the append-only log node and edge. Right column shows the materialized node and materialized edge.</figcaption></figure></div><p>Nodes and edges share a single collection, separated only by a <code>kind</code> discriminator. 
So within our MongoDB implementation, <code>$graphLookup</code> walks <code>source_node_id &#8594; target_node_id</code> recursively without joining across collections.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E-qQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E-qQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 424w, https://substackcdn.com/image/fetch/$s_!E-qQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 848w, https://substackcdn.com/image/fetch/$s_!E-qQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 1272w, https://substackcdn.com/image/fetch/$s_!E-qQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E-qQ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png" width="1200" height="636.2637362637363" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:772,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;data_model_one_collection&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="data_model_one_collection" title="data_model_one_collection" srcset="https://substackcdn.com/image/fetch/$s_!E-qQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 424w, https://substackcdn.com/image/fetch/$s_!E-qQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 848w, https://substackcdn.com/image/fetch/$s_!E-qQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 1272w, https://substackcdn.com/image/fetch/$s_!E-qQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4ed1b3f-7ca3-4e9c-9e47-a74252a3189a_2292x1215.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 7: The one-collection MongoDB shape. Nodes and edges coexist in a single collection, both keyed by deterministic string IDs.</figcaption></figure></div><p>A student asked about community detection and isolated nodes. Once materialization runs, the system computes communities over the canonical node collection. An isolated node is just a singleton community. Filter or keep it based on your use case. Postgres and MongoDB handle hundreds of millions of small records. They can also scale horizontally through sharding, partitioning on the entity and relationship IDs.</p><p>Now, let&#8217;s understand how we can query the KG and plug it into an agent.</p><h2>Finally...Let&#8217;s Understand the Retrieval Algorithm</h2><p>During retrieval, we use a hybrid index.</p><p>Text search matches exact keywords. Semantic search is meaning-based. 
Graph search is a multi-hop traversal across the typed edges.</p><p>Communities are an optional fourth index for topical clusters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nriB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nriB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!nriB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!nriB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!nriB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nriB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Retrieval Example&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Retrieval Example" title="Retrieval Example" srcset="https://substackcdn.com/image/fetch/$s_!nriB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!nriB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!nriB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!nriB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1d6313-13fa-49b8-9098-8a52d8995da4_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 8: Top-down retrieval example for the query: &#8220;Create a presentation on GraphRAG for O&#8217;Reilly&#8221;.</em></figcaption></figure></div><p>GraphRAG retrieval is a two-stage move <a href="https://towardsdatascience.com/how-to-build-a-graph-rag-app-b323fc33ba06/">[10]</a>.</p><p>Stage 1 runs text and semantic search. It merges results with Reciprocal Rank Fusion (RRF). 
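</p><p>As a rough illustration of the fusion step, the sketch below scores each ID by summing 1 / (k + rank) over the text and semantic rankings. The function name, the <code>k=60</code> constant (a commonly used default), the <code>top_n</code> cutoff, and the sample IDs are assumptions for illustration, not the exact implementation behind this system:</p><pre><code class="language-python">def reciprocal_rank_fusion(rankings, k=60, top_n=3):
    """Fuse several ranked lists of IDs into one list via RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Hypothetical results from the two Stage 1 searches, best match first.
text_hits = ["graphrag", "oreilly", "rag"]
semantic_hits = ["graphrag", "knowledge_graph", "oreilly"]

entry_points = reciprocal_rank_fusion([text_hits, semantic_hits], top_n=3)
print(entry_points)</code></pre><p>IDs returned by both searches accumulate two contributions, so they tend to rise above IDs that only one search found.</p><p>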
Apply a cutoff to get your entry points <a href="https://substack.com/@jeremyarancio/note/c-205294494">[11]</a>.</p><p>Stage 2 walks 2-3 hops across the typed edges to expand the result set.</p><p>During retrieval, GraphRAG&#8217;s addition over RAG is this multi-hop step, after the RRF merge, which is standard for most RAG systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p-rw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p-rw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 424w, https://substackcdn.com/image/fetch/$s_!p-rw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 848w, https://substackcdn.com/image/fetch/$s_!p-rw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 1272w, https://substackcdn.com/image/fetch/$s_!p-rw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p-rw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png" width="1399" height="1254" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1254,&quot;width&quot;:1399,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two-stage retrieval. Text and semantic search feed RRF for entry points. From there, 2-3 hop graph traversal expands the result set.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two-stage retrieval. Text and semantic search feed RRF for entry points. From there, 2-3 hop graph traversal expands the result set." title="Two-stage retrieval. Text and semantic search feed RRF for entry points. From there, 2-3 hop graph traversal expands the result set." 
srcset="https://substackcdn.com/image/fetch/$s_!p-rw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 424w, https://substackcdn.com/image/fetch/$s_!p-rw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 848w, https://substackcdn.com/image/fetch/$s_!p-rw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 1272w, https://substackcdn.com/image/fetch/$s_!p-rw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc07f92e0-5a0b-46a8-b96d-f3ea887de28d_1399x1254.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 9: Two-stage retrieval. Text and semantic search feed RRF for entry points. From there, 2-3 hop graph traversal expands the result set.</em></figcaption></figure></div><p>Still, there are two retrieval directions to distinguish. Bottom-up expands entities for depth, while top-down hops across communities for a high-level overview <a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/">[2]</a>. This translates to a trade-off between context size, latency, and performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LuE-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LuE-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LuE-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!LuE-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LuE-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LuE-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Bottom-up vs. top-down GraphRAG. Both start at text and semantic search. Bottom-up expands entities for depth. Top-down hops across communities for a high-level overview.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Bottom-up vs. top-down GraphRAG. Both start at text and semantic search. Bottom-up expands entities for depth. Top-down hops across communities for a high-level overview." title="Bottom-up vs. top-down GraphRAG. Both start at text and semantic search. Bottom-up expands entities for depth. Top-down hops across communities for a high-level overview." 
srcset="https://substackcdn.com/image/fetch/$s_!LuE-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!LuE-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!LuE-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!LuE-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb70ad9e2-6d9a-4cf7-b66d-ae7bb7b49214_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 10: Bottom-up vs. top-down GraphRAG. Both start at text and semantic search. Bottom-up expands entities for depth. Top-down hops across communities for a high-level overview.</em></figcaption></figure></div><p>Now, to close the loop, let&#8217;s connect everything to an agent.</p><h2>The Cherry on Top: Agentic GraphRAG</h2><p>GraphRAG becomes agentic when an agent gets to write to and search the knowledge graph autonomously <a href="https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html">[5]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K9Ds!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K9Ds!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 
1272w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agentic GraphRAG Architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic GraphRAG Architecture" title="Agentic GraphRAG Architecture" srcset="https://substackcdn.com/image/fetch/$s_!K9Ds!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!K9Ds!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffefc6b09-62e0-4a2f-93b6-ad8c2ce773db_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 11: Agentic GraphRAG via MCP. The agent calls search and write tools exposed by an MCP server.</em></figcaption></figure></div><p>The agent dynamically writes queries against the materialized knowledge graph using a family of <code>search_memory</code> tools. 
The <code>write_memory</code> family of tools runs the data and memory pipelines on the current conversation or any other type of document. These tools are exposed to the agent via the MCP server, implemented in FastMCP.</p><p>This differs from the five-component architecture explained earlier: this time, the agent decides when to search or write to memory.</p><p>The search tools can directly implement the text + semantic + graph-search algorithm programmatically, or let the agent write the query code on demand, which gives more flexibility at the cost of potentially less optimal code.</p><p>As for the write tools, allowing the agent to ingest the current conversation enables continual learning by dynamically tracking the user&#8217;s preferences, to-dos, experiences and more.</p><p>At the moment, harnesses such as Claude Code use the filesystem to implement the memory layer. But as the data grows, gets more complex, or we have to operate under strict cost/latency requirements, we will need more powerful solutions than just hoping the agent will figure it out through progressive disclosure.</p><h2>What&#8217;s Next</h2><p>In this piece, I presented only the high-level architecture and strategies around GraphRAG.</p><p>The issue is that when you start diving into each component, such as normalization, extraction, embedding or data modeling, you will realize that everything is highly specific to your own data and use case.</p><p>This is especially true because GraphRAG is still in its early days, and there is no clear plan of attack yet.</p><p>That&#8217;s why I am actively working on a new book on how to implement a personal assistant from scratch (yes, together with Maxime Labonne!), where we will explore building a memory layer stage by stage: RAG, then GraphRAG, with an AI Evals layer on top to measure the actual gain in performance when introducing GraphRAG. 
As soon as I have more details on this, I will let you know.</p><p><em>But here is what I&#8217;m wondering:</em></p><blockquote><p><em><strong>Are you using a single database (Postgres / MongoDB) or splitting graph and vector workloads across specialized systems (Neo4j + Pinecone)?</strong></em></p></blockquote><div><hr></div><h2>References</h2><ol><li><p>Anthropic. (n.d.). Effective Context Engineering for AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a></p></li><li><p>Larson, J. (2024, April 2). GraphRAG: Unlocking LLM Discovery on Narrative Private Data. Microsoft Research. <a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/">https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/</a></p></li><li><p>Negro, A., Kus, V., Futia, G., &amp; Montagna, F. (n.d.). Knowledge Graphs and LLMs in Action. Manning. <a href="https://www.manning.com/books/knowledge-graphs-and-llms-in-action">https://www.manning.com/books/knowledge-graphs-and-llms-in-action</a></p></li><li><p>Neo4j Graph Data Platform. (n.d.). How Entity Extraction Works. Neo4j Agent Memory. 
<a href="https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/">https://neo4j.com/labs/agent-memory/explanation/extraction-pipeline/</a></p></li><li><p>Monigatti, L. (n.d.). The Evolution From RAG to Agentic RAG to Agent Memory. Leonie Monigatti. <a href="https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html">https://www.leoniemonigatti.com/blog/from-rag-to-agent-memory.html</a></p></li><li><p>Govindarajan, V. (n.d.). OpenClaw Architecture - Part 3: Memory and State Ownership. The Agent Stack. <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory</a></p></li><li><p>Iusztin, P., &amp; Rodrigues, J. (n.d.). How We Killed Our RAG Pipeline. Decoding AI Magazine. <a href="https://www.decodingai.com/p/building-vertical-ai-agents-case-study-1">https://www.decodingai.com/p/building-vertical-ai-agents-case-study-1</a></p></li><li><p>Neo4j Graph Data Platform. (n.d.). Why Neo4j? Graph-Native Memory Architecture. Neo4j Agent Memory. <a href="https://neo4j.com/labs/agent-memory/explanation/graph-architecture/">https://neo4j.com/labs/agent-memory/explanation/graph-architecture/</a></p></li><li><p>Neo4j Graph Data Platform. (n.d.). Entity Resolution and Deduplication. Neo4j Agent Memory. <a href="https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/">https://neo4j.com/labs/agent-memory/explanation/resolution-deduplication/</a></p></li><li><p>Hedden, S. (n.d.). How to Build a Graph RAG App. Towards Data Science. <a href="https://towardsdatascience.com/how-to-build-a-graph-rag-app-b323fc33ba06/">https://towardsdatascience.com/how-to-build-a-graph-rag-app-b323fc33ba06/</a></p></li><li><p>Arancio, J. (n.d.). Comment on Hybrid RRF Retrieval Pipeline. Substack. <a href="https://substack.com/@jeremyarancio/note/c-205294494">https://substack.com/@jeremyarancio/note/c-205294494</a></p></li><li><p>Liu, J. (2025, May 19). There Are Only 6 RAG Evals. jxnl. 
<a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/">https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/</a></p></li><li><p>Zhang, B. (2026, January 22). Scaling PostgreSQL to Power 800 Million ChatGPT Users. OpenAI. <a href="https://openai.com/index/scaling-postgresql/">https://openai.com/index/scaling-postgresql/</a></p></li><li><p>Govindarajan, V. (n.d.). OpenClaw Architecture - Part 2: Concurrency, Isolation, and the Invariants That Keep Agents Sane. The Agent Stack. <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-2-concurrency">https://theagentstack.substack.com/p/openclaw-architecture-part-2-concurrency</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[What Held Up at 3 AM: One Engineer's RAG Case Study]]></title><description><![CDATA[You iterate. You evaluate. Weave CLI unifies 11 vector databases into one workflow.]]></description><link>https://www.decodingai.com/p/ship-rag-with-weave-cli</link><guid isPermaLink="false">https://www.decodingai.com/p/ship-rag-with-weave-cli</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Wed, 29 Apr 2026 11:04:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qCr3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qCr3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!qCr3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!qCr3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!qCr3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!qCr3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qCr3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png" width="1400" height="1000" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;thumbnail.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="thumbnail.png" title="thumbnail.png" 
srcset="https://substackcdn.com/image/fetch/$s_!qCr3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!qCr3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!qCr3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!qCr3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0426d4ef-b40e-4da1-9f22-bc50409368e2_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Most AI demos work. Most AI products don&#8217;t. This series is a collection of interviews with engineers who shipped AI agents to production, covering the stacks they chose, the architectures they regretted, and what actually held up at 3 AM.</p><p>This is an interview with <a href="https://www.linkedin.com/in/drmaximilien/">Michael Maximilien</a>, former CTO and Distinguished Engineer at IBM and Chairperson of the Board of the Node.js Foundation. He is now the founder and CEO of ClawMax.ai, an AI agent orchestration platform powered by OpenClaw, and the creator of <a href="https://github.com/maximilien/weave-cli/tree/main">weave-cli</a>, an open-source tool for shipping Retrieval-Augmented Generation (RAG) systems.</p><p><em>Watch our full interview on YouTube</em> &#8595;</p><div id="youtube2-eYaWxljC4sA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;eYaWxljC4sA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/eYaWxljC4sA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Michael Maximilien spent a year building RAG systems for customer after customer. Every new project required navigating dozens of moving parts. 
He had to pick a vector database, select an embedding model, chunk the data, ingest it, search it, and iterate.</p><blockquote><p><em>&#8220;I was doing this a lot and I wasn&#8217;t getting the results I wanted.&#8221;</em> &#8212; Max</p></blockquote><p>The failures were concrete. Halfway through an ingestion run, Milvus would run out of memory. Two collections made it in. The third was broken. Without a checkpoint or resume function, he had to recompute everything from scratch.</p><blockquote><p><em>&#8220;The experiment doesn&#8217;t just run, it fails. You have to be able to pick up from the failure.&#8221;</em> &#8212; Max</p></blockquote><p>Another failure mode involved manually comparing Weaviate against Milvus. One configuration typo could lead to drawing the wrong conclusion.</p><blockquote><p><em>&#8220;You might end up thinking Weaviate is better than Milvus when actually your comparison was wrong.&#8221;</em> &#8212; Max</p></blockquote><p>This manual flywheel stole time from actually helping his customers ship their products. He burned days on reset and re-ingest cycles that failed halfway. Worse, he produced results he could not trust.</p><p>Most teams treat RAG as a simple setup task. They pick a vector database because it is trending online. They pick an embedding model because OpenAI is the safe default.</p><p>They guess a chunking strategy, guess the top-k retrieval parameters, and ship it. Then they spend the next six months vibe-checking the system.</p><p>Users complain. The team turns a configuration knob. Nobody knows whether it actually helped because nothing was measured.</p><blockquote><p><em>&#8220;There&#8217;s a lot of steps.&#8221;</em> &#8212; Max</p></blockquote><p>You lose the working system you thought you had. You burn weeks debugging silent ingestion failures because no trace exists.</p><p>Customer trust evaporates when the same question gets three different answers across releases.</p><p><em><strong>Max took the opposite bet. 
He built <a href="https://github.com/maximilien/weave-cli">Weave CLI</a>: a unified command-line tool for RAG over eleven vector databases.</strong></em></p><p>It features first-class observability implemented with <a href="https://github.com/comet-ml/opik">Opik</a>, an open-source evaluation and optimization tool, baked in from the first commit. You can try out their managed platform for free <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">here</a> for 25k spans/month.</p><p>By the end of this case study, you will understand how to unify your RAG stack so that switching a database, an embedding model, or an agent is merely a config change. You will learn how to measure everything, so every switch is tracked, evaluated and compared. Ultimately, you will learn how to benchmark your solution against multiple parameters to find the best configuration for your problem.</p><blockquote><p><em>&#8220;There&#8217;s no one solution. You iterate and evaluate.&#8221;</em> &#8212; Max</p></blockquote><p>But first, let&#8217;s understand what Weave CLI is and how it works.</p><h2>Understanding the System Architecture of Weave CLI</h2><p>Weave CLI wraps 11 vector databases behind a single interface. From the outside, it looks and feels like any other RAG system. On the ingestion side, it populates the chosen vector database with chunks, metadata, and embeddings. On the query side, it takes natural-language questions and returns top-k ranked chunks that an agent can use to create an answer with citations.</p><p>What makes Weave CLI special is that everything is swappable via a configuration file: the vector database, the embedding model, the chunking strategy, the query agent, the RAG agent that interprets the chunks and so on. 
The goal is to make it very easy for you to benchmark, iterate on and improve your RAG solution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EBUu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EBUu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 424w, https://substackcdn.com/image/fetch/$s_!EBUu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 848w, https://substackcdn.com/image/fetch/$s_!EBUu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 1272w, https://substackcdn.com/image/fetch/$s_!EBUu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EBUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png" width="1400" height="549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image 1. Weave CLI in one breath.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 1. Weave CLI in one breath." title="Image 1. Weave CLI in one breath." srcset="https://substackcdn.com/image/fetch/$s_!EBUu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 424w, https://substackcdn.com/image/fetch/$s_!EBUu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 848w, https://substackcdn.com/image/fetch/$s_!EBUu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 1272w, https://substackcdn.com/image/fetch/$s_!EBUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657a97ea-990e-4bc0-879e-9d1c1695c547_1400x549.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 1: Weave CLI in one breath.</em></figcaption></figure></div><p>Weave CLI is composed of seven core components, each swappable by configuration.</p><p>The user-facing component is the <a href="https://github.com/spf13/cobra">Cobra-based CLI</a> and the interactive REPL. Weave stack sits underneath as the deployment layer. It brings the whole system, including the databases, up or down, with a local Docker/Podman Compose fallback.</p><p>Behind that surface sits the intelligence layer. Ten built-in agents share an <code>AgentChain</code> sequencer. Agents are used both within the CLI and during ingestion. Weave CLI supports RAG, QA and summarization agents, but the more interesting use is during ingestion. 
For example, once you describe your data, the <code>SchemaAgent</code> proposes a collection schema and a vector-database fit, the <code>ChunkingAgent</code> recommends a chunking strategy, and an embedding provider is picked to match. The <code>Executor</code> drives a seven-step orchestration covering query analysis, planning, user confirmation, execution, reporting, display, and evaluation metrics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5j2b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5j2b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!5j2b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!5j2b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!5j2b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5j2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png" 
width="1400" height="1400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image 2. System architecture.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 2. System architecture." title="Image 2. System architecture." srcset="https://substackcdn.com/image/fetch/$s_!5j2b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!5j2b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!5j2b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!5j2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b5ddfba-3610-4349-b378-e31cdb6538d4_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: High-level system architecture.</em></figcaption></figure></div><p>The data layer is built around the <code>VectorDBClient</code> interface in <a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/vectordb/interfaces.go">src/pkg/vectordb/interfaces.go</a>. It is cleanly split into four sub-interfaces: <code>CollectionOperations</code>, <code>DocumentOperations</code>, <code>QueryOperations</code>, and <code>SchemaOperations</code>. 
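</p><p>To make the split concrete, here is a minimal sketch of how four narrow interfaces can compose into one client interface. It is a Python analogue written with <code>typing.Protocol</code>, not the project&#8217;s Go code: the class names mirror the interfaces named above, but every method name and signature, as well as the <code>InMemoryClient</code> adapter and the <code>search</code> helper, are assumptions made for illustration.</p>
<pre>
```python
from typing import Protocol


# Hypothetical method names: the article names the four interfaces, not their methods.
class CollectionOperations(Protocol):
    def create_collection(self, name: str) -> None: ...


class DocumentOperations(Protocol):
    def add_document(self, collection: str, doc_id: str, text: str) -> None: ...


class QueryOperations(Protocol):
    def query(self, collection: str, text: str, top_k: int) -> list[str]: ...


class SchemaOperations(Protocol):
    def get_schema(self, collection: str) -> dict[str, str]: ...


class VectorDBClient(
    CollectionOperations, DocumentOperations, QueryOperations, SchemaOperations, Protocol
):
    """Composite interface: the union of the four narrower ones."""


class InMemoryClient:
    """Toy adapter (an assumption); it satisfies VectorDBClient structurally."""

    def __init__(self) -> None:
        self._collections: dict[str, dict[str, str]] = {}

    def create_collection(self, name: str) -> None:
        self._collections.setdefault(name, {})

    def add_document(self, collection: str, doc_id: str, text: str) -> None:
        self._collections[collection][doc_id] = text

    def query(self, collection: str, text: str, top_k: int) -> list[str]:
        # A naive substring match stands in for real vector similarity search.
        needle = text.lower()
        docs = self._collections[collection]
        hits = [doc_id for doc_id, body in docs.items() if needle in body.lower()]
        return sorted(hits)[:top_k]

    def get_schema(self, collection: str) -> dict[str, str]:
        return {"id": "string", "text": "string"}


def search(db: QueryOperations, collection: str, question: str) -> list[str]:
    # The caller depends only on the slice of the interface it needs.
    return db.query(collection, question, top_k=3)


client: VectorDBClient = InMemoryClient()
client.create_collection("docs")
client.add_document("docs", "a", "Milvus ran out of memory during ingestion")
client.add_document("docs", "b", "Weaviate handled the same load")
print(search(client, "docs", "milvus"))  # prints ['a']
```
</pre>
<p>Because <code>search</code> is typed against <code>QueryOperations</code> only, any adapter that implements that one slice can be passed in, which is the usual motivation for splitting a large client interface into smaller ones.</p><p>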
A package-level factory registry in <a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/vectordb/factory.go">factory.go</a> registers all 11 adapter sub-packages using the ports-and-adapters pattern.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fQna!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fQna!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 424w, https://substackcdn.com/image/fetch/$s_!fQna!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 848w, https://substackcdn.com/image/fetch/$s_!fQna!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 1272w, https://substackcdn.com/image/fetch/$s_!fQna!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fQna!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png" width="1456" height="562" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!fQna!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 424w, https://substackcdn.com/image/fetch/$s_!fQna!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 848w, https://substackcdn.com/image/fetch/$s_!fQna!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 1272w, https://substackcdn.com/image/fetch/$s_!fQna!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1019df63-46ff-4c02-a7ed-e9cb14dfe5fe_2120x819.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Still, there is a trade-off to this design. A unified interface is a lowest common denominator by construction, so if you need PGVector&#8217;s transactional semantics or Neo4j&#8217;s native graph traversal as first-class features, a unified adapter costs you that expressiveness.</p><p>On top of the vector-database layer sit five embedding providers: OpenAI, sentence-transformers, Ollama, Cohere, and Voyage. The ingestion pipeline runs alongside these, handling file scanning, processing, and batching.</p><p>As of April 2026, Max has strong views on which vector database to choose. Weaviate is his default for cloud deployments. Pinecone is the pick for hosted solutions. OpenSearch covers self-hosted cloud. Milvus handles both local and cloud. 
Qdrant is his go-to for local use because its Rust implementation is low-memory and fast.</p><p>On top of the agent layer, we have the observability layer implemented using Opik and OpenTelemetry, along with an evaluation harness with four LLM judges. The evaluation harness is itself pluggable between a local evaluator and <a href="https://github.com/comet-ml/opik">Opik</a>.</p><p>The configuration is the source of truth for the whole stack. A <code>config.yaml</code> file holds the non-secret details of the vector database, agent, embedding model, and LLM, while secrets are loaded from a <code>.env</code> file. Check all the configs <a href="https://github.com/maximilien/weave-cli/tree/main/configs">here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l2Ox!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l2Ox!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 424w, https://substackcdn.com/image/fetch/$s_!l2Ox!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 848w, https://substackcdn.com/image/fetch/$s_!l2Ox!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 1272w, 
https://substackcdn.com/image/fetch/$s_!l2Ox!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l2Ox!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!l2Ox!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 424w, https://substackcdn.com/image/fetch/$s_!l2Ox!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 848w, https://substackcdn.com/image/fetch/$s_!l2Ox!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 1272w, 
https://substackcdn.com/image/fetch/$s_!l2Ox!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b40610-b299-4b36-9eec-a721dc5a399e_2244x1203.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let us trace a query through the system end-to-end using a concrete example: a user asks about Leica Noctilux lens auctions. The flow unfolds across nine hops, each an Opik span. 
First, the user submits the natural-language query to the REPL, which immediately starts monitoring the trace with Opik.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N6p7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N6p7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 424w, https://substackcdn.com/image/fetch/$s_!N6p7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 848w, https://substackcdn.com/image/fetch/$s_!N6p7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 1272w, https://substackcdn.com/image/fetch/$s_!N6p7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N6p7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png" width="1400" height="1356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1356,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image 3. Every hop in a RAG query is an Opik span.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 3. Every hop in a RAG query is an Opik span." title="Image 3. Every hop in a RAG query is an Opik span." srcset="https://substackcdn.com/image/fetch/$s_!N6p7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 424w, https://substackcdn.com/image/fetch/$s_!N6p7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 848w, https://substackcdn.com/image/fetch/$s_!N6p7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 1272w, https://substackcdn.com/image/fetch/$s_!N6p7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2a275f-baff-4366-ba01-bebe3442410a_1400x1356.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: The data flow of the RAG query execution.</em></figcaption></figure></div><p>The QueryAgent then validates and classifies the intent, passing control to the PlanningAgent to generate an execution plan. Next, the VectorDB adapter performs a semantic search to retrieve relevant documents. The ContextBuilder filters, deduplicates, and sorts these results before handing them to the RAGAgent, which generates the final answer with citations. Finally, the REPL ends the Opik trace, in which every step emits a span containing details such as cost, latency, and input/output.</p><h3>Why Build in Go and Not TypeScript or Python</h3><p>Most popular agentic CLIs and REPLs, such as Claude Code or OpenCode, are built in TypeScript. 
But Max, as a former Node.js board chairperson, strongly suggests just going with Go or Rust if memory constraints are a concern.</p><p>Why? Because Go apps ship as single binaries. Simple. Beautiful. It runs everywhere.</p><p>On-premise customers cannot rely on npm, uv, or JVM registries being reachable inside their own networks, and dependency pinning does not fix network isolation. A single compiled binary sidesteps the entire problem. Go&#8217;s track record as the language of Kubernetes is existing proof that this trade-off works for infrastructure tooling, which is exactly the category Weave CLI sits in. Max himself spent 10 years writing Go on those systems.</p><p>Max&#8217;s second argument is that language choice matters less than it used to, because AI coding assistants lower the learning-curve barrier across the board.</p><blockquote><p><em>&#8220;Most people don&#8217;t write code anymore.&#8221;</em> &#8212; Max</p></blockquote><p>Still, Max&#8217;s newer project (ClawMax.ai) is mostly in TypeScript because it is the best tool for that job, not because he switched allegiances.</p><blockquote><p><em>&#8220;The stack decision has to be what your system wants, not what the herd is doing.&#8221;</em> &#8212; Max</p></blockquote><p>Next, we zoom into the layers doing the heavy lifting: the ingestion pipeline and the unified VDB interface.</p><h2>Supporting 11 Vector Databases</h2><p>The vector database layer is where the ingestion pipeline meets the unified VDB interface. To see how they work together, we&#8217;ll trace Max&#8217;s Leica Noctilux auction catalog through the system one step at a time.</p><p>Each document in the catalog is a single lens listing. It contains a photo of the Noctilux, a short caption with the model number and condition, a price, and a few lines of provenance. The text is sparse. Most of the signal sits in the image itself, and the caption is just enough to disambiguate one Noctilux from another. 
That sparseness drives a multi-modal ingestion decision up front. The image and the surrounding caption are embedded into two separate collections, one keyed on image vectors and one on caption text vectors. At query time, the auction agent fans out to both collections and merges the results through the <code>ContextBuilder</code>.</p><p>Before the actual ingestion, we run a <code>FileScanner</code> that walks the 426 listing files on disk, applying glob matching, exclusion filters, and SHA-256 deduplication (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/pipeline/scanner.go">src/pkg/pipeline/scanner.go</a>). Re-running ingestion on the same directory skips unchanged documents, making this step fully idempotent and computationally cheap.</p><p>The <code>DocumentProcessor</code> extracts text and images from each listing (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/pipeline/processor.go">src/pkg/pipeline/processor.go</a>). For the Leica dataset, the PDF extractor pulls the caption text, and OCR runs on the lens photo to catch any model number printed on the barrel. This step is idempotent but computationally expensive due to PDF parsing and OCR, and it fails if the document format is unsupported. Next, the <code>ChunkingAgent</code> dynamically selects the best chunking strategy for each document.</p><div class="callout-block" data-callout="true"><p>&#128161; Chunking is a tier-1 knob. Public benchmarking shows that swapping between recursive, sentence-level, and token-level strategies can move retrieval accuracy by double-digit percentages on the same corpus <a href="https://research.trychroma.com/evaluating-chunking">[1]</a>.</p></div><p>After chunking, we move to embedding (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/embeddings/model_registry.go">src/pkg/embeddings/model_registry.go</a>). 
In the Leica flow, caption text flows through the text embedder, and image descriptors flow through a separate image embedding model. Raw images larger than per-backend limits (Milvus caps fields at 65 KB) get offloaded to S3/MinIO, leaving only a URL in the VDB payload. The default option is to use OpenAI&#8217;s embedding model, which is highly expensive in compute and API costs and can fail if you hit rate limits. When scaling, you can use open-source embeddings via Ollama. They run locally with no API key.</p><p>The <code>BatchWriter</code> writes documents in batches with durability features such as checkpointing and resume. When ingesting data at scale, network I/O failures and dropped database connections are common. With checkpointing, a failed run resumes from the last completed batch instead of starting over, so retrying is safe. Batch checkpointing is the difference between a short retry and a multi-hour rebuild.</p><blockquote><p><em>&#8220;You have to recompute everything from scratch, which is crazy.&#8221;</em> &#8212; Max</p></blockquote><p>The <code>VectorDBClient</code> interface sits at the core of the adapter pattern (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/vectordb/interfaces.go">src/pkg/vectordb/interfaces.go</a>) used to support the 11 databases. The project started with Weaviate. Milvus was surprisingly similar. Qdrant was also very similar. MongoDB was a different beast, but the interface still fit.</p><blockquote><p><em>&#8220;The biggest surprise was PGVector.&#8221;</em> &#8212; Max</p></blockquote><p>PGVector is the most incompatible on paper. Postgres is a relational database with its own migrations. Yet the unified interface fits.</p><p>The pipeline ends at any of the eleven vector databases (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/vectordb/factory.go">src/pkg/vectordb/factory.go</a>), emitting a final <code>vectordb.adapter</code> span. 
The 426 Leica listings are split into roughly 426 caption vectors in one collection and 426 image vectors in a parallel collection, both sharing listing IDs as the cross-reference key.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0znX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0znX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 424w, https://substackcdn.com/image/fetch/$s_!0znX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 848w, https://substackcdn.com/image/fetch/$s_!0znX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 1272w, https://substackcdn.com/image/fetch/$s_!0znX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0znX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png" width="1400" height="590" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image 4. A document's seven-hop journey from source to vector store.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 4. A document's seven-hop journey from source to vector store." title="Image 4. A document's seven-hop journey from source to vector store." srcset="https://substackcdn.com/image/fetch/$s_!0znX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 424w, https://substackcdn.com/image/fetch/$s_!0znX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 848w, https://substackcdn.com/image/fetch/$s_!0znX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 1272w, https://substackcdn.com/image/fetch/$s_!0znX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd956a09a-c56f-43ae-80ee-1a7c28ca7bd0_1400x590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: The data flow of the document ingestion pipeline.</em></figcaption></figure></div><p>These steps cover every component any production ingestion pipeline needs, and Weave CLI ensures each one is swappable by configuration (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/stack/ingest.go">src/pkg/stack/ingest.go</a>): the FileScanner, the DocumentProcessor, the ChunkingAgent, the embedding provider, the BatchWriter, the VectorDBClient interface, and the concrete VDB adapter.</p><p>During retrieval, when a user asks <code>weave query "summarise the 2024 auction catalogue"</code>, the <code>QueryAgent</code> classifies the intent, the <code>PlanningAgent</code> decides to hit both Leica collections, and the <code>VectorDB</code> adapter runs a semantic search on each. 
The <code>ContextBuilder</code> then merges the image-collection hits with the caption-collection hits, deduplicates by listing ID, sorts by relevance score, and extracts content in priority order (caption text first, image metadata second, URL fallback last) into a single prompt for the <code>RAGAgent</code>.</p><p>The ingestion pipeline and VDB interface are the skeleton of Weave CLI. The agent layer is what makes it feel like Claude Code for vector databases.</p><h2>Zooming into the REPL</h2><p>Weave CLI provides a Claude-Code-like experience for vector databases, which, at its core, is a Read-Eval-Print Loop (REPL) environment hooked up to multiple agents.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pAbT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pAbT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 424w, https://substackcdn.com/image/fetch/$s_!pAbT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 848w, https://substackcdn.com/image/fetch/$s_!pAbT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 1272w, 
https://substackcdn.com/image/fetch/$s_!pAbT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pAbT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png" width="1400" height="1344" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1344,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image 5. The Agent Layer up close.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 5. The Agent Layer up close." title="Image 5. The Agent Layer up close." 
srcset="https://substackcdn.com/image/fetch/$s_!pAbT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 424w, https://substackcdn.com/image/fetch/$s_!pAbT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 848w, https://substackcdn.com/image/fetch/$s_!pAbT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!pAbT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4424806c-ee3c-456b-bb7f-ae4514b9688b_1400x1344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: The Agent Layer up close.</em></figcaption></figure></div><p>Weave CLI ships with 12 built-in agents that you configure via YAML. Three of them are user-facing:</p><ol><li><p><strong>Precise QA</strong> &#8212; answers a question, and says it cannot answer when it lacks the information. Zero hallucination tolerance.</p></li><li><p><strong>RAG</strong> &#8212; finds the closest chunks and generates an answer over them. This is the default.</p></li><li><p><strong>Summarize</strong> &#8212; produces a short summary of retrieved chunks.</p></li></ol><p>&#128161; The beauty is that you can add or modify them as you please.</p><p>The next eight agents power the Claude-Code-like orchestration loop: the <code>QueryAgent</code> for intent classification, the <code>PlanningAgent</code> for the execution plan, the <code>WeaveAgent</code> for tool execution with retries, the <code>BashAgent</code> for safe command execution, the <code>RAGAgent</code> that the RAG persona dispatches to, the <code>OutputAgent</code> to format progress, the <code>ReportAgent</code> to generate operation reports, and the <code>EvalAgent</code> to track metrics.</p><p>The final two are domain helpers used during ingestion: the <code>ChunkingAgent</code> and the <code>SchemaAgent</code>.</p><p>As with the vector database layer, all the agents implement the same interface:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!7L_X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7L_X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 424w, https://substackcdn.com/image/fetch/$s_!7L_X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 848w, https://substackcdn.com/image/fetch/$s_!7L_X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 1272w, https://substackcdn.com/image/fetch/$s_!7L_X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7L_X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png" width="1456" height="378" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a78c406d-7dd3-4366-9375-25385f1357be_2418x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!7L_X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 424w, https://substackcdn.com/image/fetch/$s_!7L_X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 848w, https://substackcdn.com/image/fetch/$s_!7L_X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 1272w, https://substackcdn.com/image/fetch/$s_!7L_X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c406d-7dd3-4366-9375-25385f1357be_2418x627.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s tie everything together. When you submit a query, the <code>QueryAgent</code> classifies its intent and acts as a router. The <code>PlanningAgent</code> generates a plan of CLI commands. The <code>BashAgent</code> executes them and pipes the output through a command-line JSON processor for filtering. The <code>OutputAgent</code> formats the result. This is the Claude-Code-like loop in action.</p><p>The cherry on top is that the Weave CLI capabilities are also exposed as a Model Context Protocol (MCP) server. So instead of using the Weave CLI directly, you can use its full functionality through your harness of choice (Claude Code, Codex, etc.).</p><p>Twelve agents, eleven databases, five embedding providers, and multiple chunking strategies create a lot of surface area. Opik is what makes the whole thing observable when something breaks.</p><h2>Monitoring the System</h2><p>With so many moving parts, you need to know the system is working. 
<a href="https://github.com/comet-ml/opik">Opik</a> is how Weave CLI answers that question: it traces every LLM call, every agent step, and every database write as an OpenTelemetry span.</p><blockquote><p><em>&#8220;Using Opik to tell me how many LLM calls, tokens, and cost per query.&#8221;</em> &#8212; Max</p></blockquote><p>During development, Max tracked a bug in which documents appeared to be ingested but were never persisted to Milvus. The Opik trace waterfall showed the database flush operations were silently timing out.</p><p><em>&#128161; If you want to try it out, you can create an account for free on Opik&#8217;s managed platform <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">here</a> for 25k spans/month.</em></p><p>The fix was adding dedicated timeout contexts per collection. Without the trace, this would have been a multi-day hunt through logs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JSuf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JSuf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 424w, https://substackcdn.com/image/fetch/$s_!JSuf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 848w, 
https://substackcdn.com/image/fetch/$s_!JSuf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 1272w, https://substackcdn.com/image/fetch/$s_!JSuf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JSuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png" width="1400" height="1243" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1243,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1628767,&quot;alt&quot;:&quot;Image 6. Opik turns the RAG pipeline into a measurable waterfall.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 6. Opik turns the RAG pipeline into a measurable waterfall." title="Image 6. Opik turns the RAG pipeline into a measurable waterfall." 
srcset="https://substackcdn.com/image/fetch/$s_!JSuf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 424w, https://substackcdn.com/image/fetch/$s_!JSuf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 848w, https://substackcdn.com/image/fetch/$s_!JSuf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 1272w, https://substackcdn.com/image/fetch/$s_!JSuf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff46859e3-3d44-4ec2-bc07-cb30c6250d7d_1400x1243.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 6: Opik turns the RAG pipeline into a measurable waterfall.</em></figcaption></figure></div><p>The integration provides cost and latency visibility per trace. You see tokens and dollars per query, plus a latency breakdown, without writing custom logging.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!57X_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!57X_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 424w, https://substackcdn.com/image/fetch/$s_!57X_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 848w, https://substackcdn.com/image/fetch/$s_!57X_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 1272w, https://substackcdn.com/image/fetch/$s_!57X_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!57X_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png" width="1200" height="780.4945054945055" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:947,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;opik_monitoring_dashboard.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="opik_monitoring_dashboard.png" title="opik_monitoring_dashboard.png" srcset="https://substackcdn.com/image/fetch/$s_!57X_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 424w, https://substackcdn.com/image/fetch/$s_!57X_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 848w, https://substackcdn.com/image/fetch/$s_!57X_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 1272w, https://substackcdn.com/image/fetch/$s_!57X_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4aed6d3d-347f-4923-8c80-2df1d69fd2e2_2834x1844.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: <a href="https://github.com/comet-ml/opik">Opik&#8217;s</a> monitoring dashboard.</em></figcaption></figure></div><p>Finally, it provides error visibility to make silent failures loud.</p><p><strong>How hard was it to integrate Opik into Weave CLI?</strong></p><blockquote><p><em>&#8220;It&#8217;s a very straightforward integration &#8212; I pass all queries to the LLMs through Opik via OpenTelemetry, and then I query Opik to aggregate cost from the start of the command to the end.&#8221;</em> &#8212; Max</p></blockquote><p>Every step in the ingestion and retrieval data flows emits a span (<a 
href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/llm/opik.go">src/pkg/llm/opik.go</a>). These spans are aggregated under traces, each containing all the steps between a user request and its response.</p><p>Each trace includes the query, the LLM reasoning, the tool calls, and the final response. The executor initializes Opik tracing here (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/executor/executor.go">src/pkg/executor/executor.go</a>).</p><p>Monitoring helps you debug your system. Evaluation moves everything forward, allowing you to quantify your application&#8217;s performance.</p><h2>Evaluating the Default Setup</h2><p>How do you know your agent is actually better after you swap an embedding model, a vector database, or your chunking strategy? You need a good evaluation practice.</p><blockquote><p><em>&#8220;My customers always have five or six questions they ask every release to sanity-check the system. They know what to expect. So I took their QA questions and made them the baseline eval dataset.&#8221;</em> &#8212; Max</p></blockquote><p>Evaluation datasets should come from real user behavior anchored in your business use case, not from standardized, generic benchmarks. If you do not have users yet, compile a small set of sanity-check questions a domain expert would actually ask.</p><p><strong>How does this work in Weave CLI?</strong></p><p>You start by defining an evaluation dataset in YAML format. 
This includes the query, expected answer, expected citations, and a minimum relevance score.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aJ_u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aJ_u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 424w, https://substackcdn.com/image/fetch/$s_!aJ_u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 848w, https://substackcdn.com/image/fetch/$s_!aJ_u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 1272w, https://substackcdn.com/image/fetch/$s_!aJ_u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aJ_u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png" width="1456" height="923" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:923,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!aJ_u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 424w, https://substackcdn.com/image/fetch/$s_!aJ_u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 848w, https://substackcdn.com/image/fetch/$s_!aJ_u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 1272w, https://substackcdn.com/image/fetch/$s_!aJ_u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb02ca7f-37d7-462d-b6d4-7c00d649b34e_2302x1459.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is the full <a href="https://github.com/maximilien/weave-cli/blob/main/evals/datasets/baseline.yaml">baseline.yaml</a> file. 
This is how the same dataset looks in Opik:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lT6I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lT6I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 424w, https://substackcdn.com/image/fetch/$s_!lT6I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 848w, https://substackcdn.com/image/fetch/$s_!lT6I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!lT6I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lT6I!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png" width="1200" height="581.0439560439561" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;opik_dashboard_dataset.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="opik_dashboard_dataset.png" title="opik_dashboard_dataset.png" srcset="https://substackcdn.com/image/fetch/$s_!lT6I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 424w, https://substackcdn.com/image/fetch/$s_!lT6I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 848w, https://substackcdn.com/image/fetch/$s_!lT6I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!lT6I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e2b4d24-959a-40ec-bece-a2eb48c9d76b_2840x1376.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 8: <a href="https://github.com/comet-ml/opik">Opik&#8217;s</a> dataset dashboard.</em></figcaption></figure></div><p>Then you pick an evaluator harness that includes a set of metrics to evaluate against. This harness is itself pluggable: you pick between a local evaluator and Opik (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/evaluation/provider.go">src/pkg/evaluation/provider.go</a>).</p><p>We use two families of evaluators. Rule-based evaluators use regular expressions, exact matches, and citation presence (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/evaluation/custom_evaluator.go">src/pkg/evaluation/custom_evaluator.go</a>) to compute metrics such as <code>CitationMatching</code> for the RAG agent.</p><p>They are fast, deterministic, and free. 
You use them for structural checks.</p><p>The second family uses an LLM as a judge. Weave CLI ships four of these judges (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/evaluation/provider_opik.go">src/pkg/evaluation/provider_opik.go</a>). They evaluate Accuracy, Faithfulness, Hallucination, and Context Relevance.</p><p>They are slower and cost tokens. You use them for semantic quality.</p><blockquote><p><em>&#8220;The hallucination, citation, and accuracy metrics are all from Opik&#8217;s library &#8212; I ported them to Golang.&#8221;</em> &#8212; Max</p></blockquote><div class="callout-block" data-callout="true"><p>&#128161; One key step most people forget is to align the LLM judge with the human expert. In our use case, the correlation between an LLM judge&#8217;s faithfulness score and human judgment hovers around 0.55. Judges are a signal, not a ground truth. For example, I spent three weeks, on average, labeling few-shot examples and computing agreement scores before I trusted my own judge.</p></div><p>Then, you run the evaluation command against a chosen agent. Finally, you compare the result of the experiment with the previous run. 
Each pair of agent and dataset is one experiment (<a href="https://github.com/maximilien/weave-cli/tree/main/src/cmd/eval/run.go">src/cmd/eval/run.go</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HxOa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HxOa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 424w, https://substackcdn.com/image/fetch/$s_!HxOa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 848w, https://substackcdn.com/image/fetch/$s_!HxOa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 1272w, https://substackcdn.com/image/fetch/$s_!HxOa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HxOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png" width="1400" height="639" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:639,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:301887,&quot;alt&quot;:&quot;Image 7. The evaluation spine.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 7. The evaluation spine." title="Image 7. The evaluation spine." srcset="https://substackcdn.com/image/fetch/$s_!HxOa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 424w, https://substackcdn.com/image/fetch/$s_!HxOa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 848w, https://substackcdn.com/image/fetch/$s_!HxOa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 1272w, https://substackcdn.com/image/fetch/$s_!HxOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee8df39c-8c74-42f1-ab0c-f4bb96f2f044_1400x639.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 9: The evaluation spine.</em></figcaption></figure></div><p>The <code>--use-opik</code> flag ships every trace and evaluation result to Opik (<a href="https://github.com/maximilien/weave-cli/tree/main/src/pkg/evaluation/runner.go">src/pkg/evaluation/runner.go</a>). 
Once in <a href="https://github.com/comet-ml/opik">Opik</a>, you get dataset management and experiment comparison.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mdXv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mdXv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 424w, https://substackcdn.com/image/fetch/$s_!mdXv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 848w, https://substackcdn.com/image/fetch/$s_!mdXv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 1272w, https://substackcdn.com/image/fetch/$s_!mdXv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mdXv!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png" width="1200" height="763.1868131868132" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:926,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;opik_dashboard_experiments&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="opik_dashboard_experiments" title="opik_dashboard_experiments" srcset="https://substackcdn.com/image/fetch/$s_!mdXv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 424w, https://substackcdn.com/image/fetch/$s_!mdXv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 848w, https://substackcdn.com/image/fetch/$s_!mdXv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 1272w, https://substackcdn.com/image/fetch/$s_!mdXv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed8789c-7b27-4092-b23c-f800edad149f_2838x1804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 10: Opik&#8217;s experiments dashboard.</em></figcaption></figure></div><p>Scoring every run forces a decision on which agent to ship. Benchmarking on top of your custom datasets provides a structured way to choose a parameter, such as your chunking strategy or top k results, without guessing.</p><h2>Benchmarking and Optimizing the System</h2><p>An experiment is a single parameterized run over an agent, dataset, embedding, chunking strategy, database, and judge. A benchmark is a structured set of experiments.</p><p>You hold most variables constant to isolate the effect of one. Benchmarking is how you turn random runs into a parameter- and prompt-search problem. This is often known as the optimization flywheel.</p><blockquote><p><em>&#8220;That&#8217;s the reason I created Weave CLI. 
Because this is tedious, but also error-prone.&#8221;</em> &#8212; Max</p></blockquote><p>Every benchmark is one configuration typo away from drawing the wrong conclusion. Disciplined benchmarking catches that error.</p><p>Experiment metadata guarantees reproducibility. Every experiment records the database, embedding model, chunking strategy, dataset, and everything else required to reproduce it. That&#8217;s usually the whole config.</p><p>Opik tracks this out of the box. Without it, a benchmark from four weeks ago is useless.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MCDv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MCDv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 424w, https://substackcdn.com/image/fetch/$s_!MCDv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 848w, https://substackcdn.com/image/fetch/$s_!MCDv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 1272w, https://substackcdn.com/image/fetch/$s_!MCDv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MCDv!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png" width="1200" height="766.4835164835165" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;opik_dashboard_experiment.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="opik_dashboard_experiment.png" title="opik_dashboard_experiment.png" srcset="https://substackcdn.com/image/fetch/$s_!MCDv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 424w, https://substackcdn.com/image/fetch/$s_!MCDv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 848w, https://substackcdn.com/image/fetch/$s_!MCDv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 1272w, https://substackcdn.com/image/fetch/$s_!MCDv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15b350ec-7b0d-4014-a58e-7db286550336_2830x1808.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a><figcaption class="image-caption"><em>Image 11: <a href="https://github.com/comet-ml/opik">Opik&#8217;s</a> experiment dashboard.</em></figcaption></figure></div><p>When working on RAG systems, the optimization flywheel involves resetting the database, re-ingesting data with new parameters, re-evaluating, and comparing on your metrics of choice.</p><blockquote><p><em>&#8220;Benchmark is comparing multiple agents side by side. 
Same dataset, different agents &#8212; and each (agent, dataset) combination is its own experiment, you can compare later with its metadata.&#8221;</em> &#8212; Max</p></blockquote><p>You fix a baseline dataset and hold it constant. You vary one axis, typically the agent. You score against multiple metrics.</p><p>Each pair of agent and dataset is one <a href="https://github.com/comet-ml/opik">Opik</a> experiment. You compare them side-by-side to spot regressions and unexpected wins.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DwWF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DwWF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 424w, https://substackcdn.com/image/fetch/$s_!DwWF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 848w, https://substackcdn.com/image/fetch/$s_!DwWF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 1272w, https://substackcdn.com/image/fetch/$s_!DwWF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DwWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png" width="1400" height="1324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1324,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image 8. The optimization flywheel.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image 8. The optimization flywheel." title="Image 8. The optimization flywheel." 
srcset="https://substackcdn.com/image/fetch/$s_!DwWF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 424w, https://substackcdn.com/image/fetch/$s_!DwWF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 848w, https://substackcdn.com/image/fetch/$s_!DwWF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 1272w, https://substackcdn.com/image/fetch/$s_!DwWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d52fb2-ed34-4ad8-a77c-56e4c202260b_1400x1324.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 12: The optimization flywheel.</em></figcaption></figure></div><p>You trigger this loop via the command line with <code>weave eval run --dataset baseline --agents precise-qa,rag,summarize --use-opik</code>. Every subsequent benchmark streams into the same Opik project.</p><p>Max ran this loop for his Leica auction customer. He held the dataset and agent constant.</p><p>He varied only the embedding provider. He tested OpenAI against sentence-transformers. The open-source model won on quality by 11 percent.</p><p>It was also 240 times faster for re-embedding. The vectors were 50 percent smaller, and the cost was zero.</p><p>This is a counterintuitive outcome. Without a structured benchmark, Max would have defaulted to OpenAI and been wrong.</p><h3>How to Keep the Flywheel Under Control?</h3><p>This optimization process involves running your ingestion and retrieval hundreds of times, which can get costly fast. Super fast. Ingestion checkpointing makes it affordable.</p><p>Still, you should optimize your system in order of cheapest-to-change, biggest-win-first <a href="https://jxnl.co/writing/2024/02/28/levels-of-complexity-rag-applications/">[8]</a>. First, tune retrieval parameters like top-K. They are free to change and often provide the biggest wins.</p><p>Second, tune the embedding model. It is the next-cheapest component to swap and has a huge impact. Third, tune the chunking strategy. It requires re-ingestion but offers moderate quality gains.</p><p>Finally, tune the vector database. 
It has the highest switching cost and usually the smallest difference in quality.</p><p>The optimization flywheel effectively isolates variables, but it remains a manual process today.</p><p>The good news is that Weave CLI is heading toward fully automated hyperparameter optimization across databases, embeddings, and chunking strategies. Just imagine. You will launch it before the weekend, and it will return on Monday with the best configuration for your dataset.</p><div class="callout-block" data-callout="true"><p>&#128173; P.S. If you want to use Weave CLI but think it&#8217;s missing a feature, Max is more than happy to add it. Just open a PR/issue on the repository.</p></div><p><em>You can reproduce this benchmark step by step on your own stack by following <a href="https://github.com/maximilien/weave-cli/blob/main/demos/opik/DEMO.md">this doc</a>.</em></p><p><em>Watch our full interview on YouTube for all the 3am stories &#8595;</em></p><div id="youtube2-eYaWxljC4sA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;eYaWxljC4sA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/eYaWxljC4sA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Final Thoughts</h2><blockquote><p><em>Looking back, what was the hardest thing to implement, and what surprised you the most while building weave-cli?</em> &#8212; Paul</p></blockquote><p>The hardest part was designing a unified <code>VectorDBClient</code> that felt natural across 11 providers with wildly different APIs. 
The adapter pattern was the insight that made it work.</p><p>The biggest surprise was benchmarking OSS embeddings against OpenAI on the client&#8217;s data and finding them 11% higher quality, 240x faster, and free. A call we&#8217;d never have made without evals in place.</p><blockquote><p><em>If you had to rebuild Weave CLI from scratch, at what point would you introduce monitoring and evaluation? Would you do it earlier, later, or at the same time?</em> &#8212; Paul</p></blockquote><p>I&#8217;d introduce monitoring from day one. Having <a href="https://github.com/comet-ml/opik">Opik</a> traces during the early vector DB work would have immediately surfaced issues such as the silent Milvus persistence failures, which we debugged manually. As for evals, I&#8217;d keep them at the same stage (after the core RAG pipeline was functional), but I&#8217;d design the harness interface up front for citation tracking and confidence scoring.</p><p><a href="https://github.com/comet-ml/opik">Opik</a> was easy to integrate and was key to getting the client dashboard working, since I could just run experiments and use evaluations and tracing to decide on the best options for the client.</p><p>Now, your <strong>next practical step</strong> is to experiment with <a href="https://github.com/maximilien/weave-cli">Weave CLI</a> on a real problem. Point it at 100 documents you want to do RAG on, ingest everything into two collections with two different embedding providers, and run the benchmark against the baseline evaluation dataset.</p><p>You can follow the step-by-step tutorial <a href="https://github.com/maximilien/weave-cli/blob/main/demos/opik/DEMO.md">here</a>.</p><p><em>But here is what I&#8217;m wondering:</em></p><p><strong>While building your latest RAG system, what was your strategy to find the right parameters, such as the embedding model, chunking, or retrieval strategies?</strong></p><p><em>Click the button below and tell me. 
I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/ship-rag-with-weave-cli/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/ship-rag-with-weave-cli/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring this case study and keeping it free!</em></p><div><hr></div><h2>References</h2><ol><li><p>Chroma. (n.d.). Evaluating Chunking Strategies for Retrieval. Chroma. <a href="https://research.trychroma.com/evaluating-chunking">https://research.trychroma.com/evaluating-chunking</a></p></li><li><p>OpenTelemetry. (n.d.). Traces &amp; Spans specification. OpenTelemetry. <a href="https://opentelemetry.io/docs/concepts/signals/traces/">https://opentelemetry.io/docs/concepts/signals/traces/</a></p></li><li><p>Husain, H. 
(n.d.). Creating a LLM-as-a-Judge That Drives Business Results. Hamel Husain. <a href="https://hamel.dev/blog/posts/llm-judge/">https://hamel.dev/blog/posts/llm-judge/</a></p></li><li><p>Husain, H. (n.d.). Escaping POC Purgatory: Evaluation-Driven Development for AI. Hamel Husain. <a href="https://hamel.dev/blog/posts/evals/">https://hamel.dev/blog/posts/evals/</a></p></li><li><p>Liu, J. (2025, May 19). There Are Only 6 RAG Evals. Jason Liu. <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/">https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/</a></p></li><li><p>Comet. (n.d.). Opik &#8212; LLM observability &amp; evaluation platform. GitHub. <a href="https://github.com/comet-ml/opik">https://github.com/comet-ml/opik</a></p></li><li><p>Yan, E. (2024, August 18). Evaluating the Effectiveness of LLM Evaluators (LLM-as-Judge). Eugene Yan. <a href="https://eugeneyan.com/writing/llm-evaluators/">https://eugeneyan.com/writing/llm-evaluators/</a></p></li><li><p>Liu, J. (2024, February 28). Levels of Complexity: RAG Applications. Jason Liu. <a href="https://jxnl.co/writing/2024/02/28/levels-of-complexity-rag-applications/">https://jxnl.co/writing/2024/02/28/levels-of-complexity-rag-applications/</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Stop Orchestrating AI Agents. 
Use Ralph Loops Instead.]]></title><description><![CDATA[How one simple loop beats multi-agent orchestration and context rot in production.]]></description><link>https://www.decodingai.com/p/ralph-loops</link><guid isPermaLink="false">https://www.decodingai.com/p/ralph-loops</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 23 Apr 2026 11:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!p_kr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When building Brown, my writing assistant, I designed five specialized LLM nodes. One handled the introduction, another wrote sections, and others managed the conclusion, title, and editing. It became complicated, slow, and expensive.</p><p>I eventually collapsed the system into two agents: a writer and a reviewer operating in a loop. The simpler version performed better. The model retained the full context, and verification became a simple review step rather than a massive orchestration problem.</p><p>Most AI teams hit this exact wall. Developers spend more time babysitting AI than engineering, copying error logs and re-prompting models. The real bottleneck is the human.</p><p>Three failure modes explain why.</p><p>First, <strong>context rot.</strong> In long AI conversations, the context window becomes a junk drawer. Every failed attempt piles up until the sliding window drops the original specification. The model slides into a &#8220;dumb zone&#8221; where it hallucinates and forgets its goals. Traditional fixes like summarizing break down over dozens of reasoning rounds.</p><p>Second, <strong>premature exit.</strong> AI agents declare victory too early. 
Anthropic&#8217;s research notes that agents usually look around, see that progress has been made, and declare the job done <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">[1]</a>. Standard ReAct loops inherit the flaw.</p><p>Third, <strong>single-pass fragility.</strong> One prompt, one context, one shot. When it fails, the failure is chaotic. Jumping to multi-agent orchestration introduces distributed systems nightmares.</p><p>Ralph loops break the cycle by making &#8220;try again with fresh eyes&#8221; the default. Named after Ralph Wiggum from The Simpsons, the pattern wipes the conversation, reloads the full specification fresh each iteration, and uses the filesystem and git as the memory layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p_kr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p_kr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 424w, https://substackcdn.com/image/fetch/$s_!p_kr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 848w, https://substackcdn.com/image/fetch/$s_!p_kr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!p_kr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p_kr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png" width="1400" height="1275" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1275,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1214355,&quot;alt&quot;:&quot;Top shows context accumulating until the model forgets. Bottom shows state living on disk where each turn starts clean.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Top shows context accumulating until the model forgets. Bottom shows state living on disk where each turn starts clean." title="Top shows context accumulating until the model forgets. Bottom shows state living on disk where each turn starts clean." 
srcset="https://substackcdn.com/image/fetch/$s_!p_kr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 424w, https://substackcdn.com/image/fetch/$s_!p_kr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 848w, https://substackcdn.com/image/fetch/$s_!p_kr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 1272w, https://substackcdn.com/image/fetch/$s_!p_kr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a12d8a1-8961-409a-958c-2b398c62ed60_1400x1275.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>Image 1: The top shows context accumulating until the model forgets. The bottom shows the state living on disk, where each turn starts clean.</em></figcaption></figure></div><p>Ralph loops also remove the AI&#8217;s ability to grade its own work, using objective signals such as passing tests or linters to call the job done. Boris Cherny, creator of Claude Code, states that giving Claude a way to verify its work increases quality two to three times <a href="https://x.com/bcherny/status/2007179832300581177">[2]</a>.</p><p>One model. One loop. One verification signal. 
Failure becomes predictable: the loop catches errors and re-prompts automatically, creating a relatively deterministic feedback loop that can substantially raise the quality of the agent&#8217;s output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dR8u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dR8u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 424w, https://substackcdn.com/image/fetch/$s_!dR8u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 848w, 
https://substackcdn.com/image/fetch/$s_!dR8u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 1272w, https://substackcdn.com/image/fetch/$s_!dR8u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dR8u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png" width="1400" height="724" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:766461,&quot;alt&quot;:&quot;The Ralph loop. One model, one task per iteration, filesystem and git as memory, objective verification as the only exit.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Ralph loop. One model, one task per iteration, filesystem and git as memory, objective verification as the only exit." title="The Ralph loop. One model, one task per iteration, filesystem and git as memory, objective verification as the only exit." 
srcset="https://substackcdn.com/image/fetch/$s_!dR8u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 424w, https://substackcdn.com/image/fetch/$s_!dR8u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 848w, https://substackcdn.com/image/fetch/$s_!dR8u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 1272w, https://substackcdn.com/image/fetch/$s_!dR8u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922e43de-ab6a-486e-8b41-2e5dd3bf3851_1400x724.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: The Ralph loop. One model, one task per iteration, filesystem and git as memory, objective verification as the only exit.</em></figcaption></figure></div><p>Now, let&#8217;s look at what Ralph loops are and when you can actually use them in practice.</p><div class="callout-block" data-callout="true"><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Go Deeper Into Production AI Engineering (Product)</a></h2><p>Ralph loops prove that most of the leverage lies in the harness, not the model. 
If you want to master how to design, verify, and ship those AI harnesses in production, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p><em>Rated 5/5 by 300+ students. 
The first 6 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p></div><h2>What Ralph Loops Are and How They Work</h2><p>Geoffrey Huntley named the pattern after Ralph Wiggum from The Simpsons, noting the character tries the same thing over and over until it works. Huntley&#8217;s motto captures the philosophy: the technique is deterministically bad in an undeterministic world. The simplest implementation is a bash while-true loop that pipes a prompt file into the agent forever, acting as a continuous harness pattern <a href="https://ghuntley.com/ralph/">[3]</a>, <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[4]</a>.</p><p>As Einstein reportedly said: &#8220;Insanity is doing the same thing over and over again and expecting different results.&#8221; Well... I am sure he didn&#8217;t predict the rise of Claude Code, because that&#8217;s exactly what Ralph loops are all about.</p><p>Models are stochastic, strong at reading large contexts but imperfect on first pass. Re-running the same instruction forces self-review. The first iteration produces good but flawed output.</p><p>During the second pass, the model spots what it missed and refactors. The third iteration handles cleanup. 
Huntley delivered a minimum viable product quoted at $50,000 for just $297 in tokens using a single Ralph loop: roughly a 170x cost reduction over the human estimate <a href="https://ghuntley.com/ralph/">[3]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7XzG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7XzG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!7XzG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!7XzG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!7XzG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7XzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Single-pass fails chaotically. Multi-agent deadlocks on shared state. Ralph loops isolate one task per iteration and use verification as the exit gate.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Single-pass fails chaotically. Multi-agent deadlocks on shared state. Ralph loops isolate one task per iteration and use verification as the exit gate." title="Single-pass fails chaotically. Multi-agent deadlocks on shared state. Ralph loops isolate one task per iteration and use verification as the exit gate." 
srcset="https://substackcdn.com/image/fetch/$s_!7XzG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!7XzG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!7XzG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!7XzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5267ce9-c548-40a1-959b-8416f9387c06_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 3: Single-pass fails chaotically. Multi-agent deadlocks on shared state. Ralph loops isolate one task per iteration and use verification as the exit gate.</em></figcaption></figure></div><p>You can run these loops in two modes. Shared context keeps the session alive for explicit self-review. Fresh context starts a new session each iteration, removing confirmation bias: the model sees only the repository and the skill file.</p><p>Nowadays, it&#8217;s common to replace brittle n8n workflows with a single Claude Code skill. 
This becomes even more powerful when running in a Ralph loop, especially because at the end of each run, you can take the signal and tell the model to update the skill with anything it should have done differently.</p><p>The skill evolves and quality improves automatically <a href="https://read.readwise.io/read/01kp5bgy8b07y256ythvkz2tt7">[5]</a>.</p><p>This self-improving mechanism reduces manual prompt tuning when applied to specific, repetitive engineering tasks.</p><h2>Three Real-World Use Cases</h2><p>In practice, Ralph loops don&#8217;t have a single canonical implementation. They are more of an intuitive strategy you can get creative with, so there are multiple ways to implement them.</p><p>From my experience with Claude, you have three options for running Ralph loops, from highest abstraction to lowest:</p><ul><li><p><code>/ralph-loop</code><strong> plugin</strong> &#8212; the fastest path. 
Install it, run <code>/ralph-loop</code> in your session, and it manages the cycle for you.</p></li><li><p><code>/loop</code><strong> command</strong> &#8212; Claude Code&#8217;s built-in scheduler. <code>/loop every 1 minute /your-skill</code> fires the skill on a schedule <a href="https://www.anthropic.com/engineering/claude-code-best-practices">[6]</a>.</p></li><li><p><code>while true</code><strong> bash loop</strong> &#8212; the most primitive form. A one-liner that pipes a prompt file into the agent and restarts it forever.</p></li></ul><p>Because Claude Code keeps state through the files it&#8217;s working on, it retains context from the failed attempt and reads its own git diffs. Each iteration learns from the last.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mJx5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mJx5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 424w, https://substackcdn.com/image/fetch/$s_!mJx5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 848w, https://substackcdn.com/image/fetch/$s_!mJx5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mJx5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mJx5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png" width="1400" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e828a6ca-601e-42ce-8973-5e7817283061_1400x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Stop Hook turns objective signals into the loop's only exit condition.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Stop Hook turns objective signals into the loop's only exit condition." title="The Stop Hook turns objective signals into the loop's only exit condition." 
srcset="https://substackcdn.com/image/fetch/$s_!mJx5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 424w, https://substackcdn.com/image/fetch/$s_!mJx5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 848w, https://substackcdn.com/image/fetch/$s_!mJx5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 1272w, https://substackcdn.com/image/fetch/$s_!mJx5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe828a6ca-601e-42ce-8973-5e7817283061_1400x742.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: The Stop Hook turns objective signals into the loop&#8217;s only exit condition.</em></figcaption></figure></div><h3>Implementing a ticket backlog with test-driven development</h3><p>You can set up a ticket folder with numbered text files. Run a while-true loop that tells Claude to implement the next most important ticket using Test-Driven Development (TDD). The model writes tests first, writes the code, commits the changes, and moves on.</p><p>Claude reads all tickets, skips completed ones, picks the next priority, implements it, marks it done, and commits. No dependency graph is needed because the model decides the ordering on the fly. One dumb loop acts like a relentless single-threaded engineer working through the backlog.</p><p>For example, set up a <code>doc/tickets</code> folder with numbered tickets (001, 002, 003...). Each describes a feature or fix. 
Then run:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">while true; do
  claude "implement the next most important ticket using TDD principles from doc/tickets. commit when done"
done</code></pre></div><p>Or use Claude Code&#8217;s built-in loop:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">/loop every 1 minute
build the next ticket from doc/tickets using TDD, run tests, commit when done</code></pre></div><h3>Adding test coverage</h3><p>You can set a concrete goal to raise coverage from 16 percent to 95 percent. The loop reads coverage metrics, writes tests for uncovered functions, runs the suite, identifies gaps, and iterates.</p><p>The coverage report provides the objective backpressure. The loop does not stop until the numbers validate success. Each iteration chips away at untested code paths until the threshold is met.</p><p>The implementation is as easy as:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">while true; do
  claude "analyze coverage gaps, write tests for uncovered functions, run the test suite, fix failures. stop when coverage exceeds 95%"
done</code></pre></div><h3>Framework and dependency migrations</h3><p>Migrations require crisp completion criteria. Upgrading React v16 to v19, Next.js 14 to 15, or migrating Jest to Vitest demands a clean build and passing tests. The agent swaps syntax, updates dependencies, and runs build commands.</p><p>It uses compiler errors and failing tests as feedback. Each cycle fixes a batch of errors until the toolchain confirms the code is clean. Deterministic verification signals make framework migrations the perfect Ralph loop candidate.</p><p>These are three concrete starting points. Before you wire the first one up, there is one honest limit you should know.</p><h2>What&#8217;s Next</h2><p>Ralph loops are the starting point. Once comfortable, add self-improving skills that update their instructions after each run, wire stop hooks for objective quality gates to avoid infinite loops, and connect to external systems like Linear or GitHub Issues so the loop reacts to new work automatically.</p><p>The pattern scales further than it looks. OpenAI&#8217;s Codex team shipped one million lines of code across 1,500 pull requests with zero human-written code using what they call a &#8220;Ralph Wiggum Loop&#8221; <a href="https://openai.com/index/harness-engineering-codex/">[7]</a>.</p><p>These loops are safe when repo-contained and the toolchain acts as the judge. They get dangerous with irreversible side effects outside the repo. Alexey Grigorev learned this when a Claude Code agent ran <code>terraform destroy</code> on DataTalks.Club&#8217;s production infrastructure, wiping the database, VPC, and all automated snapshots &#8212; two and a half years of data gone in one iteration. If your loop can destroy shared state, review every plan manually <a href="https://alexeygrigorev.com/posts/dropped-production-database/">[8]</a>.</p><p><em><strong>What is the first piece of work in your repo you would trust a Ralph loop with? 
A TDD backlog, a coverage ramp, a framework migration, or something else?</strong></em></p><p><em>Click the button below and tell me. I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/ralph-loops/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/ralph-loops/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to restack this for your readers.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/ralph-loops?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/ralph-loops?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><div class="callout-block" data-callout="true"><h4>Whenever you&#8217;re ready, here is how I can help you</h4><p>If you want to go from zero to shipping production-grade AI agents, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p><em>Rated 5/5 by 300+ students. 
The first 6 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><p><em>Not ready to commit?</em> Start with our <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free Agentic AI Engineering Guide</a></strong>, a 6-day email course on the mistakes that silently break AI agents in production.</p></div><div><hr></div><h2>References</h2><ol><li><p>Anthropic. (n.d.). Effective Harnesses for Long-Running Agents. Anthropic. <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p></li><li><p>Cherny, B. (n.d.). I&#8217;m Boris and I Created Claude Code. X. <a href="https://x.com/bcherny/status/2007179832300581177">https://x.com/bcherny/status/2007179832300581177</a></p></li><li><p>Huntley, G. (n.d.). Ralph Wiggum as a &#8220;software engineer&#8221;. Geoffrey Huntley. <a href="https://ghuntley.com/ralph/">https://ghuntley.com/ralph/</a></p></li><li><p>LangChain. (n.d.). The Anatomy of an Agent Harness. LangChain Blog. <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">https://blog.langchain.com/the-anatomy-of-an-agent-harness/</a></p></li><li><p>Parsons, C. (n.d.). Ralph Loops: Build Dumb AI Loops That Ship. AI Engineer. 
<a href="https://read.readwise.io/read/01kp5bgy8b07y256ythvkz2tt7">https://read.readwise.io/read/01kp5bgy8b07y256ythvkz2tt7</a></p></li><li><p>Anthropic. (n.d.). Claude Code: Best Practices for Agentic Coding. Anthropic. <a href="https://www.anthropic.com/engineering/claude-code-best-practices">https://www.anthropic.com/engineering/claude-code-best-practices</a></p></li><li><p>Lopopolo, R. (n.d.). Harness engineering: leveraging Codex in an agent-first world. OpenAI. <a href="https://openai.com/index/harness-engineering-codex/">https://openai.com/index/harness-engineering-codex/</a></p></li><li><p>Grigorev, A. (n.d.). How I Dropped Our Production Database and Now Pay 10% More for AWS. Alexey Grigorev. <a href="https://alexeygrigorev.com/posts/dropped-production-database/">https://alexeygrigorev.com/posts/dropped-production-database/</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Karpathy Named It. I Built One on My Notes.]]></title><description><![CDATA[A deep research agent over my notes, highlights, and transcripts, grounded in years of curated thinking, not the public web.]]></description><link>https://www.decodingai.com/p/llm-knowledge-base-obsidian-readwise-notebooklm</link><guid isPermaLink="false">https://www.decodingai.com/p/llm-knowledge-base-obsidian-readwise-notebooklm</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 21 Apr 2026 08:00:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WeAK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been building what Andrej Karpathy calls <strong>an LLM Knowledge Base</strong> on top of my private data for the past few months &#8212; without realizing that was the name for it. 
Now that it is such a hot topic, I want to share my own twist on it. My design is similar to Andrej&#8217;s, but it differs in how I approach the problem.</p><p>I keep my notes in Obsidian, my reading in Readwise, and my topical research in NotebookLM. Each tool is excellent in isolation, but no AI can reach across all three.</p><p>Whenever I reach for a general-purpose deep-research tool like Perplexity or Gemini Deep Research, it just searches the public web. Every user gets the exact same sources, and the resulting article reads like everyone else&#8217;s. What I actually want to research is my own curated thinking.</p><p>I want to leverage the books I highlighted, the notes I wrote, and the transcripts I dumped into NotebookLM. That is the edge. That is the signal nobody else has.</p><p>To solve this, I built a deep research agent as three Claude Code skills. The <code>/research_create</code>, <code>/research_search</code>, and <code>/research_distill</code> skills run on top of my private data via the <code>obsidian</code>, <code>readwise</code>, and <code>nlm</code> command-line interfaces (CLIs).</p><p>The system uses multi-round query expansion with gap analysis between rounds. It outputs a <code>memory/</code> folder with an <code>index.yaml</code> file that acts as a progressive-disclosure wiki over the source files. We also apply post-processing, including deduplication and re-ranking, to keep the result focused.</p><p>There is no vector database and no Retrieval-Augmented Generation (RAG) pipeline. We use the filesystem as state and Markdown, YAML, and JSON as the wire format. 
If you already keep notes in Obsidian, articles in Readwise, or research in NotebookLM, this is for you.</p><p>By the end of this article, you will know exactly how it works, see it run on this very article, and have a blueprint to build your own.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WeAK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WeAK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!WeAK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!WeAK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!WeAK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WeAK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;From three scattered tools to a queryable research memory to a grounded article&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="From three scattered tools to a queryable research memory to a grounded article" title="From three scattered tools to a queryable research memory to a grounded article" srcset="https://substackcdn.com/image/fetch/$s_!WeAK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!WeAK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!WeAK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!WeAK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd980dfbc-f43d-41a9-85a7-fe27e6e20b61_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>Image 1: From three scattered tools to a queryable research memory to a grounded article. This is the end-to-end loop in one frame.</em></figcaption></figure></div><p>Here is the system at a glance. 
We will look at the three skills, three CLI adapters, and one memory folder, before we open the heaviest skill in the next section.</p><h2>Three Skills, Three CLIs, One Memory Folder</h2><p>The system relies on three distinct skills. First, <code>/research_create</code> builds a <code>memory/</code> folder from scratch for a given topic or brain dump. Second, <code>/research_search</code> handles the read side, letting any future agent query an existing <code>memory/</code> folder via <code>index.yaml</code> with progressive disclosure.</p><p>Third, <code>/research_distill</code> takes a finished piece of content and extracts only the sources that were actually used into a single portable <code>research.md</code> appendix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L-BX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L-BX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 424w, 
https://substackcdn.com/image/fetch/$s_!L-BX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 848w, https://substackcdn.com/image/fetch/$s_!L-BX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!L-BX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L-BX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png" width="1400" height="1154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1154,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1232783,&quot;alt&quot;:&quot;The system at a glance&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The system at a glance" title="The system at a glance" srcset="https://substackcdn.com/image/fetch/$s_!L-BX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 424w, 
https://substackcdn.com/image/fetch/$s_!L-BX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 848w, https://substackcdn.com/image/fetch/$s_!L-BX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!L-BX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d9468-d2e9-4b2a-8e18-dd56482d4a83_1400x1154.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: The system at a glance. Claude Code orchestrates three skills that wire CLI adapters into a single memory folder.</em></figcaption></figure></div><p>The <code>memory/</code> folder is built around <code>index.yaml</code>. It holds metadata per source, including <code>uri_highlights</code>, <code>uri_full</code>, <code>original_path</code>, and <code>origin</code>. The LLM reads the index first, then picks three to five relevant files based on summaries and reads those directly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!62sR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!62sR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 424w, https://substackcdn.com/image/fetch/$s_!62sR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 848w, https://substackcdn.com/image/fetch/$s_!62sR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 1272w, https://substackcdn.com/image/fetch/$s_!62sR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!62sR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png" width="1328" height="944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1328,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Memory Dir Screenshot&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memory Dir Screenshot" title="Memory Dir Screenshot" srcset="https://substackcdn.com/image/fetch/$s_!62sR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 424w, https://substackcdn.com/image/fetch/$s_!62sR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 848w, https://substackcdn.com/image/fetch/$s_!62sR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 1272w, https://substackcdn.com/image/fetch/$s_!62sR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb67fbab3-a24a-4b9d-95e1-98643cf616e6_1328x944.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 3: The </em><code>memory/</code><em> folder on disk &#8212; </em><code>index.yaml</code><em> alongside each source&#8217;s key-highlights and full-document files.</em></figcaption></figure></div><p>There are no embeddings, no chunking, and no vector store to maintain, ensuring references stay perfectly traceable. Like OpenClaw, we treat memory as plain Markdown in the agent workspace, where files are the source of truth and the model only remembers what gets written to disk <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">[1]</a>.</p><p>The Obsidian, Readwise, and NotebookLM files act as the raw, immutable data. 
We edit them by hand, never through this pipeline. On top of that raw layer, <code>/research_create</code> builds a local, actionable knowledge base for a specific scope: one ephemeral <code>memory/</code> folder per topic.</p><p>This separation lets the same raw data feed many different research projects without cross-contamination. The key invariant of this architecture is that the orchestrator never loads source files. Researcher subagents read the raw files, while the orchestrator only ever sees structured JSON summaries flowing between steps.</p><p>We chose CLIs over Model Context Protocol (MCP) servers for three reasons. First, token economics. A skill costs roughly 100 tokens of metadata in Claude Code&#8217;s context at boot, and its body loads only when invoked.</p><p>By comparison, Notion&#8217;s MCP server dumps roughly 20,000 tokens of self-documenting tools at startup whether you use them or not. The skill therefore takes roughly 200&#215; less context (20,000 versus 100 tokens) before you have done anything <a href="https://youtube.com/watch?v=vEvytl7wrGM">[2]</a>.</p><p>Second, CLIs compose with bash. The orchestrator can pipe results through tools like <code>jq</code> or redirect output straight to a file, whereas MCP tool calls must round-trip through the LLM.</p><p>Third, Markdown is the native language of LLMs. 
As Simon Willison argues, Markdown with YAML frontmatter is more in the spirit of LLMs than MCP, because you put text in the context and let the LLM pick <a href="https://read.readwise.io/read/01kh8p44e70a1273g7ykgx7h5y">[3]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QNsK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QNsK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 424w, https://substackcdn.com/image/fetch/$s_!QNsK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 848w, https://substackcdn.com/image/fetch/$s_!QNsK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!QNsK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QNsK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png" width="1400" height="1040" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1040,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1174644,&quot;alt&quot;:&quot;Token economics &#8212; MCP vs skill&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Token economics &#8212; MCP vs skill" title="Token economics &#8212; MCP vs skill" srcset="https://substackcdn.com/image/fetch/$s_!QNsK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 424w, https://substackcdn.com/image/fetch/$s_!QNsK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 848w, https://substackcdn.com/image/fetch/$s_!QNsK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!QNsK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F920270dc-457a-4cbd-97c4-7612d667565e_1400x1040.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 4: A skill enters context at ~100 tokens of metadata. An MCP server dumps ~20,000 tokens of self-documenting tools, whether you use them or not.</em></figcaption></figure></div><p>That is the whole architecture. Now let&#8217;s open up the heaviest of the three skills, <code>/research_create</code>, and walk through its multi-round research loop, where the orchestrator-never-loads invariant earns its keep.</p><h2>How <code>/research_create</code> Works</h2><p>The process starts with a brain dump from the user, which can include text, URLs, or local file paths. You then confirm three configuration knobs in a single prompt: the number of rounds, the number of queries per round, and a topic slug. 
Seed URIs from the brain dump always land in the output with a relevance score of 1.0, bypassing reranking because they are your explicit picks.</p><p>The orchestrator generates queries and dispatches one researcher subagent per query in parallel. Each researcher runs platform-specific searches. For Readwise, this means querying the library, feed, highlights, and document notes. For Obsidian, it means querying the local vault files. For NotebookLM, it means querying the projects and their associated sources and notes.</p><p>For Obsidian, we found that using its CLI, which leverages its index, is 10&#215; more efficient than letting the LLM roam around your vault.</p><p>Each researcher deduplicates its own findings by original path. It captures metadata while the files are open. It also caps its output at the top 15 unique findings.</p><p>Between rounds, a <code>gap_analyzer</code> subagent inspects the deduplicated findings with <code>jq</code> instead of reading them in full. It flags thin or missing themes against the initial key themes and emits the next round&#8217;s queries. After all rounds, a <code>reranker</code> subagent scores every candidate between 0.0 and 1.0 using the cheapest sufficient signal. 
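</p><p>To make the researcher&#8217;s bookkeeping concrete, here is a minimal Python sketch of the deduplicate-then-cap step. It is an illustration, not the pipeline&#8217;s actual code: the <code>Finding</code> fields, the provisional <code>score</code>, and the <code>dedupe_and_cap</code> helper are hypothetical names. Only the dedup key (original path) and the cap of 15 come from the description above.</p><pre><code>from dataclasses import dataclass

TOP_K = 15  # cap on unique findings per researcher, as described above


@dataclass(frozen=True)
class Finding:
    original_path: str  # dedup key: where the source originally lives
    title: str
    score: float  # hypothetical provisional score set by the researcher


def dedupe_and_cap(findings, top_k=TOP_K):
    """Keep the best-scoring finding per original path, then keep the top_k."""
    ranked = sorted(findings, key=lambda f: f.score, reverse=True)
    seen = set()
    unique = []
    for finding in ranked:
        if finding.original_path in seen:
            continue  # a stronger copy of this path was already kept
        seen.add(finding.original_path)
        unique.append(finding)
    return unique[:top_k]


# Hypothetical findings: two searches surfaced the same vault note.
hits = [
    Finding("vault/agents.md", "Agents (search 1)", 0.4),
    Finding("vault/agents.md", "Agents (search 2)", 0.8),
    Finding("readwise/skills", "Skills vs MCP", 0.6),
]
print([f.title for f in dedupe_and_cap(hits)])
</code></pre><p>Sorting before deduplicating means the strongest copy of each path is the one that survives. The <code>reranker</code> then works through whatever the researchers hand back. 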
It checks metadata first, then reads the head and tail of the doc, and uses full reads only as a last resort.</p><p>Finally, a <code>builder</code> subagent invokes a Python script to emit the YAML deterministically, placing seeds first, then descending by score.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8chM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8chM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!8chM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!8chM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!8chM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8chM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The full /research_create pipeline&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The full /research_create pipeline" title="The full /research_create pipeline" srcset="https://substackcdn.com/image/fetch/$s_!8chM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!8chM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!8chM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!8chM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb24424a5-ab62-44e5-9458-dc2fdb92079a_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 5: The full </em><code>/research_create</code><em> pipeline. The orchestrator schedules. Subagents do the heavy reads.</em></figcaption></figure></div><p>We use this shape because context isolation is our central design choice. Every step that touches real source content runs in an isolated subagent with its own context window. The orchestrator sees only the compacted metadata of each file and moves the file itself into the <code>memory/</code> folder with a bash <code>mv</code> command.</p><p>The <code>index.yaml</code> file holds pointers and metadata for every file in the wiki. The orchestrator holds pointers, while subagents hold content. 
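</p><p>A rough Python sketch of that split, assuming nothing about the real skill beyond what is described here: one worker per query runs in parallel, and the orchestrator only ever parses the compact JSON each worker returns. <code>run_researcher</code> is a hypothetical stand-in for a subagent, and the summary fields are invented for illustration.</p><pre><code>import json
from concurrent.futures import ThreadPoolExecutor


def run_researcher(query):
    """Hypothetical stand-in for a researcher subagent.

    A real subagent would read source files inside its own context window.
    This stub only returns the compact JSON summary the orchestrator sees.
    """
    slug = query.replace(" ", "-")
    summary = {
        "query": query,
        "findings": [{"path": f"vault/{slug}.md", "summary": f"Notes on {query}"}],
    }
    return json.dumps(summary)


def orchestrate(queries):
    """Dispatch one researcher per query in parallel; collect summaries only."""
    workers = max(1, len(queries))  # the executor needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        raw = list(pool.map(run_researcher, queries))  # input order is kept
    return [json.loads(item) for item in raw]


for result in orchestrate(["agent memory", "token budgets"]):
    print(result["query"], result["findings"][0]["path"])
</code></pre><p>The orchestrator never opens a source file in this sketch. It only forwards pointers such as <code>path</code> to later steps.</p><p>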
Geoffrey Huntley, creator of Ralph Loops, states that your primary context window should operate as a scheduler, scheduling other subagents to perform expensive allocation-type work <a href="https://ghuntley.com/ralph/">[4]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kmhw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kmhw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 424w, https://substackcdn.com/image/fetch/$s_!kmhw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 848w, https://substackcdn.com/image/fetch/$s_!kmhw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 1272w, https://substackcdn.com/image/fetch/$s_!kmhw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kmhw!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png" width="1200" height="801.0695187165776" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:749,&quot;width&quot;:1122,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:174291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/194537609?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kmhw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 424w, https://substackcdn.com/image/fetch/$s_!kmhw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 848w, https://substackcdn.com/image/fetch/$s_!kmhw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 1272w, https://substackcdn.com/image/fetch/$s_!kmhw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a9df2-03c9-4289-b05b-5b8778885174_1122x749.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 6: The top of an </em><code>index.yaml</code><em> file, showing the topic, the input summary, and the first few source entries with their summaries and relevance scores.</em></figcaption></figure></div><p>Subagents compress tens of thousands of input tokens into 1,000&#8211;2,000 output tokens before handing control back to the orchestrator. That compression ratio is the whole point. The researcher subagents read deeply, and the orchestrator stays light.</p><p>Once the <code>memory/</code> folder exists, any agent can query it without loading source files. We use the <code>/research_search</code> skill to query this index.</p><h2>How <code>/research_search</code> Works</h2><p>The <code>/research_search</code> skill handles the read side of the system. 
Any agent can be handed a <code>memory/</code> folder and query it without loading source files into context. The skill encodes the protocol once so future agents do not have to re-derive it.</p><p>The system uses three layers of detail. Layer 1 is the <code>summary</code> field in <code>index.yaml</code>. It contains two to three sentences per source and is always loaded as part of the index. It is enough to answer what you have on a topic or to build a table of contents.</p><p>Layer 2 is the key-highlights file, which holds the condensed key points of a source. This layer is most valuable with reader tools such as Readwise, because you, the reader, made those highlights by hand, so they carry a very high signal. For the same reason, not every source has this layer: it is better to have no highlights at all than to have an LLM extract them.</p><p>Layer 3 is the <code>uri_full</code> file, the complete original document. You read it only when the key highlights are insufficient or missing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iqIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iqIL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 424w, https://substackcdn.com/image/fetch/$s_!iqIL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 848w, 
https://substackcdn.com/image/fetch/$s_!iqIL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 1272w, https://substackcdn.com/image/fetch/$s_!iqIL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iqIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png" width="1400" height="1255" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1255,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1230683,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/194537609?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iqIL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 424w, 
https://substackcdn.com/image/fetch/$s_!iqIL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 848w, https://substackcdn.com/image/fetch/$s_!iqIL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 1272w, https://substackcdn.com/image/fetch/$s_!iqIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe08b1e99-8f5f-4a88-a373-fbae8cef9811_1400x1255.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 7: Three layers of detail per source. The agent stays at Layer 1 unless it has a reason to descend.</em></figcaption></figure></div><p>Anthropic notes that models are great at navigating filesystems, and that presenting tools as code on a filesystem lets models read tool definitions on demand rather than all up front <a href="https://www.anthropic.com/engineering/building-more-efficient-ai-agents">[5]</a>. That maps directly onto <code>index.yaml</code> plus lazy loading of key highlights.</p><p>Intuitively, the <code>index.yaml</code> file gives us progressive disclosure, the same pattern used inside skills, so the agent can choose from many options without drowning in information <a href="https://newsletter.swirlai.com/p/agent-skills-progressive-disclosure">[6]</a>.</p><p>The agent slices <code>index.yaml</code> by origin, location, relevance-score threshold, tags, author, publication, date range, or NotebookLM notebook.</p><p>The most beautiful part? Because <code>index.yaml</code> is structured data, the agent can write code on top of it. 
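</p><p>As a hedged illustration of that idea, the Python sketch below filters and sorts a few index entries and then picks the next file to open. The entries are hard-coded so the example runs on the standard library alone; in practice they would be parsed from <code>index.yaml</code> with a YAML library. Apart from <code>summary</code> and <code>uri_full</code>, the field names (<code>origin</code>, <code>relevance_score</code>, <code>uri_key_highlights</code>) and the file paths are assumptions.</p><pre><code># Hypothetical entries shaped like rows of index.yaml.
ENTRIES = [
    {"title": "Skills vs MCP", "origin": "readwise", "relevance_score": 0.92,
     "summary": "Compares the context cost of skills and MCP servers.",
     "uri_key_highlights": "memory/skills-vs-mcp.highlights.md",
     "uri_full": "memory/skills-vs-mcp.full.md"},
    {"title": "Vault note on agents", "origin": "obsidian", "relevance_score": 0.61,
     "summary": "Personal notes on agent memory.",
     "uri_key_highlights": None,
     "uri_full": "memory/agents-note.full.md"},
    {"title": "Old feed item", "origin": "readwise", "relevance_score": 0.35,
     "summary": "Loosely related feed article.",
     "uri_key_highlights": None,
     "uri_full": "memory/old-feed-item.full.md"},
]


def slice_index(entries, origin=None, min_score=0.0):
    """Filter by origin and score threshold, then sort by descending score."""
    selected = [
        entry for entry in entries
        if (origin is None or entry["origin"] == origin)
        and entry["relevance_score"] >= min_score
    ]
    return sorted(selected, key=lambda entry: entry["relevance_score"], reverse=True)


def next_file_to_read(entry):
    """Descend one layer: prefer key highlights, else fall back to the full file."""
    return entry.get("uri_key_highlights") or entry["uri_full"]


for entry in slice_index(ENTRIES, min_score=0.5):
    print(entry["title"], next_file_to_read(entry))
</code></pre><p>The summaries stay in context the whole time, while the larger files are opened only for entries that pass the filter. The same slice could be written in other tools, and the agent picks whichever fits the question. 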
It uses <code>jq</code> filters, Python sorts, and <code>awk</code> projections.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KCNN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KCNN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 424w, https://substackcdn.com/image/fetch/$s_!KCNN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 848w, https://substackcdn.com/image/fetch/$s_!KCNN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 1272w, https://substackcdn.com/image/fetch/$s_!KCNN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KCNN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png" width="1015" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1015,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Index Single Sample Screenshot&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Index Single Sample Screenshot" title="Index Single Sample Screenshot" srcset="https://substackcdn.com/image/fetch/$s_!KCNN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 424w, https://substackcdn.com/image/fetch/$s_!KCNN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 848w, https://substackcdn.com/image/fetch/$s_!KCNN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 1272w, https://substackcdn.com/image/fetch/$s_!KCNN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e53bda1-434e-4d87-a916-8ba090446067_1015x412.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 8: A single source entry in </em><code>index.yaml</code><em>, showing its origin, authors, relevance score, and the URIs that power progressive disclosure.</em></figcaption></figure></div><p>LlamaIndex&#8217;s head-to-head benchmark supports this approach at small scale. A filesystem-explorer agent beat a hybrid vector RAG pipeline on correctness (8.4 vs 6.4) and relevance (9.6 vs 8.0) on a corpus of fewer than 60 documents, precisely because the LLM saw whole files instead of chunks <a href="https://www.llamaindex.ai/blog/did-filesystem-tools-kill-vector-search">[7]</a>.</p><p>Portability comes for free. Hand the self-contained <code>memory/</code> folder to any agent, and it gets up to speed immediately. Search lets agents find what is there. But once you have drafted an article, you also need to know which parts of the wiki you actually used in your piece. 
That is <code>/research_distill</code>.</p><h2>How <code>/research_distill</code> Works</h2><p>Given any piece of content and the <code>memory/</code> folder used during writing, the skill walks every source in <code>index.yaml</code>. For each source, it decides whether the content actually used it by checking for explicit references or traceable ideas. The process is conservative by default. It is better to miss a borderline source than to include one that was not actually used.</p><p>The output is a single <code>research.md</code> file. It is fully self-contained, meaning you never need to go back to the <code>memory/</code> folder again. For this very article, <code>/research_distill</code> should match around 15 to 20 of the 62 sources in the <code>memory/</code> folder.</p><p>This matters because downstream generation loops re-load the research on each iteration. For example, within the evaluator-optimizer pattern, the system generates, critiques, and revises <a href="https://www.anthropic.com/engineering/building-effective-agents">[8]</a>. Keeping the anchor research small is the difference between an article that stays grounded and one that starts hallucinating.</p><p>As I explained in my article on <a href="https://www.decodingai.com/p/your-rag-pipeline-is-overkill-rlms">Recursive Language Models (RLMs)</a>, when the corpus fits in context with progressive disclosure, fancy retrieval is overkill <a href="https://www.decodingai.com/p/your-rag-pipeline-is-overkill-rlms">[9]</a>.</p><h2>What&#8217;s Next</h2><p>For personal-scale research involving hundreds of sources, a well-structured <code>memory/</code> folder with an <code>index.yaml</code> beats a RAG pipeline on the axes that matter at this scale. 
It gives you full lineage back to source URLs, portability to pass the folder to any agent, and lower costs with no embedding model or vector store.</p><p>To further optimize the system, making it more context-efficient, I am considering moving the deduplication and re-ranking fully into Python scripts, adding a local cross-encoder reranker to avoid LLM calls for scoring, and extending the researcher with tag-aware filtering.</p><p><em>But here is what I&#8217;m wondering:</em></p><p><strong>What data source in your work makes you most want a private deep research agent? Is it your Obsidian vault, your Readwise library, a code repository, or your team&#8217;s shared documents?</strong></p><p><em>Click the button below and tell me. I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/llm-knowledge-base-obsidian-readwise-notebooklm/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/llm-knowledge-base-obsidian-readwise-notebooklm/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to restack this for your readers.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/llm-knowledge-base-obsidian-readwise-notebooklm?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/llm-knowledge-base-obsidian-readwise-notebooklm?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><div class="callout-block" data-callout="true"><h4>Whenever you&#8217;re ready, here is how I can help you</h4><p>If you want to go from zero to shipping production-grade AI agents, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p><em>Rated 5/5 by 300+ students. 
The first 6 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><p><em>Not ready to commit?</em> Start with our <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free Agentic AI Engineering Guide</a></strong>, a 6-day email course on the mistakes that silently break AI agents in production.</p></div><div><hr></div><h2>References</h2><ol><li><p>Govindarajan, V. (2026). OpenClaw Architecture Part 3 - Memory and State Ownership. The Agent Stack. <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory</a></p></li><li><p>Talebi, S. (2026). Claude Skills Explained in 23 Minutes. YouTube. <a href="https://youtube.com/watch?v=vEvytl7wrGM">https://youtube.com/watch?v=vEvytl7wrGM</a></p></li><li><p>Bowne-Anderson, H. (n.d.). Episode 70: 1,400 Production AI Deployments. Vanishing Gradients Podcast. <a href="https://read.readwise.io/read/01kh8p44e70a1273g7ykgx7h5y">https://read.readwise.io/read/01kh8p44e70a1273g7ykgx7h5y</a></p></li><li><p>Huntley, G. (n.d.). Ralph Wiggum as a &#8220;software engineer&#8221;. ghuntley.com. <a href="https://ghuntley.com/ralph/">https://ghuntley.com/ralph/</a></p></li><li><p>Anthropic. (n.d.). Building More Efficient AI Agents. Anthropic Blog. 
<a href="https://www.anthropic.com/engineering/building-more-efficient-ai-agents">https://www.anthropic.com/engineering/building-more-efficient-ai-agents</a></p></li><li><p>Grici&#363;nas, A. (n.d.). Agent Skills: Progressive Disclosure as a System Design Pattern. SwirlAI Newsletter. <a href="https://newsletter.swirlai.com/p/agent-skills-progressive-disclosure">https://newsletter.swirlai.com/p/agent-skills-progressive-disclosure</a></p></li><li><p>LlamaIndex. (n.d.). Did Filesystem Tools Kill Vector Search? LlamaIndex Blog. <a href="https://www.llamaindex.ai/blog/did-filesystem-tools-kill-vector-search">https://www.llamaindex.ai/blog/did-filesystem-tools-kill-vector-search</a></p></li><li><p>Anthropic. (2025). Building Effective AI Agents. 
Anthropic Blog. <a href="https://www.anthropic.com/engineering/building-effective-agents">https://www.anthropic.com/engineering/building-effective-agents</a></p></li><li><p>Iusztin, P. (n.d.). Your RAG Pipeline Is Overkill (RLMs). Decoding AI Magazine. <a href="https://www.decodingai.com/p/your-rag-pipeline-is-overkill-rlms">https://www.decodingai.com/p/your-rag-pipeline-is-overkill-rlms</a></p></li></ol><div><hr></div><h2>Images</h2><p>Unless otherwise stated, all images were created by the author.</p>]]></content:encoded></item><item><title><![CDATA[How to Ship a Weekly Article in One Day]]></title><description><![CDATA[Inside the agentic AI workflow behind my weekly newsletter, course, and content]]></description><link>https://www.decodingai.com/p/how-i-automated-91-percent-of-my-business</link><guid isPermaLink="false">https://www.decodingai.com/p/how-i-automated-91-percent-of-my-business</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Wed, 15 Apr 2026 14:10:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_tFE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I publish one in-depth technical article every week on Decoding AI. That cadence sounds simple until you live it. The article itself eats up the time I should spend digging into the Claude Code leak to understand how it works under the hood.</p><p>I am a builder first, not a writer. Most weeks, the writing strangles the building. This is the exact trap most weekly writers fall into.</p><p>When the writing eats the week, the next article has nothing real underneath it. So writers fill the gap with generalities, surface-level takes, or invented examples that add noise to an already noisy internet.</p><p>The default fix most people reach for is to let AI write it. That fails for the opposite reason. 
When you put zero thought into the process, AI just industrializes the noise.</p><p>The whole point of writing is to share something you actually thought through, built, and learned. If AI writes for you, you publish nothing of value. If you write everything by hand, you don&#8217;t have enough time to build something worth publishing.</p><p>Both ends starve the loop that actually feeds the business: research, build, and teach.</p><p>What I built instead is an agentic AI workflow that automates ~90% of the manual writing pipeline while keeping me as the irreplaceable seed at the top. I provide the research direction and the brain dump that reflects my personal experience.</p><p>AI handles distribution speed. I handle thought, taste, and direction. By the end of this article, you will see exactly how the system works.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_tFE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_tFE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 424w, https://substackcdn.com/image/fetch/$s_!_tFE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 848w, https://substackcdn.com/image/fetch/$s_!_tFE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_tFE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_tFE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png" width="1400" height="353" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:353,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:331639,&quot;alt&quot;:&quot;The full pipeline at a glance. Human seed on the left, automated components in the middle, published article on the right.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The full pipeline at a glance. Human seed on the left, automated components in the middle, published article on the right." title="The full pipeline at a glance. Human seed on the left, automated components in the middle, published article on the right." 
srcset="https://substackcdn.com/image/fetch/$s_!_tFE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 424w, https://substackcdn.com/image/fetch/$s_!_tFE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 848w, https://substackcdn.com/image/fetch/$s_!_tFE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 1272w, https://substackcdn.com/image/fetch/$s_!_tFE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd5e5540-20a1-4d9d-9c5e-edb32202cd5d_1400x353.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 1: The full pipeline at a glance. Human seed on the left, automated components in the middle, published article on the right.</em></figcaption></figure></div><p>We will cover my deep research agent, the writing workflow with its evaluator-optimizer loop, the image style-transfer step, and the title &amp; SEO generator. Most importantly, you will learn where the human in the loop is irreplaceable.</p><h2>My Workflow: What Stays Human, What Gets Automated</h2><p>Before showing any architecture, I want to walk you through the manual workflow exactly as I used to run it. This is the boring, honest version. This is what every weekly technical writer secretly does, even if they pretend otherwise.</p><p>I used to research the topic for hours or days while taking notes. Then, I would write a high-level outline of the piece. Next, I would sketch the first high-level diagram to help me visualize the narrative of the piece.</p><p>I expanded each outline section into bullet points, creating what I call the article guideline. After that, I wrote the article, edited it, and created the rest of the visuals. 
Finally, I wrote the title and SEO and copy-pasted everything into Substack.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zPY6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zPY6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 424w, https://substackcdn.com/image/fetch/$s_!zPY6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 848w, https://substackcdn.com/image/fetch/$s_!zPY6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 1272w, https://substackcdn.com/image/fetch/$s_!zPY6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zPY6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png" width="1400" height="274" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:407709,&quot;alt&quot;:&quot;The nine-step workflow. Research and outline stay human; the rest gets automated with validation gates on the load-bearing steps.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The nine-step workflow. Research and outline stay human; the rest gets automated with validation gates on the load-bearing steps." title="The nine-step workflow. Research and outline stay human; the rest gets automated with validation gates on the load-bearing steps." 
srcset="https://substackcdn.com/image/fetch/$s_!zPY6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 424w, https://substackcdn.com/image/fetch/$s_!zPY6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 848w, https://substackcdn.com/image/fetch/$s_!zPY6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 1272w, https://substackcdn.com/image/fetch/$s_!zPY6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c6d8c2a-8192-440a-93c2-41c8b8c0d52f_1400x274.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Image 2: The nine-step workflow. Research and outline stay human; the rest gets automated with validation gates on the load-bearing steps.</em></figcaption></figure></div><p>Now, everything gets automated except two things. First, I still do an in-depth round of research to understand the topic and collect a few high-quality golden sources. This is the fun part. Most of these sources come from my Readwise reading list, a curated library I built over time while browsing Substack, YouTube, LinkedIn, X, and more. Then, I pass them as a seed to my deep research agent, which expands them and fills any potential gaps.</p><p>Second, while researching, I do a brain dump of everything I consider relevant on the topic. After wrapping up the research, I refactor the brain dump into an outline that follows an engaging narrative. 
Then, I do a combination of manual and Claude Code work to expand it with bullet points, creating the article guideline.</p><p>Together, those two steps are the seed that makes everything downstream mine. Without them, the pipeline produces generic AI mush.</p><p>Before the automated pipeline existed, a 3,000-word article like one of my latest pieces, <a href="https://www.decodingai.com/p/agentic-harness-engineering">Agentic Harness Engineering</a>, would eat two to three days of my week running this exact nine-step grind by hand. Now, the same piece takes about a day.</p><h3>Why this works</h3><p>Writing prose is a translation step. It turns thoughts into words on a page or boxes in a diagram. Translation is exactly the kind of work LLMs excel at, if you already did the thinking.</p><p>If you haven&#8217;t, no amount of agent orchestration saves you. AI as a writing tool fails when you put zero thought into your process. It becomes a force multiplier when you use it to distribute your thoughts.</p><p>Now, let&#8217;s look at how the actual system works.</p><h2>Understanding The System Architecture</h2><p>The architecture has five big components plus a memory layer. 
The contract between them is the set of artifacts each one writes to disk: the research Markdown file, the article guideline, the final article, branded image PNGs, and the final HTML bundle.</p><p>Here are the five components at a glance:</p><ol><li><p><strong>Deep Research agent (we call it Nova)</strong>: Takes a topic and golden sources, returning a ranked, structured research file.</p></li><li><p><strong>Writing Workflow (Brown)</strong>: Takes the article guideline and research, returning the full styled article via an evaluator-optimizer loop.</p></li><li><p><strong>Media style transfer</strong>: Takes the raw Mermaid diagrams in the article and applies the Decoding AI brand style to them.</p></li><li><p><strong>Title and SEO generator</strong>: Runs an expand-and-narrow loop to produce the title, subtitle, SEO title, and SEO description.</p></li><li><p><strong>HTML exporter</strong>: Converts the final Markdown into platform-ready HTML for Substack, Medium, X, or LinkedIn, so I can copy-paste the piece directly.</p></li></ol><p>In other words, the handoff contract between components is the filesystem. Each stage reads and writes plain files in a working directory. Internal per-component state lives in databases: PostgreSQL for Nova, and an SQLite checkpointer for Brown.</p><p>The artifacts make the pipeline debuggable, resumable, and human-in-the-loop friendly across stages. The databases make each stage individually resumable mid-run. 
For example, if the writing workflow fails after generating the first draft, we can easily resume without having to spend tokens on rerunning from scratch.</p><p>Also, because everything is managed through files, I can open any artifact at any key step, inspect it, edit it, and re-run downstream.</p><p>We will show you how we used this system to write one of our latest popular pieces: <a href="https://www.decodingai.com/p/agentic-harness-engineering">Agentic Harness Engineering</a>.</p><p>I&#8217;ve also used the same process to research and write professional lessons for other educational projects, such as our latest Agentic AI Engineering course, as the pipeline adapts to any type of educational business.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Mog!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Mog!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 424w, https://substackcdn.com/image/fetch/$s_!9Mog!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 848w, https://substackcdn.com/image/fetch/$s_!9Mog!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9Mog!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Mog!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png" width="1200" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The full data flow. Human-seeded research at the left, evaluator-optimizer writing in the middle, branded media and SEO on the right, finished HTML at the terminus.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The full data flow. Human-seeded research at the left, evaluator-optimizer writing in the middle, branded media and SEO on the right, finished HTML at the terminus." title="The full data flow. Human-seeded research at the left, evaluator-optimizer writing in the middle, branded media and SEO on the right, finished HTML at the terminus." 
srcset="https://substackcdn.com/image/fetch/$s_!9Mog!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 424w, https://substackcdn.com/image/fetch/$s_!9Mog!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 848w, https://substackcdn.com/image/fetch/$s_!9Mog!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 1272w, https://substackcdn.com/image/fetch/$s_!9Mog!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c5e383a-655e-45e9-b56d-1535601cbc20_1200x814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: End-to-end system architecture: Human-seeded research on the left, evaluator-optimizer writing in the middle, branded media and SEO on the right, finished HTML at the terminus.</em></figcaption></figure></div><p>Each component is explained in depth in the sections below. Two are MCP servers (Nova and Brown) and three are skills (media style transfer, title &amp; SEO, HTML export).</p><p>In terms of concrete economics, the pipeline costs roughly $0.30 to $1 per image, mostly in Gemini credits, with the remaining steps costing cents. This article, with 9 images, landed closer to $6, while a leaner piece with a single diagram sits around $1.</p><div class="callout-block" data-callout="true"><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Build This Exact Stack Yourself (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NL_4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NL_4!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 424w, 
https://substackcdn.com/image/fetch/$s_!NL_4!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!NL_4!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!NL_4!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NL_4!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;placeholder&quot;,&quot;title&quot;:&quot;placeholder&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="placeholder" title="placeholder" srcset="https://substackcdn.com/image/fetch/$s_!NL_4!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 424w, 
https://substackcdn.com/image/fetch/$s_!NL_4!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!NL_4!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!NL_4!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6823b5d3-72be-461a-9dfe-78be888cd22b_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Reading about a pipeline is one thing. Building one is another. In my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering Course</a></strong>, built with Towards AI, I walk you through this exact stack from scratch.</p><p>Nova&#8217;s deep research loop, Brown&#8217;s evaluator-optimizer built on LangGraph, both served via FastMCP, plus the style-transfer skill, evaluation with Opik, and deployment on Docker, GCP, and GitHub Actions.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p>Rated 5/5 by 300+ students. The first 6 lessons are free:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p></div><h2>Walkthrough: The Artifacts of One Article</h2><p>Before diving into each component, let&#8217;s take a look at the input and output artifacts the pipeline produced while generating the <a href="https://www.decodingai.com/p/agentic-harness-engineering">Agentic Harness Engineering</a> article. 
Here are some trimmed versions of each, as they get pretty large.</p><h4>outline.md: the hand-written seed, Nova&#8217;s input (88 lines)</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">## Outline
1. Introduction - Why Do We Need a Harness?
&#9;1. Personal story: To be researched
&#9;2. Problem + Agitation: ...
&#9;3. Transformation + Solution: ...
&#9;4. Intuitively, Mitchell Hashimoto has the best definition of a harness: "the idea that anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
&#9;5. 200 words
2. What the Hell Is A Harness?
    ...
3. How does a Harness Look?
&#9;1. Key components: LLM, tools, planning loop, context engineering, sandbox, memory, orchestration layer, serving layer, interfaces
&#9;2. The agent loop: Powered by planning techniques like ReAct
    ...
4. Planning &amp; Orchestration
    ...
&#9;4. 200 words
5. Key Tools
    ...
6. Sandbox Environment
    ...
7. Memory
    ...
8. Conclusion - The Future of Harness
    ...

# Resources

1. [My AI Adoption Journey](https://mitchellh.com/writing/my-ai-adoption-journey)
2. ...</code></pre></div><p>The seed is deliberately rough. It contains placeholders like &#8220;To be researched&#8221;, section blocks that will later be restructured, and hand-picked golden sources that anchor Nova&#8217;s first round of research. The idea is to dump ideas without thinking too much about structure while you are in your creative mindset.</p><h4>research.md: Nova&#8217;s output (1377 lines)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SbxN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SbxN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 424w, https://substackcdn.com/image/fetch/$s_!SbxN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 848w, https://substackcdn.com/image/fetch/$s_!SbxN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 1272w, https://substackcdn.com/image/fetch/$s_!SbxN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SbxN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png" width="1416" height="1364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1364,&quot;width&quot;:1416,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A trimmed view of Nova's collapsible-HTML research.md output.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A trimmed view of Nova's collapsible-HTML research.md output." title="A trimmed view of Nova's collapsible-HTML research.md output." 
srcset="https://substackcdn.com/image/fetch/$s_!SbxN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 424w, https://substackcdn.com/image/fetch/$s_!SbxN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 848w, https://substackcdn.com/image/fetch/$s_!SbxN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 1272w, https://substackcdn.com/image/fetch/$s_!SbxN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49e76f9-5e97-45d1-9551-bf8757309256_1416x1364.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: A trimmed view of Nova&#8217;s collapsible-HTML research.md output.</em></figcaption></figure></div><h4>article_guideline.md: Expanded outline, Brown&#8217;s input (201 lines, 8 sections).</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">## What We Are Planning to Share

...

## Why We Think It's Valuable

...

## Point of View

I write the article, Paul Iusztin. I am part of a bigger team known as Decoding AI....

----

## Article Outline

1. Why Do We Need a Harness?
2. What the Hell Is a Harness?
3. The Anatomy of a Harness
4. How the Agent Decides What to Do Next
5. The Tools That Let Agents Act
6. Where Agents Run
7. Memory Is Just the Filesystem
8. What's Next

## Section 1 - Why Do We Need a Harness?

...

## Section 2 - What the Hell Is a Harness?

...

- **Hook:** Start with the horse analogy. A horse is powerful on its own, but useless for farming without a harness &#8212; the straps, reins, and attachments that let you direct its strength toward useful work (inspiration from Jonathan Gimick from Manning). Same with LLMs: the model has the intelligence, but without tools, memory, state, guardrails, and orchestration, you can't put it to work reliably.
- **The clean definition:** LangChain's formulation is the clearest &#8212; **Agent = Model + Harness**. The harness is "every piece of code, configuration, and execution logic that isn't the model itself." The model provides intelligence. The harness makes that intelligence useful.
...

[GENERATE_DIAGRAM] Three levels of engineering: prompt, context, and harness engineering.
...

- **Transition:** Now that you know what a harness is, let's look at all its components and how they fit together at a high level &#8212; before diving deeper into each one.

- **Section length:** 300 words

## Section 3 - The Anatomy of a Harness

...

## Section 8 - What's Next

...</code></pre></div><p>The article guideline is deliberately as structured and detailed as possible. The idea is to give each section enough detail, or even more than enough, to fill the requested word budget, so the LLM doesn&#8217;t fill gaps with generalities or, worse, with hallucinations.</p><h4>article.md: Brown&#8217;s final prose (~3,000 published words, 8 sections)</h4><p>See the <a href="https://www.decodingai.com/p/agentic-harness-engineering">Agentic Harness Engineering</a> article we posted a few weeks ago on Substack.</p><p>Now let&#8217;s see how Nova, the deep research agent, turns the outline and its golden sources into a structured research file.</p><h2>Deep Research: How Nova Builds the Knowledge Base</h2><p>Nova is an MCP server exposing ten specialized tools, orchestrated by the client, which is often a harness such as Claude Code or Cursor.</p><p>Here is how the overall deep research architecture works:</p><ol><li><p><strong>Query generation loop:</strong> Nova takes the topic and golden sources, runs gap analysis between the outline and the provided sources with Gemini Pro, and generates the next round of research queries based on what is missing. Three rounds hit the cost-versus-coverage sweet spot.</p></li><li><p><strong>Concurrent retrieval:</strong> Each round fans out concurrent Perplexity calls that return only metadata and a summary of each new source.</p></li><li><p><strong>Two-stage filtering:</strong> We full-scrape only the top five sources, ranked by a four-dimensional rubric that scores trustworthiness, authority, relevance, and quality. 
For the rest of the sources we keep only the summary, which is enough for providing examples such as <code>Anthropic is implementing compaction in Claude Code</code>.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ckWR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ckWR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 424w, https://substackcdn.com/image/fetch/$s_!ckWR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 848w, https://substackcdn.com/image/fetch/$s_!ckWR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 1272w, https://substackcdn.com/image/fetch/$s_!ckWR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ckWR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png" width="1400" height="698" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:841138,&quot;alt&quot;:&quot;Nova's deep research loop. Three rounds of gap-driven Perplexity queries, a two-stage filter, and source-specific ingestion produce the structured research file.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Nova's deep research loop. Three rounds of gap-driven Perplexity queries, a two-stage filter, and source-specific ingestion produce the structured research file." title="Nova's deep research loop. Three rounds of gap-driven Perplexity queries, a two-stage filter, and source-specific ingestion produce the structured research file." 
srcset="https://substackcdn.com/image/fetch/$s_!ckWR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 424w, https://substackcdn.com/image/fetch/$s_!ckWR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 848w, https://substackcdn.com/image/fetch/$s_!ckWR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 1272w, https://substackcdn.com/image/fetch/$s_!ckWR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb5adf8-5280-4597-a537-70fb43ac542d_1400x698.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 5: Nova&#8217;s deep research loop. Three rounds of gap-driven Perplexity queries, a two-stage filter, and source-specific ingestion produce the structured research file.</em></figcaption></figure></div><p>Nova ships one purpose-built tool per source family. We scrape web URLs with Firecrawl and ingest GitHub repos through gitingest. For YouTube videos, we pass the URL directly to Gemini Pro, with no local download.</p><p>For example, this is how I used Nova when writing my harness article. I started with a vague topic about what an agent harness is and why it matters. I handed Nova a seed set of golden-source URLs inside the guideline, including the <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">LangChain harness post</a>, the <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">Anthropic long-running agents piece</a>, and <a href="https://mitchellh.com/writing/my-ai-adoption-journey">Mitchell Hashimoto&#8217;s AI adoption journey</a>. Nova extracted these URLs, scraped each one, and wrote the cleaned content into its working memory.</p><p>Nova then ran the three-round gap-analysis loop, fanning out concurrent Perplexity queries aimed at topics the seed sources had not covered. Every raw result was appended to the log. 
Ultimately, each source is filtered using a set of heuristics and LLMs to ensure we keep only high-quality results.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GkHV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GkHV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 424w, https://substackcdn.com/image/fetch/$s_!GkHV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 848w, https://substackcdn.com/image/fetch/$s_!GkHV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 1272w, https://substackcdn.com/image/fetch/$s_!GkHV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GkHV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png" width="1408" height="752" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Full Nova system architecture &#8212; MCP tools, Postgres state, two-stage filter, source-specific ingestion.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Full Nova system architecture &#8212; MCP tools, Postgres state, two-stage filter, source-specific ingestion." title="Full Nova system architecture &#8212; MCP tools, Postgres state, two-stage filter, source-specific ingestion." srcset="https://substackcdn.com/image/fetch/$s_!GkHV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 424w, https://substackcdn.com/image/fetch/$s_!GkHV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 848w, https://substackcdn.com/image/fetch/$s_!GkHV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 1272w, https://substackcdn.com/image/fetch/$s_!GkHV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5550d9-3163-47db-8fa2-2376f0c74f82_1408x752.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 6: Full Nova system architecture &#8212; MCP tools, Postgres state, two-stage filter, and source-specific ingestion.</em></figcaption></figure></div><p>Finally, Nova compiles everything into the collapsible HTML research file.</p><p>The client orchestrates Nova&#8217;s MCP tools through a skill that glues the ingestion, search, and utility tools into a single deep research algorithm, which takes the outline.md file as input and outputs research.md.</p><h2>Writing Workflow: How Brown Turns an Idea Into an Article</h2><p>Brown picks up where Nova left off. Brown is a workflow, not an agent, implemented with LangGraph. 
We chose a workflow over an agent deliberately: prose generation rewards predictability over exploration.</p><p>First, we generate all the required Mermaid diagrams for the article using the orchestrator-worker pattern: an orchestrator scans the article guideline for diagram requests and spins up specialized Mermaid-diagram agents to handle them. These requests are usually flagged explicitly within the article guideline with phrases such as &#8220;generate diagram&#8221;, &#8220;create a diagram&#8221;, or [GENERATE_DIAGRAM]. Next, these diagrams are passed downstream through the generation process. We&#8217;ll come back to how they get styled in the Generating Branded Images section. The orchestrator-worker pattern can easily be extended to generate other types of media such as images, videos, or audio.</p><p>Next, we control Brown&#8217;s voice via the system prompt through three main tricks.</p><p>The <strong>first one</strong> is to define a set of six profile classes, each targeting a different family of rules. There are four generic profiles, which are static and agnostic to who is using the tool and what they are doing:</p><ol><li><p><strong>Structure Profile:</strong> How the prose is physically laid out on the page, such as sentence, paragraph, list, and subheading shape.</p></li><li><p><strong>Mechanics Profile:</strong> The grammatical scaffolding the writing must respect, such as active voice, point of view, and punctuation rules.</p></li><li><p><strong>Terminology Profile:</strong> What vocabulary is allowed and what filler is banned, such as word choice, sentence phrasing, and descriptive language.</p></li><li><p><strong>Tonality Profile:</strong> How the article should feel to the reader, such as formality level, voice characteristics, and emotional register.</p></li></ol><p>And two are customizable:</p><ol><li><p><strong>Character Profile:</strong> Who is writing. For example, I added my biography here. 
This should be adapted per user.</p></li><li><p><strong>Article Profile:</strong> Special article characteristics such as the structure, referencing, and citations. This can be swapped to a LinkedIn, Reddit, or X profile to adapt the system to different formats.</p></li></ol><p>The <strong>second trick</strong> is to force the LLM to respect the article guideline and research over anything else, to ensure the user gets what they expect and that Brown adheres only to the research to avoid hallucinations.</p><p>The <strong>third trick</strong> is to add a set of few-shot examples, which beats anything else because showing works better than telling. For the best quality this should be changed when switching article formats and especially when switching content formats.</p><p>After we compile our system prompt, we call Gemini at a 0.7 temperature to produce a first draft with more randomness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yFlH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yFlH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 424w, https://substackcdn.com/image/fetch/$s_!yFlH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 848w, 
https://substackcdn.com/image/fetch/$s_!yFlH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 1272w, https://substackcdn.com/image/fetch/$s_!yFlH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yFlH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png" width="1200" height="946" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:946,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Brown's writing loop. Six profiles compose the system prompt; a Generator-Reviewer-Editor loop iterates until the draft passes review.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Brown's writing loop. Six profiles compose the system prompt; a Generator-Reviewer-Editor loop iterates until the draft passes review." title="Brown's writing loop. Six profiles compose the system prompt; a Generator-Reviewer-Editor loop iterates until the draft passes review." 
srcset="https://substackcdn.com/image/fetch/$s_!yFlH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 424w, https://substackcdn.com/image/fetch/$s_!yFlH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 848w, https://substackcdn.com/image/fetch/$s_!yFlH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 1272w, https://substackcdn.com/image/fetch/$s_!yFlH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c1496a1-738f-4f84-b063-fd8d445d5a70_1200x946.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: Brown&#8217;s writing loop. Six profiles compose the system prompt; a Generator-Reviewer-Editor loop iterates until the draft passes review.</em></figcaption></figure></div><p>After the first generation pass, we start an evaluator-optimizer loop. A Reviewer node running at temperature 0.0 checks the draft against the guideline, research, and profiles to ensure it meets all the expected requirements. The Reviewer node returns a list of structured review objects via Pydantic. If it finds issues, an Editor node running at temperature 0.1 applies the fixes.</p><p>The evaluator-optimizer loop runs for a fixed iteration count, not until a quality score is good enough. Because writing, like any creative work, is highly subjective, a single quality score becomes noisy and unpredictable. Empirically, running the loop for a fixed number of iterations yields better results and gives us more control over cost and latency.</p><p>Because the article might still need polish after those iterations, we expose editing tools through the MCP server so the user can kick off another review-edit iteration on demand.</p><p>Now let&#8217;s see how we transform raw Mermaid output into branded diagrams.</p><h2>Generating Branded Images</h2><p>Brown produces Mermaid source for every diagram in the article. Mermaid is fast and predictable to generate with LLMs, but its diagrams are visually generic. In theory, you can customize them. But let&#8217;s be honest. They are ugly. 
Thus, the job of this stage is to keep the structure of the Mermaid diagrams Brown produced while applying a styling layer on top of them.</p><p>We use a skill that leverages Gemini&#8217;s Nano Banana for the style transfer. The skill takes a file as input and detects all the Mermaid diagrams in it. Then it launches subagents in parallel, one per diagram. Each invokes the Gemini script on the raw Mermaid text and outputs a styled PNG.</p><p>Here is the prompt engineering behind the styling:</p><ul><li><p>The branding is referenced through both a written file with color codes, fonts, and general guidelines and a representative image.</p></li><li><p>Two positive examples, each pairing a Mermaid input with a correctly styled output.</p></li></ul><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83770f76-610c-4f37-9a7b-eeef0e7d9028_406x643.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc8ff545-b867-4f12-ab8f-a11e83a7ed2b_1456x720.png&quot;}],&quot;caption&quot;:&quot; Image 8: Positive few-shot example&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c8d6ee6-897f-41c2-ba75-42a7100660b8_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><ul><li><p>Two negative examples, each pairing a Mermaid input with a faulty styled output.<br></p><div class="image-gallery-embed" 
data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ab302a2-6dd3-4948-9f8a-0d9509900edd_1262x1614.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40004311-517b-45dc-8b95-7461e8785d4d_912x1168.png&quot;}],&quot;caption&quot;:&quot;Image 9: Negative few-shot example&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a82133df-94b6-4842-88d6-b4428a5ff50a_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div></li></ul><p>When using images as few-shot examples, be careful not to go overboard, as their token cost adds up quickly. Adding the positive and negative examples on top of the standalone style images and files was the special sauce that made everything work for us, because they show Nano Banana how to map a Mermaid input to its styled output.</p><p>Now, let&#8217;s see how we generate punchy titles and relevant SEO.</p><h2>Generating Title &amp; SEO</h2><p>Title and SEO are the most important components. They decide whether the article gets read at all. Doing it by gut on a Friday night is the worst possible workflow.</p><p>The pipeline replaces gut with an expand-and-narrow loop. We generate nine versions from many angles, score ruthlessly, and keep only the top four. We run this loop for three rounds.</p><p>The generator creates nine candidate packages per round, each containing a title, subtitle, SEO title, and SEO description. Each package takes a different angle, such as personal transformation, curiosity, making bold claims, or showing proof of work. 
The idea is to have a lot of diversity during the expansion round.</p><p>The validator scores every candidate on six rubric-anchored dimensions: title, subtitle, SEO title, and SEO description quality, article alignment, and cohesion across the four pieces. It uses hybrid scoring, combining an LLM-judge for the qualitative rubrics and heuristic penalties for the hard constraints such as character count. For example, shorter titles score higher.</p><p>Then, based on the scores generated by the validator, we pick the top four winners and use them as seeds for the next round of generation.</p><p>The key here is to make the validator a subagent that doesn&#8217;t share the same context window as the generator to avoid any type of bias. Fresh eyes prevent self-confirmation bias. This is the same principle Brown uses for its evaluator-optimizer split and Nova uses for its filter step.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ixet!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ixet!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 424w, https://substackcdn.com/image/fetch/$s_!Ixet!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 848w, https://substackcdn.com/image/fetch/$s_!Ixet!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 
1272w, https://substackcdn.com/image/fetch/$s_!Ixet!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ixet!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png" width="1400" height="1078" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1078,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1496877,&quot;alt&quot;:&quot;The expand-and-narrow loop. 9 angles &#215; 3 rounds, scored by an isolated validator on 6 dimensions, narrowing to a top-4 for A/B testing.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The expand-and-narrow loop. 9 angles &#215; 3 rounds, scored by an isolated validator on 6 dimensions, narrowing to a top-4 for A/B testing." title="The expand-and-narrow loop. 9 angles &#215; 3 rounds, scored by an isolated validator on 6 dimensions, narrowing to a top-4 for A/B testing." 
srcset="https://substackcdn.com/image/fetch/$s_!Ixet!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 424w, https://substackcdn.com/image/fetch/$s_!Ixet!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 848w, https://substackcdn.com/image/fetch/$s_!Ixet!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 1272w, https://substackcdn.com/image/fetch/$s_!Ixet!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c04250a-50e3-4908-9312-8e42b2071a2a_1400x1078.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 10: The expand-and-narrow loop. 9 angles &#215; 3 rounds, scored by an isolated validator on 6 dimensions, narrowing to a top-4 for A/B testing.</em></figcaption></figure></div><p><strong>So why keep the top four, and not just the single best?</strong> When scheduling on Substack, we keep the top four for A/B testing rather than committing to the single highest-scoring candidate. The validator is good but not omniscient, so we let real readers settle close calls.</p><h2>Exporting to HTML</h2><p>The last step is to compile the Markdown article into HTML so we can easily copy-paste everything into Substack. Boring, but necessary.</p><p>For this step, we created a skill that wraps <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Tivadar Danka&quot;,&quot;id&quot;:10322584,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!09ow!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F3b26cd48-153a-4207-b1e3-e14e1ec8d5e8_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;90c20a7b-6f1f-4610-b387-9577d43f7932&quot;}" data-component-name="MentionToDOM"></span>&#8217;s <a href="https://github.com/the-palindrome/nb2wb">nb2wb</a> CLI tool, which does all the heavy lifting. The tool supports most popular formats, such as Substack, Medium, X, and LinkedIn.</p><p>Initially, it was built to map Jupyter Notebooks to these formats, but it works amazingly well for Markdown files too.</p><h2>What Stays Irreplaceable</h2><p>It is not 100% automated. 
I still follow the original research direction. I still write the outline brain dump. I still validate every artifact. I still write the code that runs the pipeline.</p><p>The 90% automation is real, but the remaining 10% is the part that matters most. It&#8217;s the part that makes this article stand out as human: the seed, the taste, and the validation are irreplaceable.</p><h2>What&#8217;s Next</h2><p>You might wonder how well this works. Well, you <strong>just read an article</strong> created by this exact workflow. In other words, this is an article that talks about itself. It&#8217;s not yet perfect, but it will get there.</p><p>End-to-end, it took about a day of my time. Without the pipeline, this same article would have taken three days of mostly translation work.</p><div class="callout-block" data-callout="true"><p><strong>&#128161; Want to build this exact stack yourself?</strong> Nova and Brown built with FastMCP &amp; LangGraph, the style-transfer skill, human-in-the-loop orchestration, evaluation with Opik, and deployment on Docker, GCP, and GitHub Actions. Every line walked through with me and the Towards AI team. 
That&#8217;s exactly what we teach in our <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering Course</a></strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KX3c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!KX3c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!KX3c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 1272w, https://substackcdn.com/image/fetch/$s_!KX3c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KX3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png" width="1280" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;placeholder&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="placeholder" title="placeholder" srcset="https://substackcdn.com/image/fetch/$s_!KX3c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 424w, https://substackcdn.com/image/fetch/$s_!KX3c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 848w, https://substackcdn.com/image/fetch/$s_!KX3c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 1272w, https://substackcdn.com/image/fetch/$s_!KX3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba704c84-6c91-43fa-9ade-fd25ee51f175_1280x720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></div><p>Otherwise, here is what I&#8217;m wondering:</p><p><em>Which step of your own writing workflow do you think is the most dangerous to automate, and which one have you been avoiding automating because you weren&#8217;t sure how? </em></p><p><em>Click the button below and tell me. I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-i-automated-91-percent-of-my-business/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-i-automated-91-percent-of-my-business/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to restack this for your readers.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-i-automated-91-percent-of-my-business?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-i-automated-91-percent-of-my-business?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Further Reading</h2><ol><li><p>LangChain. <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">The Anatomy of an Agent Harness</a></p></li><li><p>Anthropic. <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">Effective Harnesses for Long-Running Agents</a></p></li><li><p>Mitchell Hashimoto. <a href="https://mitchellh.com/writing/my-ai-adoption-journey">My AI Adoption Journey</a></p></li><li><p>The Agent Stack. <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-1-control">OpenClaw Architecture Part 1</a></p></li><li><p>cefboud. 
<a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">How Coding Agents Actually Work: Inside OpenCode</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Your RAG Pipeline Is Overkill]]></title><description><![CDATA[The pattern that lets your model write code to explore its context instead of retrieving it.]]></description><link>https://www.decodingai.com/p/recursive-language-models</link><guid isPermaLink="false">https://www.decodingai.com/p/recursive-language-models</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 07 Apr 2026 11:03:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jJY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We constantly fight a battle against the context window limit. You either compress your data until it loses meaning, or you build a massive infrastructure project just to read a few documents. Today, we look at a third option. We explore a pattern that allows models to read millions of tokens by treating data as an environment rather than an input.</p><p>In most AI projects, such as the financial assistant I am working on, there is a constant battle between Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Should you implement a heavy RAG architecture up front that might not even work, or does CAG get the job done? For example, in our financial assistant system, we ultimately decided to use RAG only when we really HAVE to, because it introduces zigzag retrieval patterns that require dozens of queries per operation, increasing latency.</p><p>Also, while building Brown, my writing agent, I hit another wall. 
Brown needs to ingest massive amounts of research to anchor its writing process. At 180,000 input tokens, the Gemini API became entirely unreliable.</p><p>I faced constant timeouts, disconnections, and infrastructure breakdowns. Huge contexts bring API reliability and infrastructure stability issues, as well as performance degradation. Still, I didn&#8217;t want to overcomplicate my solution with a RAG layer, so I started looking for alternatives.</p><p>Most engineers face this painful tradeoff when working with large documents. You can stuff everything into the context window, but performance degrades quickly. This is context rot: attention degrades over long contexts, and earlier information loses its influence <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">[1]</a>, <a href="https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/">[2]</a>.</p><p>Alternatively, you can build a RAG pipeline. But that requires maintaining vector databases, chunking strategies, and retrieval evaluation infrastructure.</p><p>Even the tools we use daily, like Claude Code or Cursor, rely on summarization-based context compression that can lose critical information. I just wanted to dump my research into one file and get good answers without the infrastructure breaking. Recursive Language Models (RLMs) solve this exact problem <a href="https://arxiv.org/abs/2512.24601">[3]</a>.</p><p>RLMs use an inference-time pattern that treats your input as an external environment the model interacts with programmatically. You do not need chunking infrastructure or embedding pipelines. 
The model writes code to explore, filter, and recursively process your data on demand.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jJY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jJY1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jJY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three approaches to processing large documents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three approaches to processing large documents" title="The three approaches to processing large documents" srcset="https://substackcdn.com/image/fetch/$s_!jJY1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 1: The three approaches to processing large documents. RAG adds infrastructure complexity. Context stuffing causes degradation. RLMs treat the input as an external environment the model programs against.</em></figcaption></figure></div><p>This approach scales the effective input and output lengths of LLMs. Researchers tested RLMs up to 10 million tokens across GPT-5 and Qwen3-Coder, showing they easily outperform base models <a href="https://arxiv.org/abs/2512.24601">[3]</a>. Base model performance degrades as a function of input length and task complexity, while RLM performance scales with less degradation.</p><p>RLMs are also a model-agnostic inference strategy, meaning they work with any model you choose.</p><p>However, this architecture has honest downsides you must consider. 
The inference cost has high variance due to differences in trajectory lengths. The system suffers from code fragility, meaning that if the model writes buggy code, the entire reasoning chain fails.</p><p>Errors in sub-calls can compound through the recursive tree, propagating hallucinations. Sequential sub-calls also create latency bottlenecks. This makes RLMs best suited for deep thinking applications rather than real-time chat.</p><p>To understand how we bypass these infrastructure limits, we need to examine the specific programming trick that keeps the model&#8217;s memory clean.</p><p>Here is what you will learn about this pattern:</p><ul><li><p>The mechanism that keeps massive documents outside the context window.</p></li><li><p>The orchestration loop that drives programmatic data exploration.</p></li><li><p>The specific use cases where this pattern outperforms retrieval systems.</p></li><li><p>A practical method to approximate this behavior using Claude Code.</p></li></ul><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">If You Want To Go Deeper Into Production AI (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, 
https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, 
https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Patterns like RLMs show that the real challenge isn&#8217;t the model, but the infrastructure and systems around it, called the harness. If you want to master that harness, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.</p><p>Rated 5/5 by 300+ students. The first 6 lessons are free:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><div><hr></div><h2>The REPL Trick That Keeps Your Context Window Clean</h2><p>RLMs introduce a simple core idea. Do not feed the document into the model&#8217;s context window. Instead, load it as a variable in a persistent programming environment and let the model write code to interact with it <a href="https://www.primeintellect.ai/blog/rlm">[4]</a>.</p><p>The model never sees your 10-million-token document directly. In a traditional agent, the prompt goes into the model, completely blowing up your context window. 
In an RLM, the context stays outside as an external variable, and the model receives only a symbolic handle to it.</p><p>The system initializes a Read-Eval-Print Loop (REPL), an interactive programming environment in which variables and state persist across iterations <a href="https://arxiv.org/abs/2512.24601">[3]</a>.</p><p>The root model receives only metadata, such as the total character count and the data structure, along with instructions on how to access the REPL. The model then writes code to peek into the data, filter it with regex, chunk it, or summarize it.</p><p>When the model identifies a sub-task, it uses a primitive such as <code>llm_query(prompt, chunk)</code> to spawn a fresh, isolated worker sub-model <a href="https://arxiv.org/abs/2512.24601">[3]</a>. The system pauses, executes this sub-call, and returns the result to the root model&#8217;s REPL.</p><p>Because variables persist across REPL turns, the model can aggregate findings into a buffer and build the response progressively across iterations. 
Once confident, it calls <code>FINAL(answer)</code> to stop the recursive loop and return the response <a href="https://dextralabs.com/blog/recursive-language-models-rlm/">[5]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i4L_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i4L_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i4L_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The RLM REPL mechanism&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The RLM REPL mechanism" title="The RLM REPL mechanism" srcset="https://substackcdn.com/image/fetch/$s_!i4L_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: The RLM mechanism. The document stays outside the context window as a REPL variable. The model writes code to explore, decompose, and recursively process it.</em></figcaption></figure></div><p>RLMs essentially perform context engineering on autopilot. Traditional context engineering requires you to carefully curate what goes into the context window through retrieval and compression <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">[1]</a>. RLMs automate this by letting the model itself decide what to extract, filter, and process.</p><p>Costs and performance stay intact because the model filters the input context without explicitly seeing it. By writing Python scripts, the model processes only the relevant portions through sub-calls. 
Only constant-size metadata about execution results is appended to the root model&#8217;s history, keeping its context window small and clean.</p><p>With this loop understood, we can look at how it fits into a production harness.</p><h2>Turn Any Agent Into a Plan-Execute-Validate Machine</h2><p>RLMs are an inference-time orchestration pattern that maps directly to production harness engineering. If you have built agent systems, you already know the components: a planning loop, tool execution, and validation <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[7]</a>. RLMs formalize these into a programmable, recursive architecture.</p><p>A robust RLM harness uses a three-tier architecture. The root controller is a frontier model that acts as the project manager. It plans the reasoning process, writes code, and coordinates execution, but never directly interacts with tools or the full document <a href="https://www.anthropic.com/engineering/building-effective-agents">[8]</a>.</p><p>Worker sub-models are cheaper, faster models spawned via an operation such as <code>llm_query()</code> to handle specific, localized sub-tasks. Delegating to them reduces overall cost while maintaining quality. The aggregation layer is the REPL environment, which combines the results of recursive steps into a final structured response via persistent variables.</p><p>This setup maps naturally onto a plan-execute-validate loop. In the plan phase, the root controller reviews the query, creates a reasoning plan, and decides how to decompose the problem. It might plan to regex-filter a codebase, chunk a document, or batch sub-calls for parallel analysis.</p><p>In the execute phase, the model translates the plan into code. It writes Python scripts, issues <code>llm_query()</code> calls, and spawns worker sub-models for parallel execution in isolated REPL environments. 
External tools, like web search, are provided ONLY to worker sub-models, keeping the root model&#8217;s context perfectly clean.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OWkF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OWkF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OWkF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The plan-execute-validate orchestration loop&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The plan-execute-validate orchestration loop" title="The plan-execute-validate orchestration loop" srcset="https://substackcdn.com/image/fetch/$s_!OWkF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 3: The plan-execute-validate loop. The root controller plans, worker sub-models execute, the system validates, and the cycle repeats until FINAL().</em></figcaption></figure></div><p>After execution, the system enters the validation phase, where results feed back as observations. The root model assesses accuracy, launches verification sub-calls, and handles errors by dynamically adjusting its plan. If the Python code fails, the error traceback is yielded back to the model as an event.</p><p>This allows the model to adapt and fix its code on the next turn. The cycle repeats until the model calls <code>FINAL(answer)</code>.</p><p>Deploying this in the real world requires strict production guardrails. You must configure <code>maxIterations</code> to cap the number of REPL turns, typically between 10 and 50. 
You need <code>maxDepth</code> to limit the recursion depth; a depth of 1 is usually sufficient.</p><p>You also need <code>maxStdoutLength</code> to truncate the REPL output returned to the model and prevent context overflow. Finally, you need permission gating: sandboxed execution with explicit approval for sensitive operations.</p><p>Neither Claude Code nor OpenAI Codex uses true RLM patterns. They rely on summarization-based context compression, file-system state tracking, and progressive disclosure techniques <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">[9]</a>. This creates a succession of agents connected by prompts and file state, rather than maintaining a persistent REPL environment with programmatic sub-calls.</p><p>With this architecture in place, we can identify the specific real-world scenarios where this pattern outperforms traditional data processing.</p><h2>Four Scenarios Where RLMs Beat Traditional Approaches</h2><p>RLMs are best suited for deep thinking applications that require accuracy, multi-step reasoning, and reliability over massive contexts. They are not suited for real-time, low-latency chat applications.</p><p>The <strong>first scenario</strong> is parsing large files without building retrieval infrastructure. Instead of building a hybrid index with vector and graph search, you keep everything in one file or directory and use an RLM agent to extract information on demand.</p><p>We can view the relationship between RAG and RLMs as a spectrum. For simple cases, RLMs replace RAG entirely, removing the need for chunking and embeddings. For advanced scenarios, RLMs complement retrieval rather than replace it.</p><p>You use semantic search to find your first pool of candidates, write the results to disk as cached short-term memory, and use an RLM to query that refined dataset on demand.</p><p>Retrieval narrows the haystack, and the RLM reasons deeply over what is left. 
I use this exact workflow for my research, dumping everything into a massive text file and using an RLM to extract relevant information.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K9A3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K9A3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 424w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 848w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K9A3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png" width="1400" height="1208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1208,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1040009,&quot;alt&quot;:&quot;RLM replacing RAG for large file parsing&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RLM replacing RAG for large file parsing" title="RLM replacing RAG for large file parsing" srcset="https://substackcdn.com/image/fetch/$s_!K9A3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 424w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 848w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: RLM replaces the entire RAG pipeline for large file parsing. One file, one agent, no retrieval infrastructure.</em></figcaption></figure></div><p>The <strong>second scenario</strong> is complex software engineering and codebase comprehension. RLMs ingest massive codebases containing millions of tokens to answer questions about architecture, map dependencies, and perform reviews.</p><p>The RLM paper tested this on LongBench-v2 CodeQA using Qwen3-Coder with a Python REPL. 
The model writes code to break down the codebase, launches sub-queries to smaller language models, and aggregates findings <a href="https://arxiv.org/abs/2512.24601">[3]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HwsN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HwsN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HwsN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RLM decomposing a codebase through recursive sub-queries&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RLM decomposing a codebase through recursive sub-queries" title="RLM decomposing a codebase through recursive sub-queries" srcset="https://substackcdn.com/image/fetch/$s_!HwsN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 5: An RLM decomposes a codebase question into parallel sub-queries, each handled by a worker sub-model, then aggregates the results.</em></figcaption></figure></div><p>The <strong>third scenario</strong> is enterprise legal and financial analysis. RLMs provide consistent interpretation across thousands of contracts, case files, and policies that would overwhelm a standard context window. They also excel at financial audits and due diligence by tracing, validating, and reasoning through massive financial datasets.</p><p>The <strong>fourth scenario</strong> is deep research and information synthesis. RLMs synthesize research across thousands of files by programmatically filtering, chunking, and summarizing. 
They enable knowledge graph exploration and multi-hop reasoning over large document dumps.</p><p>At scale, RLMs can be both more accurate and cheaper than standard long-context approaches. They avoid paying for n-squared attention over massive contexts by having the model process only the relevant slices via sub-calls. In all these scenarios, the RLM pattern succeeds because it treats the LLM as a project manager that decides what to look at and delegates sub-tasks to workers.</p><p>With these use cases in mind, you can approximate the pattern using tools you likely already have installed.</p><h2>Build a Naive RLM SKILL in Claude Code</h2><p>Claude Code does not natively use the RLM pattern. It relies on summarization-based context compression, filesystem state tracking, and progressive disclosure. However, you can approximate RLM behavior using Claude Code&#8217;s existing harness features to build a naive RLM SKILL.</p><p>First, the SKILL sets up the environment by treating the target file or directory as a reference. Instead of feeding the content into the context window, it writes the file path and metadata to a prompt for the root agent.</p><p>Second, the root Claude Code agent receives only this metadata and a set of instructions for how to interact with it. It uses its Explore subagent type to examine the data structure, identify relevant sections, and plan its approach.</p><p>Third, the SKILL uses Claude Code&#8217;s Agent tool to spawn subagents. Each subagent receives a focused prompt to read specific lines and extract the relevant findings, returning a condensed summary of at most a couple of thousand tokens. This mirrors the RLM pattern of spawning isolated sub-calls that process slices of the input.</p><p>Finally, the root agent collects these subagent results. 
It aggregates them into a coherent answer and decides whether more exploration is needed or whether to finalize the output.</p><p>Here is what this naive RLM SKILL looks like as a <em>SKILL.md</em> file:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">---
name: rlm-research-analyzer
description: "Analyze large research files by treating
  them as an external environment. Instead of stuffing
  content into context, the model explores, decomposes,
  and recursively processes the data through subagents."
---

# Analyze Large Research Files Using the RLM Pattern

## Step 1 &#8212; Initialize the environment

Accept the target file path as an argument. Do NOT read
the file into context. Instead, run a Bash command to
collect metadata:

    wc -l &lt;file_path&gt;   # total lines
    wc -c &lt;file_path&gt;   # total bytes
    head -5 &lt;file_path&gt; # short prefix

Write the metadata and file path to a temporary prompt
file at &lt;working_dir&gt;/rlm_prompt.md. The root agent
receives ONLY this metadata, never the full content.
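
A rough Python equivalent of the commands above, as a
sketch only. The names collect_metadata and write_prompt
and the layout of rlm_prompt.md are illustrative
assumptions, not part of Claude Code or of any RLM
library. It counts a final line that has no trailing
newline, so its total can differ by one from wc -l.

```python
import os


def collect_metadata(file_path, prefix_lines=5):
    """Return size facts and a short prefix without loading the whole file."""
    line_count = 0
    prefix = []
    with open(file_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # stream line by line instead of calling read()
            line_count += 1
            if len(prefix) != prefix_lines:
                prefix.append(line.rstrip("\n"))
    return {
        "path": os.path.abspath(file_path),
        "total_lines": line_count,
        "total_bytes": os.path.getsize(file_path),
        "prefix": prefix,
    }


def write_prompt(metadata, working_dir):
    """Write the metadata to rlm_prompt.md (hypothetical layout)."""
    prompt_path = os.path.join(working_dir, "rlm_prompt.md")
    body = [
        "# RLM environment",
        "file_path: " + metadata["path"],
        "total_lines: " + str(metadata["total_lines"]),
        "total_bytes: " + str(metadata["total_bytes"]),
        "prefix:",
    ]
    body.extend("    " + line for line in metadata["prefix"])
    with open(prompt_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(body) + "\n")
    return prompt_path
```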

## Step 2 &#8212; Plan the exploration

Read rlm_prompt.md. Based on the metadata and prefix,
decide how to decompose the file. Use an Explore
subagent to scan the file structure:

- Identify section boundaries, headings, or delimiters
- Estimate which regions are relevant to the query
- Produce a ranked list of target ranges to process
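
One possible way to find section boundaries, sketched in
Python under the assumption that the target is a markdown
file whose sections start with # headings. The helper name
plan_ranges and the 200-line cap are hypothetical, and the
sketch does not rank ranges by relevance; that judgment
stays with the Explore subagent.

```python
def plan_ranges(lines, max_lines=200):
    """Split a document into (start, end) ranges, 1-indexed and inclusive."""
    # Line numbers of headings, matching the "Read lines start-end" prompt.
    starts = [number for number, text in enumerate(lines, start=1)
              if text.startswith("#")]
    # Make sure any text before the first heading gets its own range.
    if not starts or starts[0] != 1:
        starts.insert(0, 1)
    ranges = []
    for index, start in enumerate(starts):
        is_last = index == len(starts) - 1
        end = len(lines) if is_last else starts[index + 1] - 1
        # Cut oversized sections into fixed-size pieces.
        for chunk_start in range(start, end + 1, max_lines):
            ranges.append((chunk_start, min(chunk_start + max_lines - 1, end)))
    return ranges
```

For example, with max_lines=2 and a five-line file that
has headings on lines 1 and 4, this yields the ranges
1-2, 3-3, and 4-5.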

## Step 3 &#8212; Delegate to worker subagents

For each target range, spawn an Agent subagent with a
focused prompt:

"Read lines {start}-{end} of {file_path}. Extract all
findings related to {query}. Return a summary under
2000 tokens."

Launch multiple subagents in parallel when ranges are
independent. Write each subagent's output to
&lt;working_dir&gt;/slice_{n}.md.

## Step 4 &#8212; Aggregate and finalize

Read all slice files. Synthesize the findings into a
single coherent answer. If gaps remain, return to
Step 3 with new target ranges. Otherwise, write the
final output to &lt;working_dir&gt;/answer.md and present
it to the user.</code></pre></div><p>Notice how the steps map directly to RLM primitives. Step 1 mirrors REPL initialization, where the data becomes an external variable rather than context input. Step 3 replaces the theoretical <code>llm_query()</code> operation with Claude Code&#8217;s Agent tool. Step 4 mirrors the <code>FINAL()</code> call that terminates the recursive loop.</p><p>This naive approximation lacks several critical features. It has no true REPL persistence, as Claude Code subagents do not share a persistent variable space. The filesystem serves as a proxy for REPL state, but it is slower and less elegant.</p><p>It also lacks sandboxing, as Claude Code runs directly in your environment. You also miss out on configurable guardrails like <code>max_iterations</code> and <code>max_output_chars</code>, so you have to enforce limits manually.</p><p>Still, I&#8217;ve been using a similar technique in all my current projects: instead of stuffing the research into a single file, I dump everything into a directory and link it all together in an <code>index.yaml</code> file that contains URIs to all the files, plus metadata such as the title and a 1-2 sentence summary of each source. This way, starting from the <code>index.yaml</code> file, Claude Code can navigate the whole research dump token-efficiently through progressive disclosure.</p><p>My structure looks something like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">research/
&#9500;&#9472;&#9472; index.yaml
&#9500;&#9472;&#9472; file_1.md
&#9500;&#9472;&#9472; file_2.md
&#9500;&#9472;&#9472; ...
&#9492;&#9472;&#9472; file_N.md</code></pre></div><p>The only out-of-the-box implementation I have found is in the <a href="https://dspy.ai/api/modules/RLM/">DSPy framework</a>.</p><p>The naive SKILL is a useful thought exercise and a practical first step. For production use, reach for the DSPy framework&#8217;s <code>dspy.RLM</code> module instead.</p><h2>What&#8217;s Next</h2><p>RLMs represent a fundamental shift in how we process large inputs. We are moving from asking how to fit data in the context window to asking how to let the model interact with it programmatically. This is a great thought exercise on integrating specialized inference-time functionality into your harness.</p><p>As models get better at writing code and REPL environments become more sophisticated, the boundary between the model and its infrastructure will blur. The model does not just use tools; it writes them on the fly to solve the specific problem in front of it.</p><p>Your next practical step is to experiment with the SKILL above or with the DSPy framework&#8217;s <code>dspy.RLM</code> module on a real problem. Point it at a large codebase you need to understand or a research corpus you need to synthesize. Start with something you have been using RAG or context stuffing on, and see whether the RLM approach is more effective.</p><p><em>But here is what I&#8217;m wondering: </em></p><p><em><strong>How have you been passing large files, such as deep research results or books, to your agents so far? RAG, CAG, or other creative techniques?</strong></em></p><p><em>Click the button below and tell me. 
I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/recursive-language-models/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/recursive-language-models/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to restack this for your readers. </em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/recursive-language-models?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/recursive-language-models?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h4>Whenever you&#8217;re ready, here is how I can help you</h4><p>If you want to go from zero to shipping production-grade AI agents, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.   </p><p><em>Rated 5/5</em> by 300+ students. 
The first 6 lessons are free:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><p><em>Not ready to commit?</em> Start with our <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free Agentic AI Engineering Guide</a></strong>, a 6-day email course on the mistakes that silently break AI agents in production.</p><div><hr></div><h2>References</h2><ol><li><p>(n.d.). Effective Context Engineering for AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a></p></li><li><p>(n.d.). MIT&#8217;s new &#8216;recursive&#8217; framework lets LLMs process 10 million tokens without context rot. VentureBeat. <a href="https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/">https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/</a></p></li><li><p>Zhang, A. L., Kraska, T., &amp; Khattab, O. (2025). Recursive Language Models. arXiv. <a href="https://arxiv.org/abs/2512.24601">https://arxiv.org/abs/2512.24601</a></p></li><li><p>(n.d.). Recursive Language Models: the paradigm of 2026. Prime Intellect. 
<a href="https://www.primeintellect.ai/blog/rlm">https://www.primeintellect.ai/blog/rlm</a></p></li><li><p>(n.d.). Why Recursive Language Models (RLMs) Beat Long-Context LLMs. Dextra Labs. <a href="https://dextralabs.com/blog/recursive-language-models-rlm/">https://dextralabs.com/blog/recursive-language-models-rlm/</a></p></li><li><p>Mansurova, M. (2026, March 30). Going Beyond the Context Window: Recursive Language Models in Action. Towards Data Science. <a href="https://towardsdatascience.com/going-beyond-the-context-window-recursive-language-models-in-action/">https://towardsdatascience.com/going-beyond-the-context-window-recursive-language-models-in-action/</a></p></li><li><p>(2026, March 21). The Anatomy of an Agent Harness. LangChain Blog. <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">https://blog.langchain.com/the-anatomy-of-an-agent-harness/</a></p></li><li><p>(2025, December 24). Building Effective AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/building-effective-agents">https://www.anthropic.com/engineering/building-effective-agents</a></p></li><li><p>(2026, March 25). Effective Harnesses for Long-Running Agents. Anthropic. 
<a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Agentic Harness Engineering]]></title><description><![CDATA[Building systems that transform the LLM into the new operating system]]></description><link>https://www.decodingai.com/p/agentic-harness-engineering</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-harness-engineering</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 31 Mar 2026 11:03:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!imx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the AI start-up I&#8217;ve been working at, building a financial personal assistant, we implemented LlamaIndex, added the Model Context Protocol (MCP), and built complex Retrieval-Augmented Generation (RAG) pipelines. Each piece added complexity without adding direct business value.</p><p>When we stripped everything back to plain Python, simple API calls, and a custom ReAct engine, things finally worked. What we accidentally built was a harness featuring specialized financial tools, domain-specific guardrails, and purpose-built context engineering.</p><p>We did not know the term yet, but the lesson was clear. The model was never the problem. The system and infrastructure around it were.</p><p>Most engineering teams obsess over which model to use. They debate GPT-4o versus Claude Opus versus Gemini. They chase LLM benchmark scores and swap models, hoping for better results.</p><p>But the model is only half the equation. 
The system and infrastructure around it determine whether your agent actually works in production.</p><p>TerminalBench 2.0 proved this. Changing only the harness moved the DeepAgent from LangChain from outside the top 30 to the top 5 <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!imx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!imx1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!imx1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!imx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agent = Model + Harness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agent = Model + Harness" title="Agent = Model + Harness" srcset="https://substackcdn.com/image/fetch/$s_!imx1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!imx1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 1: Agent = Model + Harness. The harness is everything that isn&#8217;t the model.</em></figcaption></figure></div><p>This is what usually happens. You have a powerful model. You gave it tools and a prompt. It works in demos.</p><p>But shipping it to production means solving problems the model cannot solve alone. You must bridge context windows, recover from failures, serve multiple interfaces, and manage state across sessions.</p><p>The solution is harness engineering. This is the discipline of building the infrastructure around the model so it can do useful work reliably. 
As Mitchell Hashimoto noted, harness engineering is the practice of engineering a solution every time an agent makes a mistake, ensuring it never makes that specific mistake again <a href="https://mitchellh.com/writing/my-ai-adoption-journey">[2]</a>.</p><p>By the end of this article, you will learn:</p><ul><li><p>What an agent harness actually is.</p></li><li><p>The core components powering production AI systems.</p></li><li><p>How the planning loop dictates agent actions.</p></li><li><p>The design principles behind an effective toolset.</p></li><li><p>How to manage memory using the filesystem.</p></li></ul><p>Before we look at all its components and how they fit together, we must first define what a harness actually is.</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Your Path to Agentic AI Engineering for Production (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 
1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most engineers know the theory behind agents, context engineering, and RAG. What they lack is the confidence to architect, evaluate, and deploy these systems in production. 
The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI,</a> closes that gap across 34 lessons (articles, videos, and a lot of code).</p><p>By the end, you will have gone from <em>&#8220;I built a demo&#8221;</em> to <em>&#8220;I shipped a production-grade multi-agent system with evals, observability, and CI/CD.&#8221;</em> Three portfolio projects, a certificate to back them up in interviews, and a Discord community with direct access to industry experts.</p><p><strong>Rated 5/5</strong> &#11088;&#65039; by 300+ early students saying <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and that is <em>&#8220;An excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p><em>Start learning today. The first 6 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Enroll here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Enroll here</span></a></p><div><hr></div><h2>So... What the Heck Is a Harness?</h2><p>While talking with Jonathan Gennick from Manning, he said that the first time he heard about the term &#8220;harness&#8221; was in the context of horses. Let me explain. 
A horse is powerful on its own, but useless for farming without a harness. The straps and reins let you direct its strength toward useful work. The same applies to LLMs.</p><p>The model has intelligence. But without tools, memory, state, guardrails, and orchestration, you cannot put it to work reliably.</p><p>LangChain offers the clearest definition. <strong>An agent equals a model plus a harness.</strong> The harness is every piece of code, configuration, and execution logic that is not the model itself <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><p>A basic agent, as usually described, is just a model, a prompt, tools, and a planning loop. A harness extends this by adding memory systems, guardrails, advanced orchestration, context engineering, and multi-agent coordination.</p><p>Usually, it also includes a serving layer that connects the agent to various user interfaces, such as terminal apps, web dashboards, IDE plugins, and messaging apps like Telegram.</p><p>Ultimately, a harness is the real software application built around an LLM or other model, with the model acting as its operating system. Applications like Claude Code, OpenCode, OpenClaw, and Codex are all harnesses. 
You could swap the model inside them, but the real engineering value lives in the harness itself.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oe0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Oe0K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:757851,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/192391298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oe0K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 2: The three levels of engineering: Prompt engineering is crafting instructions, context engineering is managing what the model sees, and harness engineering is the full infrastructure.</figcaption></figure></div><p>This introduces three distinct levels of engineering. Prompt engineering crafts the instructions. Context engineering dictates what goes into the context window and when.</p><p>Harness engineering is the full application and infrastructure. It controls when context loads, which tools are available, which actions are allowed, and how failures are handled. 
Each level encompasses the previous one <a href="https://youtube.com/watch?v=zYerCzIexCg">[3]</a>.</p><p>Now that you understand what a harness is, the next step is to explore the internal architecture and see how these pieces connect.</p><h2>The Anatomy of a Harness</h2><p>A complete harness consists of the LLM, tools, a planning loop, context engineering, a sandbox, memory, an orchestration layer, and a serving layer. In other words, the pieces that have been floating around the AI space separately are finally coming together into one coherent system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T0f9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T0f9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!T0f9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Full harness architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Full harness architecture" title="Full harness architecture" srcset="https://substackcdn.com/image/fetch/$s_!T0f9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: The full harness architecture &#8212; from the model at the center to the serving layer at the edge.</em></figcaption></figure></div><p>One of the most distinctive features of modern harnesses is the multi-surface architecture. OpenClaw serves the same agent across a terminal user interface (TUI), a web UI, desktop apps, Slack, Telegram, and WhatsApp through a centralized Gateway that speaks a typed WebSocket protocol.</p><p>Codex evolved from a simple terminal tool to an App Server using JSON-RPC over standard input and output. 
OpenCode uses a Bun HTTP server that any client can connect to, with an Event Bus that broadcasts results in real time <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-1-control">[4]</a>, <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>, <a href="https://blog.bytebytego.com/p/how-openai-codex-works">[6]</a>.</p><p>This architecture introduces challenges. Multiple messages arrive in parallel from different clients. Users ask questions while the model is still processing.</p><p>To solve this, systems use priority queues and message buses. OpenClaw uses a lane-aware FIFO queue to ensure only one active run per session while allowing parallelism across different sessions.</p><p>At the core of all this infrastructure, the filesystem is king. As the most foundational harness primitive, it enables durable storage, workspace management, multi-agent collaboration, and versioning.</p><p>You heard me right: there is no fancy vector database in place. With AI, we are going back to basics, and nothing is purer than the filesystem itself.</p><p>Every production harness uses the filesystem as its primary state mechanism <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><p>You might wonder if this is just traditional orchestration like Airflow. It is different in three key ways: the agent loop is non-deterministic, context management is a first-class concern, and the programmer inside the loop is the LLM itself. It is common to add durability to the harness using tools such as Prefect, Temporal, or DBOS, which natively support dynamic pipelines rather than predefined, rigid DAGs.</p><p>Let us zoom in on the first and most fundamental component: the planning loop.</p><h2>How the Agent Decides What to Do Next</h2><p>The most common pattern for the planning loop is ReAct, which stands for Reasoning and Acting. 
The model receives the current state, reasons about what to do next, takes an action via a tool call, and observes the result. This cycle repeats continuously until a strict stopping condition is met <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>.</p><p>Consider a concrete example. A user asks the agent to fix a failing test. First, the model reads the test output, reasons that the import path is wrong, and edits the file through a tool.</p><p>Second, it re-runs the tests, sees a new type mismatch error, and fixes it. Third, it runs the tests again.</p><p>They pass, the model reasons the job is done, and it stops. The harness orchestrates this loop, while the model reasons and picks actions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sIaN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sIaN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sIaN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sIaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ReAct loop and orchestrator-worker pattern&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ReAct loop and orchestrator-worker pattern" title="ReAct loop and orchestrator-worker pattern" srcset="https://substackcdn.com/image/fetch/$s_!sIaN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sIaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: The ReAct loop drives every agent action. For complex tasks, an orchestrator delegates to specialized workers, each with its own context window.</em></figcaption></figure></div><p>When tasks are too complex for a single agent, harnesses use orchestrator-worker patterns. 
The orchestrator decomposes a task, delegates subtasks to specialized workers, and aggregates the results.</p><p>In OpenCode, a dedicated <em>task</em> tool spawns subagents. Each subagent gets its own session, context window, and restricted tool set <a href="https://www.anthropic.com/research/building-effective-agents">[7]</a>.</p><p>For tasks that span multiple context windows, Claude Code implements <em>Ralph Loops</em>. This harness mechanism intercepts the model&#8217;s attempt to exit via a hook. It reinjects the original prompt in a clean context window, forcing the agent to continue against a completion goal using the state persisted on the filesystem <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><p>While automating my business with agents, I learned a hard lesson about orchestration. I initially built five specialized agents, each handling one step.</p><p>I eventually found that a single agent with memory and smart context engineering outperformed the whole swarm. Always start with one well-harnessed agent before reaching for multi-agent complexity.</p><blockquote><p><em>Here is a deep dive into planning:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b381af34-4084-454a-8a68-0de07e50251c&quot;,&quot;caption&quot;:&quot;Welcome to the AI Agents Foundations series: A 9-part journey from Python developer to AI Engineer. Made by busy people. 
For busy people.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Does Memory for AI Agents Work?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:110559689,&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;bio&quot;:&quot;Senior AI Engineer &#8226; Founder @ Decoding AI &#8226; Author @ LLM Engineer&#8217;s Handbook I ship AI products and teach you about the process.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-12-02T12:03:49.149Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!G5CM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6f2d58-f21f-4b49-b4f0-fb553fc28e36_1200x1200.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.decodingai.com/p/how-does-memory-for-ai-agents-work&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:180239220,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:76,&quot;comment_count&quot;:7,&quot;publication_id&quot;:1526003,&quot;publication_name&quot;:&quot;Decoding AI Magazine&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!k2ig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>While the planning loop decides the next step, the agent still needs a way to interact with its environment.</p><h2>The Tools That Let Agents Act</h2><p>This interaction happens 
through a specific toolkit designed for autonomous execution.</p><p>First, <em>Bash</em> is a general-purpose tool. The agent can run any shell command to execute tests, linters, or builds. This gives the model code execution capabilities so it can design its own tools on the fly rather than being constrained by fixed options.</p><p>For example, the agent can execute a Python snippet inline with <code>python -c "..."</code>, generate a script and run it with <code>python main.py</code>, or run your existing code with <code>python -m my_module.main</code>.</p><p>Second, specialized filesystem tools handle common operations like reading, writing, editing, and searching. Doing file operations via Bash is slow and error-prone.</p><p>Specialized tools include safety checks. For instance, a read tool enforces absolute paths and line limits, while an edit tool verifies that the string being replaced occurs exactly once in the file.</p><p>Third, state management tools track session-scoped tasks. These give the agent working memory within a single session. 
For example, OpenCode has <code>ToDoAdd</code> and <code>ToDoRead</code> tools that add tasks to and read tasks from a queue, so the agent can keep track of the plan it has to execute.</p><p>Finally, orchestration tools launch subagents with their own isolated prompts and context windows, such as OpenCode&#8217;s <code>task</code> tool or Claude Code&#8217;s <code>agent</code> tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kMuj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kMuj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kMuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Standard harness toolkit&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Standard harness toolkit" title="Standard harness toolkit" srcset="https://substackcdn.com/image/fetch/$s_!kMuj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: The standard harness toolkit organized by design principle &#8212; from general-purpose bash to specialized filesystem tools to orchestration.</em></figcaption></figure></div><p>Feedback loops are the most important principle around tooling. Boris Cherny, the creator of Claude Code, noted that giving the model a way to verify its work improves quality by two to three times. For example, OpenCode integrates the Language Server Protocol (LSP) to get real-time code definitions and diagnostics.</p><p>Undefined variables and type errors are fed back to the LLM for immediate correction. These tools do not act on the world. They feed vital information back to the planning loop.</p><p>Harnesses also enforce tool access control. In OpenCode, the planning agent cannot call edit tools. 
This prevents exploratory agents from accidentally modifying your code <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>.</p><blockquote><p><em>Here is a deep dive into tool calling:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8cbbec02-7a2c-4e79-a765-07ca9904f17e&quot;,&quot;caption&quot;:&quot;Welcome to the AI Agents Foundations series&#8212;a 9-part journey from Python developer to AI Engineer. Made by busy people. For busy people.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Tool Calling From Scratch to Production: The Complete Guide&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:110559689,&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;bio&quot;:&quot;Senior AI Engineer &#8226; Founder @ Decoding AI &#8226; Author @ LLM Engineer&#8217;s Handbook I ship AI products and teach you about the process.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-10-28T08:00:55.938Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cv8k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F362128ff-7821-482e-b08a-8252d0faab99_1200x1200.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.decodingai.com/p/tool-calling-from-scratch-to-production&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:176436971,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:50,&quot;comment_count&quot;:5,&quot;publication_id&quot;:1526003,&quot;publication_name&quot;:&quot;Decoding AI 
Magazine&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!k2ig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>Once the agent has its tools, it needs a secure place to use them. In production, this requires strict isolation.</p><h2>Where Agents Run</h2><p>Agents execute code, and that code can fail, crash, or delete all your files. I know I want my precious notes protected. Sandboxes isolate agent execution so failures do not affect the host system or other agents. The cherry on top is that they also enable horizontal scaling across parallel environments.</p><p>There is a strict tradeoff between security and capability. Not every harness uses the same approach. Codex uses a hard sandbox.</p><p>Each task runs in an isolated cloud container preloaded with the repository. This provides maximum safety, but the agent cannot access the host filesystem <a href="https://blog.bytebytego.com/p/how-openai-codex-works">[6]</a>.</p><p>Conversely, OpenClaw uses a soft sandbox. The workspace is the default working directory. This grants maximum capability but introduces more risk.</p><p>OpenClaw deliberately avoids hard sandboxing to preserve full filesystem access. Most production harnesses sit somewhere between these extremes, depending on the trust model.</p><p>When you submit a task to Codex, the harness spins up a fresh cloud container. The agent works inside this container to read files, run tests, and install packages.</p><p>It cannot touch your local machine. When the job finishes, the results are extracted, and the container is destroyed.</p><p>Along with security, a major benefit of cloud sandbox environments is that they give the agent access to powerful computing resources. 
For example, if you want to train a model on a GPU, you can ask the agent to implement and run a training pipeline inside a GPU-backed sandbox.</p><p>This is similar to SSHing into different VMs and running the code there by hand. Based on the same principles, you can easily spin up multiple cloud sandboxes and run your agents in parallel.</p><p>On the other side of the spectrum, you can also run sandbox environments locally through Docker containers or isolated processes, similar to what Cursor does. This is useful when you want to try something out and give the agent full permissions so that you do not have to supervise it.</p><p>While sandboxes provide a safe space for execution, they are ephemeral by design.</p><h2>Memory Is Just the Filesystem</h2><p>To survive across sessions and context windows, every harness manages state across three distinct memory layers. The first layer is the filesystem. This is the long-term memory.</p><p>It is durable and persistent, surviving across sessions. This is where progress files, git history, and session transcripts live.</p><p>The second layer is RAM. This is the short-term memory, also known as the working memory. It holds the conversation history and tool results during an active session. It is fast but volatile.</p><p>The third layer is the context window. This is what the model actually sees. 
It is the strictest constraint, as everything the model knows about the current task must fit here.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jici!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jici!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!jici!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!jici!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!jici!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jici!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:681800,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/192391298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jici!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!jici!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!jici!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!jici!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 6: The three-layer memory dynamics &#8212; filesystem as long-term state, RAM as working memory, context window as what the model sees. The cycle repeats: load &#8594; process &#8594; flush.</figcaption></figure></div><p>The harness orchestrates the dynamics between these layers. On the read path, the harness selectively loads relevant state from the disk into the RAM.</p><p>It then assembles the context window using context engineering techniques such as compaction, progressive disclosure, and just-in-time retrieval. On the write path, the harness persists important state back to the disk after processing.</p><p>OpenClaw enforces a strict invariant that memory is always flushed to disk before being discarded from context. 
Rehydration is treated as a tool-shaped action, where the agent searches and then retrieves specific data, rather than dumping everything into the context window <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">[8]</a>.</p><p>Context engineering makes this possible. When the token count exceeds ninety percent of the context limit, OpenCode automatically summarizes the conversation. Codex assembles prompts from multiple sources and exploits prompt caching.</p><p>Anthropic recommends using structured note-taking files and sub-agent architectures to isolate context <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>, <a href="https://blog.bytebytego.com/p/how-openai-codex-works">[6]</a>, <a href="https://www.anthropic.com/engineering/effective-context-engineering">[9]</a>.</p><p>In Anthropic&#8217;s long-running agent pattern, an initializer agent creates a script, a progress file, and a feature list. The coding agent reads the git logs and the progress file at the start of each session and updates the progress file as it works.</p><p>The beauty? There is no database or vector store. It is just the filesystem <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">[10]</a>.</p><blockquote><p><em>Here is a deep dive into memory:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;df376f07-24cf-47ea-b865-72794d074c9d&quot;,&quot;caption&quot;:&quot;Welcome to the AI Agents Foundations series: A 9-part journey from Python developer to AI Engineer. Made by busy people. 
For busy people.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Does Memory for AI Agents Work?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:110559689,&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;bio&quot;:&quot;Senior AI Engineer &#8226; Founder @ Decoding AI &#8226; Author @ LLM Engineer&#8217;s Handbook I ship AI products and teach you about the process.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-12-02T12:03:49.149Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!G5CM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6f2d58-f21f-4b49-b4f0-fb553fc28e36_1200x1200.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.decodingai.com/p/how-does-memory-for-ai-agents-work&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:180239220,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:76,&quot;comment_count&quot;:7,&quot;publication_id&quot;:1526003,&quot;publication_name&quot;:&quot;Decoding AI Magazine&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!k2ig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>Now that you have seen all the pieces, from planning and tools to sandboxes and memory, the question is what this means for how you build software.</p><h2>What&#8217;s Next</h2><p>We 
are witnessing a new way of building software. Instead of building traditional frontend and backend applications, software engineers will build harnesses as the next generation of production software. Harness engineering is merging software engineering with AI, moving the discipline one level up <a href="https://youtube.com/watch?v=zYerCzIexCg">[3]</a>.</p><p>Popular tools like Claude Code are just the beginning. In the long run, no company will want to depend entirely on proprietary harnesses. Even open-source solutions like OpenCode will not cover every specific use case.</p><p>Companies will inevitably build their own. As we experienced at ZTRON, custom systems and infrastructure are what finally make an agent work in production.</p><p>However, we must be honest about current limitations. Memory still breaks across long sessions. Validation loops still miss edge cases. Furthermore, orchestrating hundreds of parallel agents on shared codebases remains an open research problem.</p><p>Harness engineering is real engineering. Your harness becomes its own product with its own bugs, its own drift, and its own maintenance burden.</p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/agentic-harness-engineering/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/agentic-harness-engineering/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/agentic-harness-engineering?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/agentic-harness-engineering?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students saying <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><h2>References</h2><ol><li><p>LangChain. (2026, March 21). The Anatomy of an Agent Harness. LangChain Blog. <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">https://blog.langchain.com/the-anatomy-of-an-agent-harness/</a></p></li><li><p>Hashimoto, M. (2026, March 25). My AI Adoption Journey. Mitchell Hashimoto. <a href="https://mitchellh.com/writing/my-ai-adoption-journey">https://mitchellh.com/writing/my-ai-adoption-journey</a></p></li><li><p>Bouchard, L. (2026, March 25). What Harness Engineering Actually Means. What&#8217;s AI by Louis-Fran&#231;ois Bouchard.  <a href="https://youtube.com/watch?v=zYerCzIexCg">https://youtube.com/watch?v=zYerCzIexCg</a></p></li><li><p>Govindarajan, V. (2026, March 21). OpenClaw Architecture Part 1 - The Agent Stack. The Agent Stack. 
<a href="https://theagentstack.substack.com/p/openclaw-architecture-part-1-control">https://theagentstack.substack.com/p/openclaw-architecture-part-1-control</a></p></li><li><p>Abboud, M. (2026, March 17). How Coding Agents Actually Work: Inside OpenCode. Moncef Abboud. <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/</a></p></li><li><p>ByteByteGo. (2026, March 26). How OpenAI Codex Works. ByteByteGo. <a href="https://blog.bytebytego.com/p/how-openai-codex-works">https://blog.bytebytego.com/p/how-openai-codex-works</a></p></li><li><p>Anthropic. (2025, December 24). Building Effective AI Agents. Anthropic. <a href="https://www.anthropic.com/research/building-effective-agents">https://www.anthropic.com/research/building-effective-agents</a></p></li><li><p>Govindarajan, V. (2026, March 24). OpenClaw Architecture Part 3: Memory and State Ownership. The Agent Stack. <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory</a></p></li><li><p>Anthropic. (2025, October 22). Effective Context Engineering for AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/effective-context-engineering">https://www.anthropic.com/engineering/effective-context-engineering</a></p></li><li><p>Anthropic. (2026, March 25). Effective Harnesses for Long-Running Agents. Anthropic. 
<a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[From 12 Agents to 1]]></title><description><![CDATA[The mental model that prevents you from overengineering your next AI system.]]></description><link>https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide</link><guid isPermaLink="false">https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 26 Mar 2026 12:01:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gnrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tlx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tlx1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 424w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 848w, 
https://substackcdn.com/image/fetch/$s_!tlx1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1272w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tlx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png" width="1456" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1604167,&quot;alt&quot;:&quot;The complexity spectrum from workflows to single agents to multi-agent systems, with decision triggers between each stage.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The complexity spectrum from workflows to single agents to multi-agent systems, with decision triggers between each stage." title="The complexity spectrum from workflows to single agents to multi-agent systems, with decision triggers between each stage." 
srcset="https://substackcdn.com/image/fetch/$s_!tlx1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 424w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 848w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1272w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is 2026. People across the industry still mix up words like workflows, agents, tools, and multi-agent systems. Beyond terminology, this confusion has led to massively overengineered solutions.</p><p>Teams jump to multi-agent architectures because it sounds impressive and helps raise money. In reality, a simple workflow would have been faster to build, cheaper to run, and easier to debug. The result is bloated systems, wasted tokens, and debugging nightmares.</p><p>Our goal is to provide a clear mental model of what architecture to choose for your AI project: workflows vs. single agents vs. multi-agent systems.</p><p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Louis-Fran&#231;ois Bouchard&quot;,&quot;id&quot;:130571458,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f-b9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c5d976-f699-4595-8b6d-6ffa3e42a5e5_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;9e0714fe-b02b-42ac-bcc8-906b6ba10cb9&quot;}" data-component-name="MentionToDOM"></span> from Towards AI has been working on this exact problem with his clients and distilled his decision framework into two YouTube videos: <a href="https://www.youtube.com/watch?v=_rO2fv6tSsQ">Stop Overengineering: Workflows vs AI Agents Explained</a> and <a href="https://www.youtube.com/watch?v=iOpLKJYOvXs">From Workflows to Multi-Agent Systems: How to Choose</a>. He allowed me to take that framework and turn it into this article. 
Kudos to Louis-Fran&#231;ois!</p><p>This decision framework is a spectrum from simple to complex that tells you exactly what to build based on your actual constraints. The goal is to stay as far left on the complexity spectrum as possible while still solving your problem.</p><p>Here is what you will learn:</p><ul><li><p>The fundamental difference between an agent and a workflow.</p></li><li><p>How to use the complexity spectrum to make architecture decisions.</p></li><li><p>When to rely on simple workflows for predictable tasks.</p></li><li><p>Why a single agent with tools is often enough for dynamic problems.</p></li><li><p>The exact breaking points that justify moving to a multi-agent system.</p></li></ul><p>To apply this spectrum effectively, you must first define the terms. Here are the core misconceptions that lead to bad architecture decisions.</p><p><em>Before we continue, a quick word from the Decoding AI team.</em> &#8595;</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Go Deeper: Your Path to Agentic AI for Production</a></h2><p>Most engineers know the theory behind agents, context engineering, and RAG. What they lack is the confidence to architect, evaluate, and deploy these systems in production. 
The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI,</a> closes that gap across 34 lessons (articles, videos, and a lot of code).</p><p>By the end, you will have gone from <em>&#8220;I built a demo&#8221;</em> to <em>&#8220;I shipped a production-grade multi-agent system with evals, observability, and CI/CD.&#8221;</em> Three portfolio projects, a certificate to back them up in interviews, and a Discord community with direct access to industry experts like <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Louis-Fran&#231;ois Bouchard&quot;,&quot;id&quot;:130571458,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f-b9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c5d976-f699-4595-8b6d-6ffa3e42a5e5_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;3ba979eb-b2db-4d6f-85c0-178e50443138&quot;}" data-component-name="MentionToDOM"></span> and me.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>34 lessons from first principles to production. Learn about context engineering, workflows, agents, evals, and the design of AI systems.</em></figcaption></figure></div><p>Rated 4.9/5 &#11088;&#65039; by 300+ early students saying <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and that it is <em>&#8220;an excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start learning today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start learning today</span></a></p><div><hr></div><p>&#8595; <em>Now, back to the article.</em></p><h2>Clarifying the Confusion: Not Everything Is an Agent</h2><p>The first major misconception is that every LLM application is an agent. The key difference is autonomy. In a workflow, you control the flow.</p><p>You decide the steps and their order. In an agent, the model controls the flow. It decides what to do next based on the goal you give it.</p><p>If you can write down the exact sequence of steps in advance, you are building a workflow. 
You are not building an agent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5bE2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5bE2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5bE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A side-by-side comparison of a predetermined workflow and an autonomous agent.&quot;,&quot;title&quot;:&quot;A side-by-side comparison of a predetermined workflow and an autonomous agent.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A side-by-side comparison of a predetermined workflow and an autonomous agent." title="A side-by-side comparison of a predetermined workflow and an autonomous agent." srcset="https://substackcdn.com/image/fetch/$s_!5bE2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 1: A side-by-side comparison of a predetermined workflow and an autonomous agent, highlighting who controls the flow.</em></figcaption></figure></div><p>The second misconception is that tools are agents. A tool is a capability. It can be a calculator, a database query, a web browser, a validator, or an API call.</p><p>It can even be another LLM. An agent is the decision maker who chooses which tools to use and when.</p><p>If someone tells you they built a multi-agent system, but it is actually one model calling ten different APIs, that is not multi-agent. 
That is a single agent with ten tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hg-F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hg-F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM.&quot;,&quot;title&quot;:&quot;A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM." title="A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM." 
srcset="https://substackcdn.com/image/fetch/$s_!Hg-F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: A visual showing the distinction between tools and agents, with a central agent utilizing various tools.</em></figcaption></figure></div><p>This distinction matters. It defines how you architect, debug, and scale your system. It drives your core architecture choice between a workflow, a single agent with tools, or multiple agents.</p><h2>The Complexity Spectrum: A Mental Model for Architecture Decisions</h2><p>To make this architecture choice easier, we use a complexity spectrum. It is a slider going from the most control to the most autonomy. Your goal is to stay as far left as possible while still solving the problem.</p><p><strong>Level 1</strong> represents workflows. Here, you chain multiple LLM calls together in a predefined sequence. You control every step.</p><p><strong>Level 2</strong> represents a single agent with tools. The model makes decisions about what to do next. You have one decision maker and multiple capabilities.</p><p><strong>Level 3</strong> represents multi-agent systems. 
Here, you have multiple decision makers who need to coordinate with each other.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!idxn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!idxn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!idxn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!idxn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider.&quot;,&quot;title&quot;:&quot;A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider." title="A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider." 
srcset="https://substackcdn.com/image/fetch/$s_!idxn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!idxn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 3: A horizontal spectrum showing three levels of autonomy with increasing cost and complexity.</em></figcaption></figure></div><p>The core principle is straightforward. Move right on this spectrum only when you absolutely have to. Each step to the right increases cost, latency, and debugging complexity. More LLM calls mean more tokens, more traces to follow, and more places where things can go wrong.</p><p>In practice, start simple and escalate only where things break. Write a prompt first. Test it. Implement it with minimal complexity. Measure the results. Add what is missing.</p><p>If the model lacks information, add retrieval. If it needs calculations, add a tool. Only when you genuinely need autonomous decision-making should you reach for an agent. Even then, start with one.</p><p>The best AI systems are the simplest ones that reliably solve the problem. That usually means starting with workflows.</p><h2>When a Workflow Is the Right Answer</h2><p>Workflows are the right answer when your steps are known and stable. If the process is largely the same each time, regardless of input, a workflow is almost always the best choice.</p><p>Workflows win because they are predictable. They are easy to test because you can write unit tests for each step. They are easy to debug because you can trace exactly what happened when something goes wrong. They are also cheap because you are not spending tokens on having the model figure out what to do next.</p><p>Consider a support ticket system. A ticket comes in. You classify it. You route it to the right team. You draft a response from templates and context. 
You validate it against the policy.</p><p>Finally, you send it. Each step might involve an LLM call, but the model does not need to decide whether to classify before routing. That is always the order.</p><p>Building this as an agent adds overhead without adding capability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Za9K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Za9K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 424w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 848w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1272w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Za9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png" width="1200" height="489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:516854,&quot;alt&quot;:&quot;A horizontal flowchart showing six sequential steps of a support ticket workflow.&quot;,&quot;title&quot;:&quot;A horizontal flowchart showing six sequential steps of a support ticket workflow.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A horizontal flowchart showing six sequential steps of a support ticket workflow." title="A horizontal flowchart showing six sequential steps of a support ticket workflow." srcset="https://substackcdn.com/image/fetch/$s_!Za9K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 424w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 848w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1272w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: A horizontal flowchart illustrating the support ticket workflow with six sequential steps.</em></figcaption></figure></div><p>Do not underestimate workflows. They are not limited to simple sequential chains. They can include routing to pick different models based on input. They can use parallel execution with majority voting to aggregate answers. They can also use generator-evaluator loops where one LLM generates and another validates until quality criteria are met. They can even leverage tools in designs like the orchestrator-worker pattern. 
These patterns handle complex tasks without any agent overhead.</p><p>If you can write down the exact sequence of steps in advance, like a recipe, it is a workflow.</p><h2>When a Single Agent with Tools Wins</h2><p>Sometimes the order of work is not fixed. You genuinely cannot write down the steps in advance. This happens when the path changes depending on what you discover along the way.</p><p>Maybe the first API call fails, and you need to try an alternative. Maybe the retrieved data is incomplete, and you need clarification. This is what agents handle well.</p><p>When is an agent worth the risk? Anthropic offers a useful framework. Agents make sense when the task is complex enough to need autonomous decisions and delivers enough value to justify the extra cost.</p><p>Critically, the cost of errors and the cost of discovering those errors must be low. This is why AI coding agents are a good fit. A human reviews the code before it reaches production, so mistakes are cheap to fix. By contrast, a purchasing agent that accidentally buys the wrong hardware makes an expensive error. You must match your architecture to your error tolerance <a href="https://www.anthropic.com/engineering/building-effective-agents">[3]</a>.</p><p>The rule is to always start with one agent. A single agent with tools works best when tasks are tightly coupled and mostly sequential. It works well when global context matters, meaning step one affects step five. It is also ideal when you need fewer than twenty tools and face strict budget or latency constraints.</p><p>Take a marketing content platform from Louis-Fran&#231;ois&#8217;s client work at Towards AI. The client wanted AI-assisted content generation for emails, text messages, and promotional materials. Their initial specification called for a multi-agent setup with over a dozen specialized agents.</p><p>They wanted an orchestrator, a request analyzer, a content generator, and many others. 
On paper, it looked clean with specialists doing specialist work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z1td!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z1td!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!z1td!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z1td!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.&quot;,&quot;title&quot;:&quot;Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform." title="Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform." 
srcset="https://substackcdn.com/image/fetch/$s_!z1td!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!z1td!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 5: Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.</em></figcaption></figure></div><p>A single agent was the right call. The tasks were tightly coupled and sequential. The template choice affects the content.</p><p>Personalization depends on both content and contact data. Splitting this across multiple decision makers creates information silos and handoff errors. They did not need parallelism.</p><p>The flow was to plan, generate, validate, and fix if needed.</p><p>The key insight is that tools can be smart. A tool can have its own system prompt and use a different model. The validation tool can use its own LLM with instructions to catch errors.</p><p>The text message tool can treat character limits as deterministic engineering constraints instead of prompting problems. 
You get specialists, but you keep one brain to maintain context and make final decisions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vqMr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vqMr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vqMr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;An agentic loop diagram showing how a single agent plans, executes, and reflects.&quot;,&quot;title&quot;:&quot;An agentic loop diagram showing how a single agent plans, executes, and reflects.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An agentic loop diagram showing how a single agent plans, executes, and reflects." title="An agentic loop diagram showing how a single agent plans, executes, and reflects." srcset="https://substackcdn.com/image/fetch/$s_!vqMr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 6: An agentic loop diagram showing how a single agent plans, executes, and reflects.</em></figcaption></figure></div><p>This results in a system that is faster to build, cheaper to run, and easier to debug. You get the same capabilities without the coordination overhead.</p><h2>The Tool Count Problem: When One Agent Isn&#8217;t Enough</h2><p>As your tool list grows, tool selection gets harder. This is one of the main ways agent systems quietly break down. It is also one of the clearest signals that splitting into multiple agents might be worth it.</p><p>Every tool has a name, description, and schema that the model needs in context to use correctly. 
The more tools you add, the more of your context budget you burn before the agent even starts thinking about the actual task. You also have to add system instructions, few-shot examples, retrieved documents, and conversation history on top of that.</p><p>A single agent tends to work best with roughly 10 to 20 tools. Past that threshold, tool selection degrades. The agent has to choose among too many options in an already packed context.</p><p>This mechanism is known as context rot. LLM performance measurably degrades as context grows, well before hitting the advertised limit. Two forces drive this issue.</p><p>First, more context means more noise competing for the model&#8217;s attention. Second, models suffer from the &#8220;lost in the middle&#8221; effect. They tend to attend more to the beginning and end of their context, underweighting information in the middle.</p><p>As your tool schemas and instructions pile up, the model gets worse at picking the right tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YqbJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YqbJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The context window budget problem &#8212; comparing 10 tools vs 25 tools.&quot;,&quot;title&quot;:&quot;The context window budget problem &#8212; comparing 10 tools vs 25 tools.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The context window budget problem &#8212; comparing 10 tools vs 25 tools." title="The context window budget problem &#8212; comparing 10 tools vs 25 tools." 
srcset="https://substackcdn.com/image/fetch/$s_!YqbJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: The context window budget problem: more tools mean less room for actual task reasoning.</em></figcaption></figure></div><p>Managing context can reduce history and retrieved content, but not the tool schema load. Those definitions must always be there. The only approach that actually reduces how many tool definitions the model sees per call is splitting across agents.</p><p>If one agent sees only email tools and another sees only validation tools, each call stays smaller. Tool selection gets easier. Once you split tools across agents to keep calls small, you enter multi-agent territory.</p><h2>When Multi-Agent Is Actually the Right Call</h2><p>Multiple agents are justified by specific constraints, not by how impressive the architecture sounds. There are four legitimate reasons to go multi-agent. First, you need true parallelism where tasks are genuinely independent and run simultaneously.</p><p>Second, you face context overload where instructions and tools degrade performance. Third, you need modularity to connect with third-party agent systems you do not control. Fourth, you have hard separation requirements like security boundaries or sensitive data handling.</p><p>Consider the professional article generation system that Louis-Fran&#231;ois and I built as one of the projects for our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a>. 
We started with a single agent for research and writing but had to pivot because the two phases have fundamentally different needs.</p><p>The research phase is exploratory and dynamic. It needs flexibility and broad tool access across web search, video transcription, and document processing. The agent searches, reads, pivots based on what it finds, and iterates based on human feedback.</p><p>The writing phase is constrained and deterministic. It needs focused constraints, consistent style enforcement, and iterative refinement against fixed rubrics.</p><p>These agents communicate through explicit artifacts. The research agent produces a structured markdown file that the writer agent consumes as context. There is no complex runtime orchestration.</p><p>It is just a sequential handoff with a clear contract between them. Each agent has its own optimized context without the bloat of carrying the other&#8217;s tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8P9n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8P9n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!8P9n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8P9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The article generation multi-agent system with Research Agent and Writing Agent.&quot;,&quot;title&quot;:&quot;The article generation multi-agent system with Research Agent and Writing Agent.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The article generation multi-agent system with Research Agent and Writing Agent." title="The article generation multi-agent system with Research Agent and Writing Agent." 
srcset="https://substackcdn.com/image/fetch/$s_!8P9n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 8: The article generation multi-agent system with a Research Agent, a Writing Agent, and an artifact handoff.</em></figcaption></figure></div><p>If you do go multi-agent, we recommend combining the plan-and-execute pattern with the orchestrator-worker pattern. You do not want everyone talking to everyone. One orchestrator maintains the main context and delegates specific tasks to worker agents.</p><p>Then, it synthesizes the results. This prevents the information silos that kill multi-agent systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gnrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gnrt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Orchestrator-Worker pattern with delegation and result arrows.&quot;,&quot;title&quot;:&quot;The Orchestrator-Worker pattern with delegation and result arrows.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Orchestrator-Worker pattern with delegation and result arrows." title="The Orchestrator-Worker pattern with delegation and result arrows." 
srcset="https://substackcdn.com/image/fetch/$s_!Gnrt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 9: The Orchestrator-Worker pattern with no direct communication between workers.</em></figcaption></figure></div><p>Multi-agent systems can simplify individual contexts and enable specialization. However, they increase coordination costs. You will face more token usage, added latency, more failure points, and handoff complexity.</p><p>Only accept those costs when you hit a real constraint that simpler architectures cannot solve.</p><h2>To Wrap Up</h2><p>To build reliable AI applications, you must stay as far to the left on the complexity spectrum as possible while still solving your problem.</p><p>Keep these key takeaways in mind:</p><ul><li><p>Not every LLM application is an agent, and not every tool is an agent.</p></li><li><p>Always start with workflows because they are predictable, cheap, and testable.</p></li><li><p>Use one agent when the path cannot be predetermined, but keep the tool count manageable.</p></li><li><p>Move to multi-agent architectures only when you hit a real constraint like true parallelism or context overload.</p></li></ul><p>Each step to the right on the spectrum increases cost, latency, and debugging complexity. 
The simplest system that reliably solves the problem is always the best system.</p><blockquote><p>&#128161; If you want <strong>a step-by-step framework to help you decide what architecture to pick for your next project,</strong> Louis-Fran&#231;ois and the Towards AI team put together a <strong><a href="https://academy.towardsai.net/products/digital_downloads/agents-cheatsheet?ref=b3ab31">free cheatsheet</a></strong> that walks you through the decision process from workflows to multi-agent systems.</p></blockquote><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students saying <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[The AI Evals Roadmap I Wish I Had]]></title><description><![CDATA[From vibe checking to trusted agents in production]]></description><link>https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had</link><guid isPermaLink="false">https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 24 Mar 2026 12:04:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RTZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RTZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RTZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RTZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191463108?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RTZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</p><p>AI Evals is the topic most AI engineers know they should invest in, but do not know where to start. I remember struggling with this myself.</p><p>I did not know how to properly integrate evals into my app until I understood there are three core layers: optimization during development, regression testing before merging, and production monitoring on live traffic. Once that clicked, everything else fell into place.</p><p>I did not know how to build LLM judges and evaluators that I could actually trust and use. 
Every guide I found either hand-waved the details or dumped a generic &#8220;helpfulness&#8221; metric and moved on. Instead, I needed evaluators grounded in my actual business requirements.</p><p>I did not know how to gather custom datasets without wasting too much time. I tried generating hundreds of synthetic test cases up front, but the real unlock came from learning how to organically grow a high-quality dataset from production data, starting small and letting the error-analysis flywheel do the heavy lifting.</p><p>The information was scattered across blog posts, talks, and vendor docs. Most of it focused on isolated techniques without showing how everything connects. I built this series as the structured, end-to-end guide I wish I had.</p><p>This 7-lesson series breaks it all down from first principles. By the end, you will know how to integrate AI evaluations that actually track and improve your product's performance. No vibe checking required.</p><p>The series follows a natural progression. You start by understanding where evals fit. Then, you build the dataset.</p><p>Next, you design and validate the evaluators. Finally, you handle specialized domains like RAG and see how it all works in production.</p><p>You can read front-to-back for the full journey. Alternatively, jump to the lesson that matches your current pain point. 
Each lesson stands on its own but references the others.</p><p>Without more yada, yada, here are the 7 lessons of the series:<br><em>(Scroll down to find more about each lesson individually.)</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p><em>Everything is completely free, without any hidden costs, thanks to our sponsor, Opik</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. </p><p>We use Opik daily across our courses and AI products. 
Not just for observability, but as our <strong>end-to-end evaluation harness</strong>, all from the same platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yCWf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yCWf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 424w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 848w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1272w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yCWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png" width="1764" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1764,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191463108?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce72ecb-bb9c-42b8-98eb-e99d51a624d4_1784x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yCWf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 424w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 848w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1272w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25K spans/month free)</figcaption></figure></div><p>This series teaches you how to build evals from scratch (custom datasets, LLM judges, optimization loops, and production monitoring), while Opik gives you the platform to run everything at scale. </p><p><em>Here is how we use it:</em></p><ul><li><p><strong>Custom LLM judges</strong>: Build evaluators by defining your criteria, adding few-shot examples, and running them across hundreds of traces automatically.</p></li><li><p><strong>Run experiments, compare results</strong>: Test different prompts, models, or parameters from your AI app side by side. 
Opik scores each variant with your evaluators and shows you which one wins.</p></li><li><p><strong>Plug evaluators into production</strong>: The same LLM judges you design for offline testing run on live traces too. Set up alarms when scores drop below your threshold so you catch regressions before users do.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully <strong>open-source</strong> and works with custom code and with every popular AI framework or tool (<em>including OpenClaw</em>). You can also use the managed version for free (with 25K spans/month on their generous free tier):</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><h2>Lesson 1: Integrating AI Evals Into Your AI App</h2><p>To build a reliable system, you first need to know where evaluation fits into the development lifecycle.</p><p>Most teams start by <em>&#8220;vibe checking&#8221;</em> their AI app. They manually test a few inputs and eyeball whether the outputs look right. That works for the first version.</p><p>But the moment you start adding features, onboarding real users, or trying to improve existing capabilities, vibe checking collapses. 
This first article gives you the holistic map of where AI Evals fit, so you never feel lost again.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y_0d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" width="1200" height="1075" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1075,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The three core scenarios where evals matter: optimization during development, regression testing before merging, and production monitoring on live traffic.</p></li><li><p>The difference between guardrails and evaluators. 
Confusing them leads to gaps in your system.</p></li><li><p>The minimum viable tech stack required to start: a custom annotation tool and an LLMOps platform.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app&quot;,&quot;text&quot;:&quot;Go to Lesson 1&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app"><span>Go to Lesson 1</span></a></p><h2>Lesson 2: Build an AI Evals Dataset from Scratch</h2><p>Once you understand where evals fit, the next step is gathering the data required to measure performance.</p><p>You cannot evaluate what you cannot measure. You cannot measure without data. Most teams either skip this step entirely or fire off a generic prompt to create 100 test cases and call it done.</p><p>This article teaches the error analysis framework. It is a practical flywheel that turns 20-50 real production traces into a growing, high-quality evals dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HoRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HoRg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!HoRg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HoRg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!HoRg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The error analysis flywheel: sample traces, label manually, build evaluators iteratively, 
perform error analysis, and create specialized evaluators.</p></li><li><p>Why one <em>&#8220;benevolent dictator&#8221;</em> should own labeling consistency across your team.</p></li><li><p>How to graduate from generic to specialized evaluators as your understanding deepens.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis&quot;,&quot;text&quot;:&quot;Go to Lesson 2&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis"><span>Go to Lesson 2</span></a></p><h2>Lesson 3: Generate Synthetic Datasets for AI Evals</h2><p>Production traces alone have limits. You need traffic to get data, and that traffic rarely covers every scenario. What about before you have users?</p><p>What about rare failure modes you have not seen in production yet? Synthetic data solves the cold start problem and fills coverage gaps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FVJv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FVJv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 424w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 848w, 
https://substackcdn.com/image/fetch/$s_!FVJv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png" width="1200" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel" title="Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel" srcset="https://substackcdn.com/image/fetch/$s_!FVJv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 424w, 
https://substackcdn.com/image/fetch/$s_!FVJv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 848w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>Why you should generate only inputs, not outputs, and let your real app produce the outputs.</p></li><li><p>How to think in dimensions like persona, feature, scenario, and input modality to avoid mode collapse.</p></li><li><p>How to use tester agents to simulate multi-turn conversations.</p></li><li><p>The reverse workflow for RAG: generate questions from your knowledge base, not the other way around.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals&quot;,&quot;text&quot;:&quot;Go to Lesson 3&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals"><span>Go to Lesson 3</span></a></p><h2>Lesson 4: How to Design Evaluators</h2><p>You have the dataset. Now you need evaluators that can actually tell you whether your app is working. This is where most teams make their biggest mistake.</p><p>They grab a generic helpfulness metric off the shelf and call it done. 
This article teaches you how to design evaluators grounded in your actual business requirements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a1uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Designing evaluators for AI applications: from code-based checks to LLM judges.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Designing evaluators for AI applications: from code-based checks to LLM judges." title="Designing evaluators for AI applications: from code-based checks to LLM judges." srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The evaluation harness: the infrastructure that automates running evaluators across your dataset.</p></li><li><p>When to use fast, deterministic code-based evaluators versus flexible, nuanced LLM judges.</p></li><li><p>Common design mistakes.</p></li><li><p>Advanced designs for multi-turn conversations and agentic workflows.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures&quot;,&quot;text&quot;:&quot;Go to Lesson 4&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures"><span>Go to 
Lesson 4</span></a></p><h2>Lesson 5: How to Evaluate the Evaluator</h2><p>You built an evaluator. It says everything is great. But is it?</p><p>An evaluator that validates every output is worse than no evaluator at all. It gives you false confidence. This article teaches you how to validate your evaluator against human judgment and close the gap when they disagree.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1am-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The evaluator validation workflow&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The evaluator validation workflow" title="The evaluator validation workflow" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The iterative refinement loop: measure alignment, diagnose disagreements, adjust few-shot examples, and re-measure.</p></li><li><p>Dealing with non-determinism: why LLM judges give different answers on the same input, and how to stabilize them.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge&quot;,&quot;text&quot;:&quot;Go to Lesson 5&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge"><span>Go to Lesson 5</span></a></p><h2>Lesson 6: RAG Evaluation: The Only 6 Metrics You Need</h2><p>After mastering general evaluators, you can apply these principles to specific architectures like RAG.</p><p>RAG 
evaluation feels overwhelming because everyone proposes different metrics. But it does not have to be complicated. This article shows that there are exactly three variables in any RAG system: Question, Context, and Answer.</p><p>There are exactly six possible relationships between them. That is it. Every RAG metric maps to one of these six relationships.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gtpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." title="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The three RAG variables and six exhaustive relationships.</p></li><li><p>Tier 1: Retrieval metrics. 
If retrieval is broken, nothing else matters.</p></li><li><p>Tier 2: The three core RAG metrics you always need.</p></li><li><p>Tier 3: When core metrics cannot explain the failure.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/rag-evaluation-6-metrics-framework&quot;,&quot;text&quot;:&quot;Go to Lesson 6&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework"><span>Go to Lesson 6</span></a></p><h2>Lesson 7: Lessons from 6 Months of Evals on a Production AI Companion</h2><p>Theory and isolated metrics are useful. But the ultimate test is running this entire system on live user traffic.</p><p>The first six articles teach you how to build the system. This final article shows you what it looks like after six months of running it in production.</p><p>Written as a guest post by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Alejandro Aboy&quot;,&quot;id&quot;:22949723,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de90c745-7f5a-404e-b2d6-eaab9420dd98_881x881.png&quot;,&quot;uuid&quot;:&quot;1ba8f91f-628f-41c3-883d-003ee4b9e225&quot;}" data-component-name="MentionToDOM"></span>, Senior Data Engineer at Workpath, it shares the real lessons. 
It covers what worked, what failed, and what the team wishes it had known from the start.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0pKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0pKO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 424w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 848w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1272w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" width="616" height="672.2692307692307" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:616,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0pKO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 424w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 848w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1272w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The three observability problems most teams hit: falling for generic metrics, skipping manual annotation, and not treating AI agents as data products.</p></li><li><p>How to use Opik&#8217;s architecture, including traces, spans, threads, and prompt versioning, for production monitoring and evals.</p></li><li><p>How to reverse-engineer evaluation criteria from real traces instead of guessing upfront.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/behind-the-scenes-of-ai-observability&quot;,&quot;text&quot;:&quot;Go to Lesson 7&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability"><span>Go to Lesson 7</span></a></p><h2>How to Take the Course?</h2><p>After completing these seven articles, you will have the complete mental model for AI 
Evals. You will understand everything from strategy to production.</p><p>As the course is 100% free, with no hidden costs or registration required, taking it is a no-brainer.</p><p>Each lesson is a free article hosted on <a href="https://www.decodingai.com/t/ai-evals-and-observability">Decoding AI Magazine</a>.</p><p>Just open each lesson in the order listed below, and you are good to go:</p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>Each lesson will guide you through the required steps.</p><p>Enjoy!</p><h2>Now What?</h2><p>After completing these lessons, if you want the information to stick, you have to put everything into practice by building a cool project!</p><p>I am sorry to say there is no other way to make learning worthwhile. Pick one problem and get your hands dirty with a project.</p><p><strong>&#128161;</strong><em><strong> Want to share your work on my socials with my 140k+ audience?</strong> If you build a project you are excited about, I will be too. Trust me! I love seeing people build cool stuff. 
To share it, you can contact me <a href="https://www.pauliusztin.ai/contact">here</a>.</em></p><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. 
Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. 
</strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oSDm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Agentic AI Engineering Guide]]></title><description><![CDATA[The 6 critical mistakes that silently destroy agentic systems]]></description><link>https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 19 Mar 2026 12:03:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dUK-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have spent the past two years building and breaking AI agents in production. Along the way, I have seen the same patterns destroy systems over and over. This happens not because the models are bad, but because the system design is wrong.</p><p>Most agents fail silently. They work well in demos but drift unpredictably in production. Costs spike with no clear explanation.</p><p>Behavior becomes erratic, and every release feels risky. Ultimately, teams end up stuck in PoC purgatory, unable to ship, debug, or trust their own system.</p><p>The root cause is almost never the model. 
It is subtle system design mistakes that individually look small but compound into production disasters.</p><p>To fix this, together with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Louis-Fran&#231;ois Bouchard&quot;,&quot;id&quot;:130571458,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f-b9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c5d976-f699-4595-8b6d-6ffa3e42a5e5_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;f96a24cf-9c78-4536-a64d-226395c6b6bb&quot;}" data-component-name="MentionToDOM"></span>, we created a <strong>diagnostic framework for six specific mistakes that cause agentic systems to break in production.</strong> Each has a clear problem, a reason why it happens, and a proven fix. Once you know what to look for, you can trace most production failures back to one of these patterns.</p><p>The first and most common failure starts right at the input level, where engineers mishandle the context window.</p><h2>Mistake #1: Treating the Context Window as an Afterthought</h2><p>When something breaks, the instinct is to add more context. Engineers add more rules, more history, more tools, and more examples. The assumption is that if the model sees everything, it will behave better.</p><p>But this turns the context window into a dumping ground instead of a carefully scoped working memory. As the context grows, the model starts to ignore instructions and apply constraints inconsistently. It hallucinates more and drifts across runs.</p><p>Latency spikes and costs compound. Part of this is the lost-in-the-middle problem: models attend less reliably to information buried in the middle of a long context. 
Many teams respond by splitting one giant prompt into dozens of smaller ones.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C1jF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C1jF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 424w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 848w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1272w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C1jF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png" width="1096" height="549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1096,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C1jF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 424w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 848w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1272w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But that introduces its own problems, such as more LLM calls, higher latency, and harder debugging.</p><blockquote><p>&#128161; <em>Treat the context window as a scarce resource.</em></p></blockquote><p>Every LLM call should have one clearly scoped job. You must curate context aggressively by selecting, compressing, and pruning before every call. 
Move persistence into a memory layer.</p><p>The context window holds only what matters for the next decision, and everything else lives in memory, which you write to and read from continuously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dUK-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dUK-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dUK-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dUK-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As a rule of thumb, start with a single prompt. If it works, stop. If it fails, do not jump to agents.</p><p>Introduce a small number of specialized steps and tune until you hit the balance. Context engineering is about deliberate selection.</p><p>Once the context window is secure, the next trap is overengineering the architecture before the problem demands it.</p><h2>Mistake #2: Starting with Complicated Solutions</h2><p>You have a clear problem, so you immediately reach for multi-agent architectures or heavy frameworks. You build RAG pipelines, hybrid retrieval, multiple databases, or adopt new protocols like MCP. You do this not because the problem demands it, but because it feels like the right way to build serious AI.</p><p>Every layer adds a hidden tax. 
You get more dependencies, higher latency, higher costs, and harder debugging. Complexity compounds operational pain.</p><p>Teams end up spending months building infrastructure and shipping nothing.</p><p>At our startup, ZTRON, we built a multi-index RAG system. We had OCR pipelines, separate embedding pipelines, hybrid retrieval, and agentic RAG loops.</p><p>It worked, but simple queries took 10 to 15 seconds. Costs climbed, and debugging was a nightmare.</p><p>When we finally asked if we actually needed all this, the answer was no. Our data fit within modern context windows. We replaced agentic RAG with cache-augmented generation (CAG) for most workflows.</p><p>This gave us fewer LLM calls, lower latency, fewer errors, and an easier system to debug.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2ULn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2ULn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 424w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 848w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2ULn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2ULn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png" width="1024" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:314336,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2ULn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 424w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 848w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 
1272w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Start with the simplest solution that could work. Prove the core task works first. 
Only add memory, tools, retrieval, or multiple agents when the problem demands it.</p><p>Production-grade AI is built by engineers who ship simple systems first and scale complexity intentionally.</p><p>Earning complexity often means realizing that you do not need an agent at all, which brings us to the third mistake.</p><h2>Mistake #3: Building Agents When a Workflow Will Do</h2><p>Predictable tasks like data ingestion, summarization, or report generation need predictable execution. That is a workflow. Open-ended tasks like deep research or dynamic decision-making under uncertainty may need autonomy. That is where agents fit.</p><p>Yet most teams treat predictable problems as if they need agents. When you use an agent for a structured task, you pay for autonomy you do not need.</p><p>You get unpredictable behavior, variable latency, higher token usage, and inconsistent outputs. The system works 80% of the time and fails when it matters most.</p><p>Workflows and agents are not a binary choice. They sit on a spectrum known as the autonomy slider. 
More autonomy buys flexibility but costs predictability, cost control, and debuggability.</p><p>You must set the slider intentionally.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-tHI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-tHI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 424w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 848w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1272w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-tHI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png" width="1170" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-tHI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 424w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 848w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1272w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Adopt a workflow-first approach. Start with prompt chaining, routing, parallelization, or an orchestrator-worker pattern. Introduce agents only when the system must autonomously plan, explore unknown paths, or recover from failures dynamically.</p><p>For vertical AI agents, use a hybrid approach. Route known patterns to workflows and open-ended requests to agents.</p><p>Whether you use a workflow or an agent, you must handle the data they produce, which exposes a flaw in how engineers process outputs.</p><h2>Mistake #4: Fragile Parsing of LLM Outputs</h2><p>You ask the model for something structured, and it responds with something that looks structured. You parse it with regex, string splitting, or custom logic. It works in staging.</p><p>Then one day, a missing comma or different bullet style crashes production. LLMs are non-deterministic. 
Even with identical prompts, output can drift due to context changes, model updates, or variations in tool outputs.</p><p>Fragile parsing is a time bomb. Many teams respond by prompting the model to output JSON. That is better than free-form text, but it still is not a contract.</p><p>You still get missing keys, wrong types, and drifting nested fields.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-NP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-NP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 424w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 848w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6-NP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png" width="1175" height="1036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1036,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233234,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6-NP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 424w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 848w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Stop treating LLM outputs like text and start treating them like data. Define a schema, enforce it at generation time, validate at runtime, and fail fast when wrong. Use Pydantic as the bridge between probabilistic generation and deterministic code.</p><p>But only use structured outputs when structure is required. If you only need a plain string, accept a string. When you do need a schema, keep it shallow and minimal.</p><p>If you have secured your context, simplified your architecture, chosen the right autonomy, and enforced output schemas, you are ready to build an agent. However, many teams still fail by omitting actual planning from their loops.</p><h2>Mistake #5: Forgetting Agents Need Planning</h2><p>You give a model tools, let it pick one, feed the tool output back, and repeat. At a glance, it looks agentic, but it is just a workflow with randomness. 
The system is reacting to the last tool output, not driving toward a goal.</p><p>Without embedded planning, the loop cannot decompose tasks into meaningful steps. It cannot evaluate progress or choose next actions intentionally. The result is random behavior, unnecessary tool calls, infinite loops, and shallow reasoning.</p><p>Copying ReAct or Plan-and-Execute from blog posts without adapting them to your domain makes it worse.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UaKc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UaKc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 424w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 848w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1272w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UaKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png" width="1175" height="1014" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1014,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UaKc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 424w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 848w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1272w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You must embed planning into the loop. Before calling a tool, require a reasoning step. Ask what the goal is, what the next best action is, and what evidence you need.</p><p>Add progress checks and stop conditions like max steps, token budgets, and escalation when stuck. Make planning use-case specific, because generic ReAct is not a product. 
Tailor planning to your tools, data, constraints, and failure modes.</p><p>Even a well-planned agent will degrade over time if you do not measure its performance continuously.</p><h2>Mistake #6: Not Starting with AI Evals from Day Zero</h2><p>You build features without tracking how well your AI behaves. You have no tests, no evaluation metrics, and no defined success criteria. Every new feature is a gamble, and teams silently ship regressions.</p><p>AI systems do not fail all at once. They decay. A prompt change, a new tool, or a model upgrade causes subtle behavior shifts.</p><p>Without evals, nobody can tell whether a change made the system better or worse. Teams get stuck relying on vibe evals: manual, gut-feel testing that does not scale. Many teams think they are doing evals, but rely on generic scores like helpfulness or 1-to-5 star ratings.</p><p>A helpfulness score of 3.7 tells you nothing about what to fix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c0La!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c0La!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 424w, https://substackcdn.com/image/fetch/$s_!c0La!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 848w, 
https://substackcdn.com/image/fetch/$s_!c0La!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!c0La!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c0La!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png" width="1060" height="1010" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1010,&quot;width&quot;:1060,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248948,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c0La!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 424w, 
https://substackcdn.com/image/fetch/$s_!c0La!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 848w, https://substackcdn.com/image/fetch/$s_!c0La!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!c0La!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Use evals as your north star. Define task-specific, binary metrics tied to real system behavior and business requirements from day one. Use evals to drive the optimization flywheel.</p><p>Integrate evals into your development workflow to catch regressions before users do.</p><p>Recognizing these six mistakes is the first step to escaping PoC purgatory.</p><h2>Conclusion</h2><p>These six mistakes are not exotic edge cases. They are the exact patterns that repeatedly break real agentic systems. Individually, they look small, but in production, they compound into disasters.</p><p>Each of these mistakes deserves a deeper breakdown with real examples and production-tested fixes. That is why we turned them into a <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free 6-day email course</a></strong>. We cover one mistake per day, with the exact patterns and solutions we use in production.</p><p><strong>&#128161;</strong><em><strong> If you want the complete breakdown, sign up <a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">here</a>.</strong></em></p><p>Otherwise, see you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. 
By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students saying <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Why RAG Has Exactly 6 Failure Modes. 
No More, No Less.]]></title><description><![CDATA[A complete guide for evaluating your retrieval-augmented generation systems.]]></description><link>https://www.decodingai.com/p/rag-evaluation-6-metrics-framework</link><guid isPermaLink="false">https://www.decodingai.com/p/rag-evaluation-6-metrics-framework</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 17 Mar 2026 12:03:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gtpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><em><strong>RAG Evaluation: The Only 6 Metrics You Need</strong> &#8592; You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><em>Let&#8217;s get started.</em></p><div><hr></div><h2>RAG Evaluation: The Only 6 Metrics You Need</h2><p>In our previous article, we covered how to validate your AI judges. We measured agreement with human judgment and iterated until alignment was high. With a validated judge, you can deploy it with confidence.</p><p>However, one specialized challenge remains that general-purpose evaluators do not fully address. Evaluating RAG systems introduces a third variable: the retrieved context. 
With this new element comes a distinct set of failure modes requiring their own metrics.</p><p>I am currently building a financial personal assistant at the stealth AI startup I work for. The application relies heavily on RAG. It pulls financial data from Postgres and integrates with external services such as email, Customer Relationship Management (CRM) tools, and cloud drives.</p><p>When it came time to evaluate the system, building the dataset proved harder than choosing metrics. Fortunately, we had a domain expert on the team who manually tested the application from the start. We translated all of that Quality Assurance (QA) work into our AI evals dataset using the error analysis workflow from <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>.</p><p>RAG adds a practical difficulty to dataset building: each data sample requires the correct context to be loaded into the database. We solved this by coupling every test case with a Postgres SQL export.</p><p>This file contained documents, chunks, embeddings, and metadata. We injected it directly into the database, which effectively acted as a cache and bypassed the ingestion pipeline during evals.</p><p>Once the data was in place, implementing the core RAG metrics became straightforward. We used tools like <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> to run the evals and foundation models like Gemini Pro as the LLM judge. We had the context, the query, and the answer, which is everything you need.</p><p>What surprised me was that not every capability needed this level of dissection. For our report generation feature, we expect an exact format with specific values pulled from storage. 
Checking the final document against a ground truth served as a better proxy than tracing every retrieval step.</p><p><em>Sometimes assessing the destination matters more than checking the route.</em></p><p>RAG evaluation feels needlessly complex. Vendors have an incentive to make it difficult. Every framework ships with many metrics and a dashboard, making you feel like you need a PhD to know if your system works.</p><p>Underneath all the complexity, <strong>RAG systems have exactly three core components</strong>. These are the Question (Q), the retrieved Context (C), and the generated Answer (A). Evaluating each element given one of the other two yields 3 &#215; 2 ordered pairs, so there are <strong>exactly six possible relationships</strong> between them. When your RAG system fails, it breaks along one of these six relationships every single time.</p><p>The beauty of this framework is its exhaustive nature. There are no hidden variables.</p><p>You do not always need to evaluate all six individually. For core conversational features, you need the primary metrics because there are many silent failure modes. 
However, for structured output tasks, an end-to-end check against expected results can be sufficient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gtpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." title="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 1: The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer.</em></figcaption></figure></div><p><strong>Here is what you will learn in this article:</strong></p><ul><li><p>The only six relationships that exist in a RAG system.</p></li><li><p>How to evaluate your retrieval step before looking at generation.</p></li><li><p>The three core metrics every RAG application needs.</p></li><li><p>Advanced metrics for diagnosing subtle hallucinations.</p></li><li><p>How to match evaluation frequency and strictness to your domain.</p></li><li><p>How to collect and prepare the data your evaluators need.</p></li></ul><p><em>Before digging 
into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. </p><p>We use Opik daily across our courses and AI products. Not just for observability, but to design and run the exact RAG evaluators this article teaches. All from the same platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y-VS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y-VS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 424w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 848w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png" width="869" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ad35092-8407-4336-9b50-972f57252a3d_869x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:869,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92542,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191141901?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y-VS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 424w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 848w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p>This article shows you how to evaluate RAG systems. Opik gives you the harness to run those evaluations at scale. 
Here is how we use it:</p><ul><li><p><strong>Custom LLM judges with rubrics</strong> &#8212; Build the evaluators this article describes: define your criteria, add few-shot examples, and run them across hundreds of traces automatically.</p></li><li><p><strong>Run experiments, compare results</strong> &#8212; Test different prompts, models, or configurations side by side. Opik scores each variant with your evaluators and shows you which one wins.</p></li><li><p><strong>Plug evaluators into production</strong> &#8212; The same LLM judges you design for testing run on live traces too. Set up alerts that fire when scores drop below your threshold so you catch regressions before users do.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully <strong>open-source</strong> and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier):</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em> <em>Now, let&#8217;s move back to the article.</em></p><h2>The Only 6 RAG Evaluation Metrics That Can Exist</h2><p>Jason Liu clearly articulated the framework I am about to walk you through <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/">[1]</a>. 
Since I wrote the <a href="https://www.amazon.com/LLM-Engineers-Handbook-engineering-production/dp/1836200072/">LLM Engineer&#8217;s Handbook</a> two years ago, I have watched many RAG evaluation tools emerge. They overcomplicate everything with proprietary metric suites.</p><p>Through all of that, I had already internalized that only three variables matter in any RAG system. Testing the combinations between them is the only thing you should actually do. Jason Liu gave a clean, formal articulation to what I had in mind.</p><p>He nailed the structure and deserves the recognition for that.</p><p>Every RAG system has three variables. We define <code>Q</code> as the user&#8217;s question, <code>C</code> as the retrieved context, and <code>A</code> as the generated answer. We use the notation <code>X|Y</code> to mean the quality of <code>X</code> given <code>Y</code>.</p><p>There are <strong>exactly six relationships</strong> between these variables, one for each ordered pair:</p><ol><li><p><code>C|Q</code> (Context Relevance) asks if the retrieved context addresses the question. This measures your retriever, because if it pulls irrelevant passages, the generator cannot fix the issue. </p></li><li><p><code>A|C</code> (Faithfulness) checks if the answer sticks to what is in the context. This measures your generator: did the model hallucinate, or did it stay grounded in the documents? </p></li><li><p><code>A|Q</code> (Answer Relevance) verifies if the answer actually addresses the question. This is the end-to-end user experience metric. Even if the context is good and the answer is faithful, it must help the person asking. </p></li><li><p><code>C|A</code> (Context Support) checks whether the retrieved context contains everything needed to support every claim in the answer. This tells you if the provided information was sufficient. </p></li><li><p><code>Q|C</code> (Question Answerability) evaluates if the question can even be resolved with this context. 
This determines whether the system should attempt to reply at all.</p></li><li><p><code>Q|A</code> (Self-Containment) asks if someone can infer the original question from reading the answer alone. This measures whether the output provides enough background to stand on its own.</p></li></ol><p>This framework is exhaustive. Three components produce exactly six conditional relationships. There are no hidden factors.</p><p>Therefore, when your RAG system fails, one of these six metrics is broken.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j-9F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j-9F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!j-9F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses.&quot;,&quot;title&quot;:&quot;The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses." title="The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses." 
srcset="https://substackcdn.com/image/fetch/$s_!j-9F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 2: The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses (Retriever, Generator, or End-to-End).</em></figcaption></figure></div><p>Not all six relationships matter equally in every context. We organize them into three tiers. Let us start with retrieval metrics as the prerequisite foundation.</p><h2>Tier 1: If Retrieval Is Broken, Nothing Else Matters</h2><p>RAG is first and foremost a retrieval problem. If the search mechanism does not retrieve the right documents, nothing downstream can save you. The generator will either hallucinate or produce irrelevant answers based on whatever junk it received.</p><p>Before evaluating any of the six RAG relationships, you need to know if your retriever even works. You can use classic information retrieval metrics that measure how well you find relevant documents before generation starts. They are fast to compute and do not require LLM judges.</p><p>Hence, these measurements give quick feedback for tuning your retriever.</p><p>You must establish ground-truth labels to compute these metrics. For each query, you must know which text blocks are actually relevant. You can build this dataset using the reverse workflow presented in depth in <a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Article 3</a>.</p><p>As a quick recap, you start from your knowledge base of document chunks. 
Then, based on a set of closely related chunks, you generate realistic questions that can only be answered using that unique set of chunks.</p><p>Because the question derives from the source material, you know exactly which chunks should be retrieved. This gives you a perfectly aligned ground-truth triplet: (question, answer, context). Thus, it becomes straightforward to check whether your retriever actually surfaces the right information.</p><p>There are four main metrics. <code>Precision@K</code> measures the fraction of the top K retrieved chunks that are actually relevant. If your retriever returns 5 chunks but only 2 are useful, your precision is 40%. <code>Recall@K</code> asks: of all the relevant chunks that exist in your entire corpus, how many did your retriever actually find in the top K? If the database has 4 chunks that could answer the question but you only retrieved 2 of them, your recall is 50%.</p><p>In addition, Mean Average Precision (<code>MAP@K</code>) averages the per-query Average Precision across multiple queries, rewarding retrievers that consistently rank relevant chunks early. For a single query, Average Precision is computed by taking the precision at every position where a relevant item appears, then averaging those values. 
Here is a step-by-step example where the truly relevant items are A and C:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aVBN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aVBN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 424w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 848w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1272w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aVBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png" width="1456" height="315" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;table&quot;,&quot;title&quot;:&quot;table&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="table" title="table" srcset="https://substackcdn.com/image/fetch/$s_!aVBN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 424w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 848w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1272w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Average Precision for this query = (1.0 + 0.66) / 2 = 0.83. We only average the precision values at positions where a relevant item appeared (ranks 1 and 3). 
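</p><p>As a rough illustration, the precision, recall, and Average Precision calculations described above can be sketched in a few lines of Python. This is a minimal sketch, not the author&#8217;s implementation: the function names and chunk IDs are hypothetical, and Average Precision is normalized by the number of relevant hits found in the top K (matching the worked example), whereas some definitions divide by the total number of relevant chunks instead.</p>

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k == 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


def average_precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Mean of precision@rank over the ranks where a relevant chunk appears."""
    hits = 0
    precisions = []
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical ranking: A and C are relevant, B is not (labels are illustrative).
retrieved = ["A", "B", "C"]
relevant = {"A", "C"}

print(round(precision_at_k(retrieved, relevant, 3), 2))          # 0.67
print(round(recall_at_k(retrieved, relevant, 3), 2))             # 1.0
print(round(average_precision_at_k(retrieved, relevant, 3), 2))  # 0.83
```

<p>These functions only compare chunk IDs against the ground-truth labels, so they run without an LLM judge. The last value printed is the Average Precision for this single query.</p><p>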
<code>MAP@K</code> then takes this score and averages it across all your queries.</p><p>Finally, Mean Reciprocal Rank (<code>MRR@K</code>) focuses on the position of the first relevant match. If the first relevant chunk appears at position 3, the reciprocal rank is 1/3; if it appears at position 1, it is 1/1. Higher is better.</p><p>Use these for daily development. These indicators are great for tuning embeddings and chunk sizes, while also being perfect for A/B testing retrieval strategies. No LLM is needed, making the process cheap and fast.</p><p>These numbers tell you if the search phase works, as illustrated in Image 3. The six RAG relationships tell you if the whole system functions properly, meaning you need both.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EMDz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EMDz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EMDz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EMDz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Retrieval metrics applied to a financial assistant query.&quot;,&quot;title&quot;:&quot;Retrieval metrics applied to a financial assistant query.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Retrieval metrics applied to a financial assistant query." title="Retrieval metrics applied to a financial assistant query." 
srcset="https://substackcdn.com/image/fetch/$s_!EMDz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: Retrieval metrics applied to a financial assistant query &#8212; checking whether the retriever surfaces the right chunks.</em></figcaption></figure></div><p>With retrieval confirmed working, you can evaluate the generation step. Let us look at the three core RAG relationships that every system needs.</p><h2>Tier 2: The Three RAG Metrics You Always Need</h2><p>These three metrics directly assess how well your RAG system functions. Most evaluation frameworks prioritize these specific measurements. They map to the three most critical of the six relationships.</p><p>First, we have <strong>Context Relevance</strong> (<code>C|Q</code>). This checks if the retrieved context actually addresses the question&#8217;s information needs. Like the Tier 1 metrics, it measures your retriever, but it looks only at the relationship between the context and the question, without requiring ground-truth labels.</p><p>Suppose we have a query about recent dividend payouts from Q4. A good example is when the retrieved data contains the user&#8217;s dividend payment records from Q4, which passes. A bad example is when the system returns only general information about how dividends work and their tax implications, which fails.</p><p>This represents the most common RAG failure mode. In our financial assistant, this often happened when the retriever pulled educational content instead of actual account data.</p><p>Second, we have <strong>Faithfulness</strong> (<code>A|C</code>). This asks if the answer restricts itself to claims that can be verified from the provided context. 
Hence, it measures whether your generator hallucinates or not.</p><p>In our use case, a good example is when the source contains a CRM record showing a client meeting scheduled for portfolio rebalancing. If the response states exactly that, it passes. A bad example happens when the model adds hallucinated agenda items like tax-loss harvesting strategies, resulting in a failure.</p><p>Third, we have <strong>Answer Relevance</strong> (<code>A|Q</code>). This checks if the output directly addresses the specific query from the prompt. This serves as the end-to-end user experience metric.</p><p>A good example is when a person asks how much their investments grew last month. The reply provides the specific percentage change and absolute dollar amount. A bad scenario is when the text discusses general market performance without mentioning the actual account.</p><p>We measure all three metrics using LLM judges as designed in <a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">Article 4</a> and validated in <a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">Article 5</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gNmS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gNmS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 424w, 
https://substackcdn.com/image/fetch/$s_!gNmS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gNmS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three core RAG metrics illustrated with financial assistant examples.&quot;,&quot;title&quot;:&quot;The three core RAG metrics illustrated with financial assistant examples.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three core RAG metrics illustrated with financial assistant examples." title="The three core RAG metrics illustrated with financial assistant examples." 
srcset="https://substackcdn.com/image/fetch/$s_!gNmS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: The three core RAG metrics illustrated with financial assistant examples &#8212; each measures a critical relationship between Question, Context, and Answer.</em></figcaption></figure></div><p>These three metrics cover the most common failure modes. For specific domains and failure cases, we have to dig deeper into the remaining three metrics.</p><h2>Tier 3: When the Core Metrics Can&#8217;t Explain the Failure</h2><p>The last three metrics provide deeper diagnostic insights usually required in sensitive domains or use cases.</p><p>First, we have <strong>Context Support</strong> (<code>C|A</code>). This checks if the retrieved context contains all the information needed to fully back every claim in the answer. While this sounds similar to Faithfulness (<code>A|C</code>), the direction is different. Faithfulness asks: <em>&#8220;Did the answer deviate from the context?&#8221;</em> You look at the answer and check whether it introduced claims that aren&#8217;t in the context. Context Support asks: <em>&#8220;Was the context sufficient to support the answer?&#8221;</em> You look at the context and check whether it actually contains everything the answer needs.</p><p>Here is a concrete example. Suppose the answer says your total Q4 dividend income was $2,340 across 5 holdings, with the largest payout from MSFT at $890. Now look at the context: it only contains the total dividend amount of $2,340. The per-holding breakdown is nowhere in the retrieved documents. The context was insufficient. It had the total but not the details. 
The LLM produced a plausible breakdown, but the context could not support it. This is low Context Support.</p><p>Second, we have <strong>Question Answerability</strong> (<code>Q|C</code>). This asks if the user&#8217;s question can even be resolved with the given context.</p><p>Suppose the user asks about crypto portfolio performance, but the retrieved documents only contain equity data. This makes the question unanswerable. The system should refuse rather than guess. This metric is important when you want to validate that your agent answers with &#8220;I don&#8217;t know&#8221; instead of confidently hallucinating an answer due to insufficient context.</p><p>In our financial assistant, this was important because some queries can only be resolved if the agent has permissions to access the right external tool first.</p><p>Third, we have <strong>Self-Containment</strong> (<code>Q|A</code>). This checks if someone can infer the original question from the answer alone.</p><p>An answer stating that your portfolio&#8217;s return is 12.4% stands alone. An answer stating just 12.4% does not. 
Prioritize this metric when outputs are forwarded via email, logged in CRM notes, or read without the original conversation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TV-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TV-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TV-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Faithfulness vs Context Support &#8212; two types of hallucination detection.&quot;,&quot;title&quot;:&quot;Faithfulness vs Context Support &#8212; two types of hallucination detection.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Faithfulness vs Context Support &#8212; two types of hallucination detection." title="Faithfulness vs Context Support &#8212; two types of hallucination detection." srcset="https://substackcdn.com/image/fetch/$s_!TV-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: Faithfulness catches obvious hallucinations where the answer deviates from context. Context Support catches the subtler case where the context was insufficient, and the LLM silently filled the gaps.</em></figcaption></figure></div><p>You now know what to measure at each tier. Two questions remain. How often should you run each one? Which metrics deserve the most attention for your specific domain?</p><h2>Matching Frequency and Strictness to Your Domain</h2><p>Each tier maps to a different running frequency depending on how fast and cheap you can run the evaluations. 
It also depends on their overall impact on the system.</p><p><strong>Start with Tier 1</strong> on a daily basis. Implement fast retrieval metrics for everyday development and to tune your retrieval component. These are the cheapest to execute as they do not require LLM judges.</p><p>Furthermore, they provide quick feedback cycles. Use them for the improvement flywheel with synthetic data from day zero, focusing on these basic indicators before moving to more complex approaches.</p><p><strong>Move to Tier 2</strong> on a weekly basis. Implement the three primary RAG connections. These core metrics directly assess how well your system functions.</p><p>Use LLM-based grading for a more nuanced assessment of these interactions.</p><p><strong>Incorporate Tier 3</strong> on a monthly basis. Introduce advanced metrics when you need deeper insights. Run a full evaluation to identify prompts that the application should not be answering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P9iE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P9iE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!P9iE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P9iE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The tiered evaluation cadence.&quot;,&quot;title&quot;:&quot;The tiered evaluation cadence.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The tiered evaluation cadence." title="The tiered evaluation cadence." 
srcset="https://substackcdn.com/image/fetch/$s_!P9iE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 6: The tiered evaluation cadence, from cheapest and fastest at the top to deepest and most expensive at the bottom.</em></figcaption></figure></div><p>Here, we focused only on RAG-related measurements. However, the same cadence applies to any type of AI evals layer. You could implement Tier 1 checks in your CI/CD pipeline to execute on each commit.</p><p>You can trigger Tier 2 evaluations manually before merging your code from your feature branch. Finally, manually run Tier 3 metrics before major releases and strategic decisions.</p><p>There is another dimension to consider when choosing metrics for your use case: the good old domain.</p><p><strong>Different domains require emphasis on distinct indicators</strong>. What matters most depends on the severity of the use case.</p><p><strong>High-severity domains</strong> include financial, medical, and legal applications. In these fields, Faithfulness (<code>A|C</code>) and Context Support (<code>C|A</code>) are non-negotiable because every claim must be traceable. Answerability (<code>Q|C</code>) is also critical, meaning the application must refuse rather than guess.</p><p>Thus, you want precision over recall, which is the exact profile we use for our financial assistant.</p><p><strong>Medium-severity domains</strong> include customer support and technical documentation. Answer Relevance (<code>A|Q</code>) leads here, as the output must be helpful and correct. 
Answerability (<code>Q|C</code>) helps you know when to hand off to a human, and you generally want recall over precision in retrieval.</p><p><strong>Low-severity domains</strong> include research, writing, and content generation, where synthesis and creative reframing are expected. Context Relevance (<code>C|Q</code>) and Answer Relevance (<code>A|Q</code>) are primary, while Faithfulness (<code>A|C</code>) thresholds remain lower. The generator is supposed to add value beyond the raw text.</p><p>Therefore, you want high recall in the search phase to cast a wide net across sources.</p><p>You know what to measure, when, and what to prioritize. None of this works without the right data and infrastructure. Let us explore how to build the evaluation harness.</p><h2>Building the RAG Evaluation Harness</h2><p>RAG evaluation requires inputs, outputs, and the retrieved context. You need the full triplet.</p><p>The most common blind spot involves treating RAG testing like any other LLM assessment. Teams measure the final reply&#8217;s quality, but never capture what background data the generator actually worked with. Without that information, half the metrics in this article are impossible to compute.</p><p>Next, you should ground your RAG dataset in real human judgment. In our financial assistant, we had a domain specialist on the team who manually QA&#8217;d the application from the start. They ran queries, checked whether the right data was retrieved, and verified that the replies made sense.</p><p>We translated all of that manual work into our AI evals collection using the error analysis workflow from <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>.</p><p>Also, building RAG datasets introduces a unique difficulty. Each test case needs the right documents, chunks, and embeddings available in the database. 
Otherwise, the search tool has nothing to work with.</p><p>Running the full ingestion pipeline for every evaluation run is slow and introduces variability.</p><p>We solved this by coupling each data point with a Postgres SQL export containing the relevant documents, chunks, embeddings, and metadata. We loaded this file directly into the storage system for each test, effectively creating a context cache. This made the process fast and reproducible.</p><p>We inject the records, run the query, evaluate the trace, reset the environment, and move to the next item. Image 7 illustrates these data preparation paths.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1KvQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1KvQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two paths for building your RAG evaluation dataset.&quot;,&quot;title&quot;:&quot;Two paths for building your RAG evaluation dataset.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two paths for building your RAG evaluation dataset." title="Two paths for building your RAG evaluation dataset." 
srcset="https://substackcdn.com/image/fetch/$s_!1KvQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: Two paths for building your RAG evaluation dataset: manual expert QA (Article 2) and the synthetic reverse workflow (Article 3). Both require proper context preparation.</em></figcaption></figure></div><p>If you do not have enough production data or expert QA samples, you can create synthetic RAG evaluation sets. Use the reverse workflow from <a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Article 3</a> by starting from your knowledge base. Use an LLM to extract key facts from specific passages.</p><p>Then, formulate realistic user questions that can only be answered using that exact text block.</p><p>Because each question is derived directly from the source passage, the input, expected retrieval context, and expected reply are aligned by construction. This gives you a complete ground-truth triplet. Furthermore, this technique is especially powerful for bootstrapping coverage across your entire document corpus.</p><p>Include unanswerable queries in your collection. Do not formulate only prompts that the application should resolve correctly. Also create scenarios where the context deliberately lacks the information needed, forcing the agent to refuse or say it does not know.</p><p>Without these negative examples, your test suite is one-sided, and your evals will reward always attempting a reply. Adding them directly exercises the Answerability metric from Tier 3.</p><p>Next, if your RAG architecture integrates with external services, the retrieval path is not just a vector database search. Your agent needs to decide which tool to call first. 
Should it query Postgres, search the CRM, or check the user&#8217;s email?</p><p>The best retrieval metrics will not help if your model invokes the wrong data source entirely.</p><p>In our financial assistant, this was critical. A query about a client meeting should hit the CRM, not the transaction database. Therefore, we added code-based checks for tool selection alongside our RAG metrics.</p><p>Another important trick is to run separate graders per RAG dimension. Do not ask one LLM to evaluate context relevance, faithfulness, and answer relevance in a single prompt. Isolated checks with dimension-specific rubrics produce more consistent results than a unified evaluation.</p><p>Finally, log specific data for every trace using tools such as <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>. Record retrieved chunks to see what the generator had access to. If faithfulness fails, check whether the reply used information that was not provided. Track metadata such as document IDs and scores, because when context relevance fails, you need to know which items ranked highest. This is the same observability infrastructure described in <a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Article 1</a>.</p><h2>Next Steps</h2><p>RAG evaluation is not complex. It is just three variables and six relationships. When your RAG system fails, one of these specific links is broken.</p><p>Fix that exact issue and ignore the complexity theater.</p><p>Start with Tier 1 retrieval checks as daily prerequisites. Add Tier 2 primary indicators weekly. Extend to Tier 3 when specific failure modes demand it.</p><p>Ultimately, match your evaluation priorities to your domain&#8217;s risk profile.</p><p>Next time you see a vendor dashboard with dozens of RAG metrics, map each one back to the six relationships. If an indicator does not clearly measure one of the core links, it is noise. 
Drop it and focus on what actually diagnoses failures.</p><p>Next up is the <a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">final piece in the series</a>. We will explore real-world lessons from months of running evals on a production AI companion. We will discuss what worked, what failed, and what the team would do differently.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. <strong>Here&#8217;s what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a> </p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a>  </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a>  </p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><em><strong>RAG Evaluation: The Only 6 Metrics You Need</strong> &#8592; You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. 
By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. 
</strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" 
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Liu, J. (2025, May 19). 
There Are Only 6 RAG Evals. jxnl.co. <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/">https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/</a></p></li><li><p>Grace, M., Hadfield, J., Olivares, R., &amp; De Jonghe, J. (2026, January 09). Demystifying Evals for AI Agents. Anthropic Engineering Blog. <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Why Most RAG Tutorials Fail You]]></title><description><![CDATA[How a senior architect learned RAG from scratch, the production way]]></description><link>https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide</link><guid isPermaLink="false">https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide</guid><dc:creator><![CDATA[Priya]]></dc:creator><pubDate>Thu, 12 Mar 2026 12:02:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_xRY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Paul:</strong> Today, the stage belongs to <a href="https://substack.com/@pmarwa">Priya</a>, a Senior Software Architect who&#8217;s spent years shipping production-scale systems at Publicis Sapient and Tesco.</p><p>She&#8217;s deconstructing RAG with a production-first mindset, skipping the theoretical demos to focus on building for architectural reliability.</p><p>This one is packed. 
Let&#8217;s get into it &#128064; &#8595;</p><div><hr></div><h2>The &#8220;Deer in the Headlights&#8221; Moment</h2><p>I&#8217;ve navigated many shifts since the early days of the web, from monoliths to cloud-native microservices and SOAP to REST. But the AI wave felt different. I found myself in a &#8220;deer in the headlights&#8221; moment, completely unsure of what to learn or even where to start. Should I dive into neural network math, focus on model training, or master context engineering (AI moves quickly)?</p><p>Eventually, the path became clear when I realized my real value lay in applying AI to complex business problems. In an enterprise context, that led me straight to RAG. It isn&#8217;t just about the model, it&#8217;s also about the robust system you build around it. It felt like a return to architecture, a concrete problem to solve where using AI could make a profound difference. However, as I started building, I hit a second roadblock...</p><h2>Why Most RAG Tutorials Didn&#8217;t Help Me Learn RAG</h2><p>Most RAG tutorials optimize for one outcome: getting an answer out of a model as quickly as possible. That&#8217;s fine for demos. It&#8217;s a poor way to learn how RAG systems behave in production.</p><p>I&#8217;m not new to building production software. I&#8217;ve spent decades shipping and maintaining systems where debuggability, operability, and failure modes matter. What&#8217;s new to me here is RAG, not the discipline of building systems that survive contact with reality. While learning RAG, I wanted to internalize the constraints I&#8217;d eventually face anyway: inspectability, idempotent ingestion, debuggable retrieval, and controllable generation. 
That meant resisting framework-managed chains and owning the control flow early, even if it slowed me down.</p><p>This post documents how I&#8217;m teaching myself RAG by building a production-grade system in deliberate phases, using frameworks as utilities rather than architecture.</p><p>That approach was heavily influenced by Paul Iusztin&#8217;s <em><a href="https://www.decodingai.com/p/my-ai-production-tech-stack">From 100+ AI Tools to 4: My Production Stack</a></em>, especially this idea:</p><p><em>AI frameworks are good utilities. They should not dictate the architecture or control flow of your system.</em></p><p>That became my guiding principle.</p><p><em>Before we continue, a quick word from the Decoding AI team.</em> &#8595;</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Go Deeper: Your Path to Agentic AI for Production</a></h2><p>Most engineers know the theory behind agents, context engineering, and RAG. What they lack is the confidence to architect, evaluate, and deploy these systems in production. 
The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI,</a> closes that gap across 34 lessons (articles, videos, and a lot of code).</p><p>By the end, you will have gone from <em>&#8220;I built a demo&#8221;</em> to <em>&#8220;I shipped a production-grade multi-agent system with evals, observability, and CI/CD.&#8221;</em> Three portfolio projects, a certificate to back it up in interviews, and a Discord community with direct access to industry experts and Paul Iusztin.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Qcm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>34 lessons from first principles to production &#8212; context engineering, workflows, agents, evals, and deployment</em></figcaption></figure></div><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and <em>&#8220;an excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Start learning today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Start learning today</span></a></p><div><hr></div><p>&#8595; <em>Now, back to the article.</em></p><h2>The Architecture</h2><p>Before diving into the details, here is the end-to-end architecture of the RAG system. This diagram serves as a reference model, and we&#8217;ll walk through each layer and the production considerations that shaped these choices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xRY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xRY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 424w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 848w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1272w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_xRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png" width="1200" height="657" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_xRY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 424w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 848w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1272w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Phase 1. Ingestion: Own the Data</h3><p><strong>What I built:</strong> a pipeline that discovers files &#8594; loads documents &#8594; normalizes text &#8594; chunks &#8594; embeds &#8594; stores everything in Postgres.</p><p>From experience building production systems, ingestion pipelines are where complexity quietly accumulates if they lack idempotence, i.e., the ability to safely re-run without ending up in an inconsistent state, such as duplicate data, partial updates, or stale artifacts. The same applies to traceability, i.e., the ability to trace exactly what happened, to which data, and when. I assumed the same risks would apply here.</p><p>What I didn&#8217;t account for was how the nature of debugging would differ so vastly from what I was used to in the past. 
It wasn&#8217;t just about emitting log and error information in the right places anymore. A bad chunk doesn&#8217;t throw an exception; it quietly surfaces as a hallucinated answer three steps later.</p><h4>Single database, many uses</h4><p>Instead of introducing a separate vector database, I used <strong>Postgres + pgvector</strong>. Chunks, embeddings, and metadata live together. That decision buys me a lot:</p><ul><li><p>I can inspect ingestion results with plain SQL</p></li><li><p>I can join vectors with relational metadata</p></li><li><p>I can reproduce retrieval behavior outside the application</p></li></ul><p>That inspectability matters when you&#8217;re still learning, and having less infrastructure to maintain pays off long after.</p><h4>Frameworks as utilities, not architecture</h4><p>I use LangChain&#8217;s document loaders (<em>TextLoader, PyMuPDFLoader</em>) for format handling. But the control flow is explicit and mine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">for file_info in discover_files(folder_path):

    raw_docs = load_document(file_info.file_path)

    clean_text = normalize_text(raw_docs)

    chunks = chunk_text(clean_text, chunk_size=512)

    embeddings = await embed_chunks(chunks)

    await save_to_postgres(file_info, chunks, embeddings)</code></pre></div><p>Each step is isolated. Each step can be logged, rerun, or replaced independently. When something breaks, I debug <em>my</em> code, not a framework-managed chain. For instance, during my initial tests, I used PyPDFLoader for the document loading phase. When I inspected the chunking, I realized the chunks had incorrect spaces due to kerning (e.g., &#8220;P r e - C h u n k&#8221;). This was easy to address just by swapping PyPDFLoader for PyMuPDFLoader, which handled the complex layouts better.</p><h4>Idempotence and safe re-runs</h4><p>I mentioned earlier that pipelines break down when they lack idempotence. Here&#8217;s how I addressed it.</p><p>Every file&#8217;s contents are hashed. If the content hash matches what&#8217;s already stored, the file is skipped: no wasted compute, no risk. If the content has changed, its old chunks and embeddings are completely removed before the new ones are written. The database never ends up with a mix of old and new states for the same source.</p><p>During development, this makes experimentation safe. For instance, I can tweak chunk sizes, swap embedding models, or change preprocessing logic, then re-run the entire pipeline and trust the result. Without this, every experiment would mean manually cleaning up the database first, or worse, not realizing stale data was still there, silently affecting retrieval quality. More importantly, though, in production, it makes the pipeline resilient to failure. If ingestion crashes halfway through, I can simply restart it. Files already processed are skipped, and the rest pick up where they left off. No manual cleanup, no risk of corruption.</p><h3>Phase 2. 
Retrieval: Make Failure Visible</h3><p>Retrieval is where the quality of your results is determined, which makes debugging discipline more important than clever algorithms.</p><p><strong>What I built:</strong> query preprocessing &#8594; embedding &#8594; similarity search &#8594; optional reranking.</p><p>Most LangChain tutorials show you how to build a RAG pipeline as a &#8220;chain,&#8221; i.e.,  a single call where the framework retrieves context, sends it to the LLM, and returns the answer. I chose not to do that. Consistent with the architecture philosophy above,  retrieval is an explicit phase, and every step in the retrieval pipeline is an explicit function call I control and invoke directly:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">async def retrieve(query: str, top_k: int = 5, rerank: bool = False):

    processed_query = preprocess_query(query)

    query_embedding = embed_query(processed_query)

    results = await search_similar_chunks(query_embedding, top_k)

    if rerank:

        results = rerank_results(query, results, top_k)

    return RetrievalResponse(query=query, results=results)</code></pre></div><p>Keeping retrieval explicit makes failures legible. When an answer is wrong, I can tell whether the issue came from:</p><ul><li><p>query preprocessing</p></li><li><p>embedding quality</p></li><li><p>recall</p></li><li><p>ranking</p></li></ul><p>Because vectors live in Postgres, I can reproduce retrieval behavior with SQL alone.</p><p>That inspectability is invaluable when learning.</p><h4>Retrieval &#8594; Generation boundary</h4><p>This is the boundary where many RAG systems start to blur failure modes. But retrieval and generation are fundamentally different problems.</p><p>Retrieval, including reranking, decides <strong>what context is allowed to reach the model</strong>. It is a search problem. It fails by missing relevant information (poor recall) or burying it in noise (poor precision).</p><p>Generation decides <strong>what the model does with the provided context</strong>. It is a reasoning problem. It fails by misinterpreting the context, hallucinating facts, or ignoring instructions.</p><p>Keeping this boundary explicit helps you immediately diagnose which problem you actually have. If the answer is wrong but the context contains the truth, you fix the prompt. 
If the context is missing the truth, you fix the search.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kNfZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kNfZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 424w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 848w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1272w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png" width="1032" height="193" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:193,&quot;width&quot;:1032,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kNfZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 424w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 848w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1272w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Phase 3. Generation: Treat the LLM as an Unreliable Dependency</h3><p><strong>What I built:</strong> context formatting &#8594; LLM invocation with retries &#8594; response assembly.</p><p>LLMs fail in ways traditional dependencies don&#8217;t. They are non-deterministic, occasionally unavailable, and can return plausible but wrong outputs. 
I treated the model as an unreliable dependency from day one, something to isolate, observe, and swap, not something to trust implicitly.</p><h4>Swappable LLMs via a factory</h4><p>A simple factory pattern makes experimentation cheap:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def get_llm():

    if provider == &quot;openai&quot;:

        return OpenAIChat(...)

    if provider == &quot;gemini&quot;:

        return GeminiChat(...)

    raise ValueError(f&quot;Unsupported LLM provider: {provider}&quot;)</code></pre></div><p>Switching providers requires only configuration changes. Call sites don&#8217;t care. This is exactly where frameworks like LangChain shine: as an abstraction layer. They handle the messy API differences between providers so that OpenAIChat and GeminiChat can expose the same interface to your application. Using them here makes swapping models trivial, without letting them dictate your control flow.</p><h4>Explicit orchestration over chains</h4><p>Generation is intentionally step-by-step:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">async def generate_answer(request):

    retrieval_response = await retrieve(query=request.query, ...)

    context_text = format_docs(retrieval_response)

    messages = get_rag_prompt().format_messages(

        context=context_text,

        question=request.query,

    )

    llm = get_llm()

    ai_message = await _invoke_llm_with_retry(llm, messages)

    return GenerateResponse(answer=ai_message.content, ...)</code></pre></div><p>I avoided using the LangChain Expression Language (LCEL) or its runnable abstractions to build this flow. While powerful, they can hide what&#8217;s happening. Explicit orchestration is easier to debug, instrument, and reason about, especially while learning. This resonated with me even more since I&#8217;m used to a hands-on approach where I can write code and truly understand how the logic flows.</p><h4>Retries are operational, not semantic</h4><p>LLM calls fail for mundane reasons: transient network issues, provider-side throttling, or brief outages. I treat those as operational failures, not model behavior, and handle them explicitly.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from tenacity import retry, stop_after_attempt, wait_exponential

@retry(

    stop=stop_after_attempt(3),

    wait=wait_exponential(multiplier=1, min=1, max=10),

)

async def _invoke_llm_with_retry(llm, messages):

    return await llm.ainvoke(messages)</code></pre></div><p>Retries don&#8217;t make the model <em>correct</em>; they make the system resilient.</p><h3>Phase 4. Serving: Thin Adapters, Shared Core</h3><p><strong>What I built:</strong> two interfaces over the same RAG core: a REST API and an MCP server.</p><p>In many RAG implementations, the retrieval logic is tightly coupled to the web framework (e.g., defined inside a FastAPI route). This makes it hard to test the logic in isolation or reuse it in different contexts (like a CLI or an evaluation script).</p><p>Instead, I treated my RAG system as a standalone library. The core function &#8216;<em>generate_answer</em>&#8217; takes a pure Pydantic object and returns one. It knows nothing about HTTP, headers, or JSON.</p><p>This architecture allowed me to treat serving as a set of <strong>thin adapters</strong>.</p><h4>Adapter 1: REST API (FastAPI)</h4><p>The REST adapter serves traditional software systems that need deterministic access to the retrieval layer. This includes web applications, backend services, internal tooling, evaluation pipelines, and automation jobs. These are environments where the caller decides exactly when and how the capability should be invoked.</p><p>The web layer itself does no <em>extra</em> work. It merely deserializes JSON, calls the core, and serializes the result.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@router.post(&quot;&quot;, response_model=GenerateResponse)

async def query(request: GenerateRequest) -&gt; GenerateResponse:

    return await generate_answer(request)</code></pre></div><h4>Adapter 2: MCP Server (Capability Interface for Tool-Using LLMs)</h4><p>Exposing the same core through the Model Context Protocol (MCP) transforms my RAG pipeline from an application-bound feature into a standardized capability.</p><p><strong>MCP standardizes how capabilities are exposed to tool-using LLMs</strong>, regardless of whether the caller is a chat assistant, a coding copilot, or an autonomous agent.</p><p>I&#8217;m used to abstraction via careful refactoring, and it didn&#8217;t take long to understand that MCP was just another way of achieving this in the context of AI.</p><p>MCP-compatible clients such as Claude Desktop, Cowork, or Cursor can connect to the server and invoke the <em>query_rag</em> tool directly. This allows the underlying LLM to ground its responses in private data without requiring custom integrations, plugins, or connector logic.</p><p>Direct tool access is useful, but the MCP interface becomes far more valuable as agents adopt <a href="https://agentskills.io/home">skills</a> to carry out knowledge work and other multi-step tasks. For example, a &#8220;Market Research Skill&#8221; might combine web search, financial data lookup, and document retrieval. Exposed as an MCP tool, my RAG system becomes a standardized building block that these skills can include in their workflows without custom code.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@mcp.tool()

async def query_rag(query: str, top_k: int = 5, rerank: bool = True) -&gt; dict:

    request = GenerateRequest(query=query, top_k=top_k, rerank=rerank)

    response = await generate_answer(request)

    return response.model_dump()</code></pre></div><p>Both interfaces share the same core logic, thus avoiding duplication. Serving is an adapter problem, not a RAG problem.</p><h4>Data lineage &amp; traceability</h4><p>Traceability isn&#8217;t new. Long before LLMs, production systems relied on lineage and identifiers to make failures debuggable. LLM non-determinism makes that discipline more important, not less.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KxBN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KxBN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 424w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 848w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1272w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KxBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png" 
width="367" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:367,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KxBN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 424w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 848w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1272w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Debugging RAG systems almost always means reasoning backward, from an answer, to retrieved chunks, to embeddings, and finally to source files.</p><p>In practice, this meant persisting identifiers at every step. Retrieved results carry chunk IDs forward. Generation logs include the IDs of the chunks used as context. When an answer looks wrong, I can trace it deterministically back to its source.</p><p>Without lineage, every bad answer looks like a model problem. With it, failures become diagnosable and fixable.</p><h4>Vendor-neutral observability</h4><p>This isn&#8217;t RAG specific. It&#8217;s the same observability discipline I&#8217;ve applied in other production systems. 
I deliberately kept it vendor-neutral, following a pattern I&#8217;ve used before to keep core logic decoupled from tooling.</p><p>Beyond tracing execution paths, tools like <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> let me reason about operational realities: latency per phase, token usage, and cost per request. Being able to see which model was invoked, how many tokens were consumed, and where time was spent turns performance and cost from assumptions into measurable signals.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def track(name: str = None, phase: Phase = None):

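    def decorator(func):
        # Tag the trace with its pipeline phase so runs can be filtered later.
        tags = [f"phase:{phase.value}"] if phase else []

        # opik.track is the only vendor-specific call; business code sees just @track.
        @opik.track(name=name, tags=tags)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        return wrapper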

    return decorator</code></pre></div><p>If I ever switch observability tools, business code doesn&#8217;t change.</p><h2>What I&#8217;m Exploring Next</h2><p>Next steps include:</p><ol><li><p>Adding durable workflow orchestration (DBOS or Prefect)</p></li><li><p>Implementing systematic evaluation for retrieval quality and faithfulness</p></li><li><p>Exploring more advanced retrieval patterns</p></li></ol><p>Each will be added deliberately, one constraint at a time.</p><h2>Closing Thoughts</h2><p>Moving from keyword search to semantic and multimodal understanding is a massive leap in how we solve problems. While this technology introduces an ambiguity that contrasts with the deterministic systems I&#8217;ve built before, the incredible advantages and sheer problem-solving power it offers make the challenge truly exciting.</p><p>Building RAG this way slowed me down, deliberately.</p><p>What I have now is a system I can inspect, rerun, and reason about when something goes wrong. For me, that&#8217;s a better foundation than a faster demo.</p><p>I&#8217;m still learning RAG. But I&#8217;m learning it with the same instincts that shaped the rest of my career: make systems observable, design for failure, and own the control flow before adding abstraction.</p><p><strong>Code:</strong> <a href="https://github.com/CalvHobbes/rag-101">https://github.com/CalvHobbes/rag-101</a></p><p><strong>Inspired by:</strong> <em><a href="https://www.decodingai.com/p/my-ai-production-tech-stack">From 100+ AI Tools to 4: My Production Stack</a></em> by <a href="https://substack.com/@pauliusztin">Paul Iusztin</a></p><p>See you next time.</p><p><a href="https://substack.com/@pmarwa">Priya</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Our LLM Judge Passed Everything. 
It Was Wrong.]]></title><description><![CDATA[Align your evaluator with human judgment, or don't trust it at all.]]></description><link>https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge</link><guid isPermaLink="false">https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 10 Mar 2026 12:01:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1am-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a><strong> </strong></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a> </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals </a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><strong>How to Evaluate the Evaluator</strong>  &#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h2>How to Evaluate the Evaluator</h2><p>Your evaluators are running. They produce Pass or Fail verdicts on your agent&#8217;s outputs. But one open question remains: how do you know if those verdicts are correct?</p><p>While building Brown, a writer agent I developed with the Towards AI team for our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>, I set up an LLM judge to verify generated articles. I wanted to check the expected structure, idea flow, and content against a golden dataset. 
 I ran it on a batch of traces, and the scores seemed reasonable. Then I manually compared the traces against the judge&#8217;s verdicts and realized it was fixating on the wrong things.</p><p>It scored 0 (Fail) when an article used bullet points instead of H3 headers, which was perfectly acceptable for that section. It also scored 0 when the agent used a different transition phrase than the few-shot examples, penalizing creativity when we wanted flexibility. Worse, it scored 1 (Pass) when paragraphs did not flow smoothly into each other, completely overlooking a real quality issue we cared about.</p><p>We had to iterate on the judge until it reflected what we actually valued. Anthropic reports a similar pattern, seeing eval scores jump from 42% to 95% after fixing grading bugs and ambiguous task specifications <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>. The agent was fine all along; the evaluator was broken. That experience crystallized something for me: <strong>eval metrics you cannot trust are worse than no metrics at all.</strong></p><p>Unvalidated evals create false confidence. You see green dashboards, assume quality is fine, and stop looking. You push broken outputs because the numbers said they were good, and you hear about problems from frustrated users instead of your test suite. Worst of all, you cannot tell which evaluations are wrong: if 10-20% of the signals are incorrect, they hide silently among the correct ones and contaminate every decision built on those scores.</p><p>Your evaluator is itself an AI model that makes binary predictions, so it needs a test set, metrics, and mapped failure modes like any other model.</p><p>LLM judges are also inherently non-deterministic, and they can hallucinate, carry biases, and drift. Alignment with human evaluators varies widely by task: some teams achieve high agreement after careful iteration, while others struggle to break 70% on subjective criteria. 
The gap between your judge and reality could mean hundreds of bad signals across a thousand evaluations, which you will not know without validation <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1am-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The evaluator validation workflow&quot;,&quot;title&quot;:&quot;The evaluator validation workflow&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The evaluator validation workflow" title="The evaluator validation workflow" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 1: The evaluator validation workflow: Comparing judge verdicts against expert labels with classification metrics.</em></figcaption></figure></div><p>Here is what you will learn to solve this problem:</p><ul><li><p>Partitioning your labeled data to prevent data leakage.</p></li><li><p>Quantifying agreement using standard classification metrics.</p></li><li><p>Systematically closing the gap between your judge and domain experts.</p></li><li><p>Dealing with the randomness of LLMs.</p></li></ul><p>To start this process, we first need to structure our dataset correctly.</p><p><em>But before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Automated 
Agent Optimization Using Your Data (Sponsored)</a></h2><p>This AI Evals &amp; Observability series is brought to you by <strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong>, the LLMOps open-source platform used by Uber, Netflix, Etsy, and more.</p><p>We use Opik daily across our courses and AI products. Not just for observability, but now to <strong>automatically optimize our agents&#8217; prompts</strong> using the same datasets and metrics we already have in the platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ecvh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ecvh!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 424w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 848w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1272w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif" width="800" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;optimization_studio_walkthrough.mp4 [video-to-gif output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="optimization_studio_walkthrough.mp4 [video-to-gif output image]" title="optimization_studio_walkthrough.mp4 [video-to-gif output image]" srcset="https://substackcdn.com/image/fetch/$s_!Ecvh!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 424w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 848w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1272w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You are learning how to build diverse synthetic datasets to evaluate your AI app. But once you have those datasets and metrics, why stop at measuring quality?<strong> <a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik&#8217;s agent optimizer</a></strong> closes the loop. It uses <strong>your</strong> <strong>eval dataset to automatically improve your prompts</strong>. 
Here is why we love it:</p><ul><li><p><strong>Same datasets, zero extra setup</strong> &#8212; Opik&#8217;s optimizer reuses the exact datasets, metrics, and tracing you already have. <a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Quick start guide</a>.</p></li><li><p><strong>Six optimization algorithms</strong> &#8212; Choose from strategies like HRPO (our favorite), which performs root-cause analysis on failures and proposes targeted fixes, or evolutionary optimization to explore diverse prompt structures. <a href="https://www.comet.com/docs/opik/agent_optimization/algorithms/overview?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">See all algorithms.</a></p></li><li><p><strong>No-code Optimization Studio</strong> &#8212; For quick iterations, run optimization directly from the <a href="https://www.comet.com/docs/opik/agent_optimization/optimization_studio?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Optimization Studio UI</a>. Start from your prompt, pick your dataset, choose an algorithm, and watch Opik test prompt variations against your metrics in real time.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully open source and integrates with OpenAI, Anthropic, Gemini, and 100+ providers. 
<em><strong>Start optimizing your agents for free:</strong></em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Automated agent optimization guide&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Automated agent optimization guide</span></a></p><div><hr></div><p><em>&#8595;</em> <em>Now, let&#8217;s move back to the article.</em></p><h2>Structuring Your Data for Validation</h2><p>You already have your ground truth. As explained in <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a> and <a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Article 3</a> of the series, your domain expert labeled each trace as Pass or Fail with a critique. Those labels are the reference standard your automated judge must match. If the task is highly subjective, consider having multiple people label the same examples to discover the agreement ceiling, but for most teams, a single expert is sufficient.</p><p>Now you need to partition that labeled data correctly. Why? Because building and validating the judge on the same examples is like grading your own exam. To avoid biased results, measure error only on data the judge has not seen, so split your dataset into three sets: train, dev, and test <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>.</p><p>The train set takes 60% of the data and holds the examples your evaluator learns from. These examples go into the few-shot prompt, inform the rubric, and set the standard for what Pass and Fail look like. 
 The dev set takes 20% of the data, acting as your iteration sandbox. Run the judge here, check where it disagrees with the expert, adjust the prompt, and repeat to refine the system. Finally, the test set takes the remaining 20% and must be kept locked until you are done iterating. Use it only at the end, once the LLM judge is aligned with the expert on the dev set. This gives you an unbiased final score on data that the evaluator has never seen.</p><p>The 60/20/20 split is a good starting point. As your dataset grows, keep the train set small enough that the few-shot examples do not bloat the judge&#8217;s context window, and move the extra data to the dev and test splits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c8_b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c8_b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!c8_b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c8_b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Data partitioning for evaluator development&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Data partitioning for evaluator development" title="Data partitioning for evaluator development" srcset="https://substackcdn.com/image/fetch/$s_!c8_b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!c8_b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: How to partition your labeled data into train, dev, and test sets for evaluator development.</em></figcaption></figure></div><p>In practice, 100 labeled examples mean 60 powering the prompt, 20 for tuning, and 20 for the final honest check. Aim for at least 100 labeled examples to get stable metrics. 
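</p><p>As a rough sketch of the split described above, the snippet below partitions labeled rows 60/20/20 in plain Python. The <code>stratified_split</code> helper, the <code>label</code> field, and the 70/30 Pass/Fail toy dataset are hypothetical names and values chosen for illustration. Stratifying by label is an extra assumption, added so that each split keeps a similar Pass/Fail mix.</p>
<pre><code>
```python
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Split labeled rows into train/dev/test with a similar label mix in each."""
    # Group rows by their expert label so every split sees both classes.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, dev, test = [], [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_train = round(len(label_rows) * ratios[0])
        n_dev = round(len(label_rows) * ratios[1])
        train += label_rows[:n_train]
        dev += label_rows[n_train:n_train + n_dev]
        test += label_rows[n_train + n_dev:]  # remainder, locked until the end
    return train, dev, test


# Hypothetical golden dataset: 70 Pass and 30 Fail traces.
rows = [{"id": i, "label": "Pass" if i % 10 in range(7) else "Fail"} for i in range(100)]
train, dev, test = stratified_split(rows)
print(len(train), len(dev), len(test))
```
</code></pre>
<p>With these 100 rows, the script prints <code>60 20 20</code>. Fixing the seed keeps the test set identical across iterations, which is what lets you keep it locked.</p><p>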
Below 50 examples, your numbers become too noisy to act on. Also watch out for class imbalance. If 90% of your traces are Pass and only 10% are Fail, rebalance the classes, either by synthetically generating more Fail examples or by removing Pass examples, until the two are roughly even.</p><p>With the data properly structured, let us quantify how well your judge actually agrees with the expert.</p><h2>Measuring Alignment With Human Judgment</h2><p>Your judge outputs Pass or Fail for each trace, which means you are building a binary classifier. You are using an LLM instead of another kind of model, but ultimately, it&#8217;s still just a classifier.</p><p>Thus, you need to quantify the performance of the LLM judge against the golden dataset we split in the previous section. Standard classification metrics give you this visibility.</p><p>The <strong>confusion matrix</strong> shows four possible outcomes, with Pass as the positive class. True Positive (TP) means both judge and expert say Pass, agreeing the output is good. True Negative (TN) means both say Fail, agreeing the output is bad. False Positive (FP) means the judge says Pass, but the expert says Fail, letting a bad output through. 
False Negative (FN) means the judge says Fail, but the expert says Pass, meaning the judge was overly harsh.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_JaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_JaU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_JaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The confusion matrix for evaluator validation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The confusion matrix for evaluator validation" title="The confusion matrix for evaluator validation" srcset="https://substackcdn.com/image/fetch/$s_!_JaU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 3: The four outcomes when comparing judge verdicts against expert labels.</em></figcaption></figure></div><p>Combining TP, TN, FP, and FN yields <strong>three fundamental metrics</strong>:</p><ol><li><p><strong>Accuracy</strong> is the overall agreement rate, calculated as <code>(TP + TN) / total</code>. If the judge matches the expert on 170 out of 200 traces, that is 85% accuracy. This is useful when Pass and Fail are roughly balanced, but it is highly misleading when they are not.</p></li><li><p><strong>Precision</strong> measures how trustworthy the Pass verdicts are, representing the fraction of judge-approved traces that the expert also labeled Pass. You calculate it as <code>TP / (TP + FP)</code>. 
If the judge approves 50 articles and the expert disagrees on 8, precision is <code>42 / 50 = 84%</code>, meaning that when the judge says the output is good, you can generally believe it.</p></li><li><p><strong>Recall</strong> measures how many actual Passes the judge finds out of all the traces the expert labeled Pass. You calculate it as <code>TP / (TP + FN)</code>. If 60 articles are genuinely good but the judge only approves 48, recall is <code>48 / 60 = 80%</code>, meaning the judge finds most quality output but still misses some.</p></li></ol><p>Finally, the <strong>F1 score</strong> aggregates the two into a single balanced number, the harmonic mean of precision and recall, calculated as <code>2 &#215; (Precision &#215; Recall) / (Precision + Recall)</code>. Use it when false positives and false negatives matter equally. The right F1 target depends on what you are measuring. With Brown, we accepted around 60% for subjective metrics like style, but demanded over 90% for objective ones like article structure. As a general rule, aim for an F1 above 70%.</p><p>These metrics seem simple enough. But there is a common trap most teams fall into when their datasets are not balanced.</p><h2>When High Scores Hide Real Failures</h2><p>The trap is easiest to see through two examples.</p><p>First, let&#8217;s assume Brown generates 80 articles: 70 are correct, and 10 are broken. Your judge labels every single one as Pass. Accuracy sits at <code>70 / 80 = 87.5%</code>, which looks reasonable, but the judge never caught a single failure, making it completely useless.</p><p>Let us look at a second example in more depth. Out of 80 generated articles, 60 are genuinely well-structured, while 20 have real problems like missing sections or disconnected paragraphs. The judge correctly approves 55 of the good ones and wrongly rejects 5. Of the 20 broken articles, it catches only 4 and lets 16 slip through. 
That gives us TP=55, FN=5, FP=16, TN=4.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QBBm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QBBm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QBBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Accuracy vs precision and recall breakdown&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Accuracy vs precision and recall breakdown" title="Accuracy vs precision and recall breakdown" srcset="https://substackcdn.com/image/fetch/$s_!QBBm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: Decent overall accuracy can disguise a judge who barely detects real failures.</em></figcaption></figure></div><p>Overall accuracy reads <code>(55 + 4) / 80 = 73.75%</code>, which looks reasonable. But Fail-class recall is just <code>TN / (FP + TN) = 4 / (16 + 4) = 20%</code>, meaning the judge misses 80% of structural failures. The lesson here is to always check precision and recall on the minority class. 
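</p><p>To make the arithmetic concrete, here is a minimal sketch that recomputes the 80-article example from its four confusion-matrix counts. The <code>judge_metrics</code> helper and its key names are hypothetical, and the sketch assumes every denominator is non-zero.</p>
<pre><code>
```python
def judge_metrics(tp, fn, fp, tn):
    """Agreement metrics between judge verdicts and expert labels (Pass = positive)."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "pass_precision": precision,
        "pass_recall": recall,
        "pass_f1": 2 * precision * recall / (precision + recall),
        "fail_recall": tn / (tn + fp),     # share of expert Fails the judge caught
        "fail_precision": tn / (tn + fn),  # share of judge Fails the expert agrees with
    }


# Counts from the 80-article example: TP=55, FN=5, FP=16, TN=4.
for name, value in judge_metrics(tp=55, fn=5, fp=16, tn=4).items():
    print(f"{name}: {value:.2%}")
```
</code></pre>
<p>On these counts, the Pass-class F1 comes out at roughly 84% even though Fail-class recall is only 20%, so a healthy-looking score on the majority class does not rule out the problem.</p><p>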
If those numbers are low, enrich your few-shot prompts with more failure examples, focusing particularly on the subtle cases where individual paragraphs look fine but do not connect fluidly <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>.</p><p>Now that you know what to measure and what to watch out for, let us walk through the process of systematically improving your judge.</p><h2>Closing the Gap Between Judge and Expert</h2><p>This is the core workflow for making your judge reliable. Start with 10-20 few-shot examples from the train set to build your initial judge, and run it against the dev set while leaving the test set untouched. Compute precision, recall, and F1, then review every case where the judge and the expert disagree. When those disagreements reveal real patterns, expand your few-shot examples to cover them, then re-run and re-measure until the dev set alignment hits your target threshold.</p><p>Remember that your few-shot examples translate to input tokens, which translate to extra costs. Ideally, you want to keep your few-shot examples as few, yet as diverse, as possible while maximizing alignment on your dev split.</p><p>Lock the test set for the final check. 
Only run the judge on the test set after you stop iterating on the dev set, giving you an uncontaminated measurement of real performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xx5O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xx5O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xx5O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The judge refinement cycle&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The judge refinement cycle" title="The judge refinement cycle" srcset="https://substackcdn.com/image/fetch/$s_!xx5O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 5: The judge refinement cycle: build, measure, diagnose disagreements, adjust, and repeat until alignment is sufficient.</em></figcaption></figure></div><p>Expect at least 3 rounds of iteration. If you are still far below target after 10 iterations, the task may require human judgment that no prompt can replicate. Start by hand, as manual prompt refinement teaches you where your judge&#8217;s reasoning diverges from the expert&#8217;s. Carefully studying each disagreement is the most informative signal you have, and once your labeled dataset is large and high-quality enough, you can explore automated prompt optimization tools.</p><p>Read the LLM Judge critiques instead of just looking at metrics, as critiques tell you whether the judge was wrong or the expert missed something. 
As highlighted by Anthropic, you shouldn&#8217;t take eval scores at face value until someone digs into the details and reads the judge&#8217;s critiques <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>Once your judge passes validation, put it to work for regression testing, optimization, and production monitoring, as explained in <a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Article 1</a>.</p><p><strong>What if the agreement stays low?</strong> If your agreement is still low after 10 rounds, there are three things to check. First, your few-shot examples might be too narrow. As you keep sampling production traces from your observability platform (e.g., <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>), revisit error analysis, as explained in <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>, to find the specific patterns where the judge fails and add those to the few-shot examples. With Brown, our initial examples were too uniform, and adding subtle structural failures immediately improved alignment.</p><p>Second, the rubric might lack specificity. Asking if the article is well-written invites interpretation, while asking if it contains well-defined paragraphs, transitions, and metaphors leaves less room for ambiguity. Sharpen the criteria.</p><p>Third, if the task itself is too subjective, consider accepting a lower F1 score. For example, with Brown, style adherence was inherently subjective, so we accepted a lower F1 there while holding structure to &gt;90%. The idea is to adapt your acceptance threshold to the nature of each business metric you are tracking.</p><p>Even with strong agreement, there is one more challenge. Both your judge and your agent introduce randomness into every run. 
Let&#8217;s see how to handle that.</p><h2>Dealing With Non-Determinism</h2><p>Randomness comes from two directions: the judge produces different scores on the same input, and the agent itself takes different paths on each run. You need to address both to build a stable evaluation pipeline.</p><p>The easiest and most powerful fix is to scale the dataset, as larger datasets smooth out noise. Aim for enough examples in each class that a few misclassifications do not swing your metrics wildly. A good starting point is a minimum of 50 samples per class.</p><p>Another easy win (though not necessarily a cheap one) is to pick the strongest available model, such as the latest version of Claude Opus or Gemini Pro, because the judge should be at least as capable as the system it evaluates <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>. Also require reasoning before the verdict: structure the prompt with Chain of Thought (CoT) so the judge walks through each criterion before delivering Pass or Fail. This step-by-step analysis produces more consistent scores and better alignment with human judgment <a href="https://arize.com/llm-as-a-judge/">[1]</a>.</p><p>Let the judge abstain by giving it an &#8220;Unknown&#8221; option for when it lacks enough information to decide, because forcing a binary Pass/Fail on ambiguous cases generates false positives you cannot distinguish from genuine passes <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>To further stabilize the judge, you can compute a significance threshold by running the evaluation 3-5 times and measuring the variance between the runs. With Brown, this was essential because writing is subjective, and running the evaluator 5 times told us how large the run-to-run noise really was. A 3% metric shift across runs was noise, but 10% meant something actually changed. 
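</p><p>One possible way to turn repeated runs into a threshold is sketched below. The <code>noise_band</code> and <code>is_real_change</code> helpers, the baseline scores, and the choice of two standard deviations as the cutoff are all assumptions made for illustration, not the procedure used with Brown.</p>
<pre><code>
```python
from statistics import mean, stdev


def noise_band(run_scores, k=2.0):
    """Return (mean, band), where band is k sample standard deviations across runs."""
    return mean(run_scores), k * stdev(run_scores)


def is_real_change(baseline_runs, new_score, k=2.0):
    """Treat a new score as meaningful only if it falls outside the noise band."""
    center, band = noise_band(baseline_runs, k)
    return abs(new_score - center) > band


# Hypothetical F1 scores from running the same judge 5 times on the same dev set.
baseline = [0.81, 0.83, 0.80, 0.82, 0.84]
print(noise_band(baseline))
print(is_real_change(baseline, 0.83))  # small shift, inside the band
print(is_real_change(baseline, 0.92))  # large shift, outside the band
```
</code></pre>
<p>With these made-up baseline scores, a move to 0.83 stays inside the band while a move to 0.92 falls outside it. A wider or narrower band (a different <code>k</code>) is a judgment call that depends on how costly a false alarm is for you.</p><p>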
Without this, you are chasing random fluctuations.</p><p>On the agent side, treat the agent as a black box and evaluate the destination, not the route, as agents can reach the same outcome through different strategies. Brown might outline first and then write, or draft everything and then restructure, and both paths can produce a strong article. Score the final output against your quality criteria, not the intermediate steps <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>For the agent, measure reliability across multiple runs using <code>pass@k</code> and <code>pass^k</code>. <code>pass@k</code> tracks whether at least one out of k attempts succeeds, while <code>pass^k</code> tracks whether all k attempts succeed. These two metrics tell opposite stories as k grows: <code>pass@k</code> climbs toward 100% while <code>pass^k</code> drops sharply, revealing how consistent your agent really is <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>. For instance, assuming independent attempts that each succeed 70% of the time, k = 5 gives a <code>pass@k</code> of <code>1 - 0.3^5 &#8776; 99.8%</code> but a <code>pass^k</code> of only <code>0.7^5 &#8776; 16.8%</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TdzU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TdzU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!TdzU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TdzU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;pass@k vs pass^k divergence&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="pass@k vs pass^k divergence" title="pass@k vs pass^k divergence" srcset="https://substackcdn.com/image/fetch/$s_!TdzU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!TdzU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 6: pass@k and pass^k tell opposite stories about reliability as the number of trials 
grows.</em></figcaption></figure></div><p>You now have the complete toolkit for evaluating your evaluator.</p><h2>Demo</h2><p>To fully grasp the end-to-end workflow for building AI Evals, I recommend rewatching our demo using&nbsp;<a href="https://aligneval.com/">AlignEval</a>, an open-source tool created by Eugene Yan. It provides a streamlined interface for the exact workflow this article teaches: look at your data, label it, evaluate outputs, and optimize your evaluators:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;90b981d0-fb1c-4c33-82f2-4f1fb476cd02&quot;,&quot;duration&quot;:null}"></div><p>The tool is open source and available at <a href="https://aligneval.com/">aligneval.com</a>, with the source code on GitHub (<a href="https://github.com/eugeneyan/align-app">eugeneyan/align-app</a>). You can try it for free with your own data or use the prompt below to quickly generate a CSV similar to the one from the demo:</p><pre><code><code>I want you to generate a CSV file with the following characteristics:
"""
* The CSV file must include the following columns:
   * id: Unique identifier for each row
   * input: Context used to generate output
   * output: Generated text to be evaluated
   * label: Ground truth (values optional but counts towards XP)
   * explanation: A one-sentence explanation on why we labeled the row as 0 (PASS) or 1 (FAIL)
* &#128680; The label column only accepts binary labels, either 0 or 1.
   * 0: Output PASSES your evaluation
   * 1: Output FAILS your evaluation
"""
that contains 100 rows

The goal of the CSV file is to implement a dataset to build an LLM Judge evaluator. 

We want to create some mock, synthetic data to conceptually show what labeling, evaluating, and optimizing the LLM judge would look like, based on this tool: https://aligneval.com/

Let's say that we collected data from a vertical assistant agent specialized in answering work emails and Slack messages. Thus, create 100 scenarios based on these dimensions:
* feature: email/slack
* scenario: executive, manager, colleague, spam email, phishing email, friend (as an exception)
* label: success/failure of properly answering the message

Where the input is a single email or Slack message or an email or Slack thread, but the output will ALWAYS be just the generated reply, whether it's email or Slack.

Make the labels a 50%/50% split between passes and fails.

Also, note that NO REPLY is an expected behavior for SPAM and phishing emails, and also for non-essential or toxic emails or Slack messages.</code></code></pre><p>We used Claude Opus 4.6 within the Claude app to generate it.</p><h2>Next Steps</h2><p>An evaluator only earns trust when it matches expert judgment. The workflow is straightforward: measure where your judge disagrees with the expert, fix those gaps, and confirm on data the judge has never seen. Only when the judge aligns with the expert on the test set can you rely on your eval metrics.</p><p>The error analysis workflow and iterative labeling were only the tip of the iceberg. Now you see the full picture of how to build, validate, and maintain evaluators.</p><p>Next up is a specialized article focused on evaluating Retrieval-Augmented Generation (RAG) systems.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. 
<strong>Here&#8217;s what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a> </p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a>  </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a>  </p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><strong>How to Evaluate the Evaluator</strong>  &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for 
free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Arize AI. (n.d.). LLM as a Judge: Primer and Pre-Built Evaluators. Arize. <a href="https://arize.com/llm-as-a-judge/">https://arize.com/llm-as-a-judge/</a></p></li><li><p>Husain, H. (n.d.). Using LLM-as-a-Judge for Evaluation. hamel.dev. <a href="https://hamel.dev/blog/posts/llm-judge/">https://hamel.dev/blog/posts/llm-judge/</a></p></li><li><p>Anthropic. (n.d.). Demystifying Evals for AI Agents. Anthropic Engineering Blog. 
<a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Scaling to 120+ AI Agents Without Losing Control]]></title><description><![CDATA[How two-tier orchestration keeps multi-agent systems debuggable]]></description><link>https://www.decodingai.com/p/scaling-120-ai-agents-two-tier-orchestration</link><guid isPermaLink="false">https://www.decodingai.com/p/scaling-120-ai-agents-two-tier-orchestration</guid><dc:creator><![CDATA[Lucian Lature]]></dc:creator><pubDate>Thu, 05 Mar 2026 12:03:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tePE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Paul:</strong> Today, the stage belongs to <a href="https://substack.com/@lucianlature">Lucian Lature</a>, Solutions Architect and Technical Leader with 15+ years of experience spent building and scaling cloud platforms and Node.js products.</p><p>He&#8217;s skipping the textbook definitions today to focus on the architectural trade-offs and real-world logic behind his most recent builds.</p><p>Enough chitchat. Let&#8217;s get into it &#128064; &#8595;</p><div><hr></div><h2>When Single-Agent Systems Fall Apart</h2><p>You know the moment. You built a perfectly capable AI agent that writes code, answers questions, and searches through your docs. It works great. Then you ask it to review code for security issues and synthesize three different research papers. It returns something that&#8217;s half right and half wrong, delivered with full confidence.</p><p>I used to think this was a model problem. 
Better prompts, bigger context window, maybe switch to the latest Sonnet release. Wrong. The problem is architectural, and no amount of prompt engineering fixes it.</p><p>A single agent collapses under 40+ tools, a 2,000-word prompt spanning five different domains, and retrieval tuned for one job at a time. Context windows get bloated. Tool selection becomes a mess. Quality tanks.</p><p>This happened to me with Screech, a personal agent I built for my side projects. It started simple: basically a smarter search over my notes. Then I kept adding: code generation, documentation, code reviews, security audits, and research synthesis. The single-agent approach worked beautifully until it very suddenly didn&#8217;t.</p><p>The stack is not exotic. It&#8217;s VoltAgent for runtime and workflows, SurrealDB as the &#8220;one DB to store everything&#8221; experiment, and Claude as the default model tier.</p><p>And yes, the agent is named after Screech from the Saved by the Bell TV series. Also, my childhood nickname.</p><p>I didn&#8217;t invent this in a vacuum. <a href="https://github.com/getzep/graphiti">Graphiti</a> shaped how I think about knowledge that changes over time. VoltAgent gave me workflow primitives I didn&#8217;t have to build. Paul Iusztin&#8217;s <a href="https://www.decodingai.com/p/stop-converting-documents-to-text">AI Agents Foundations</a> convinced me to stop forcing PDFs through OCR and treat them as images. <a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a> showed me that auditable agent decisions are a performance win, not only a governance check.</p><p>So, here&#8217;s the architecture, decisions, and stuff I&#8217;d do differently. 
For legal reasons, that&#8217;s &#8220;informational only,&#8221; not &#8220;you should do this.&#8221;</p><p><em>Before we continue, a quick word from the Decoding AI team.</em> &#8595;</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Go Deeper: Your Path to Agentic AI for Production</a></h2><p>The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>, walks you through building exactly this kind of multi-agent architecture.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!59a6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">What you will build during the course</a>: Nova, the deep research agent, and Brown, the writing workflow, connected into a multi-agent system.</figcaption></figure></div><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students: <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and <em>&#8220;an excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Start learning today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Start learning today</span></a></p><div><hr></div><p>&#8595; <em>Now, back to the article.</em></p><h2>If You&#8217;re Not Building Agents</h2><p>You can stop here and still get the gist. The problem: one AI agent that should do everything (search your notes, write code, review for security, summarize research) does all of it poorly. One large prompt and many tools. It confuses tasks, wastes tokens, and returns confident nonsense when goals conflict, e.g., security paranoia versus &#8220;ship it&#8221; code gen.</p><p>The solution: one conductor agent that handles simple work itself and a pool of specialists it calls when the task needs depth. The conductor stays cheap and fast for most requests. Specialists run only when needed. You need routing (who handles what), hybrid retrieval (not only vector search), and one store for documents, relationships, and chat (here, SurrealDB). The rest of this article is for people who want to see the wiring.</p><h2>Multi-Agent: When It&#8217;s Worth the Complexity</h2><p>The maintenance overhead is real. So let me be clear about when this makes sense.</p><p>I&#8217;d only do it when there are 3+ domains that actively conflict (dev, research, security is a classic triangle), when I care about cost per request (not &#8220;cost later&#8221;, cost now), and when I need failures to be contained so that one specialist can be dumb without contaminating the whole system.</p><p>Stay single-agent when tasks are similar, the tool count is under about 15, you do not need different model tiers, and simplicity beats per-task quality.</p><p>Single-agent favors simplicity. Multi-agent favors quality per task and adds orchestration. 
Pick your poison.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QFQw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QFQw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 424w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 848w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1272w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QFQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png" width="728" height="258.7042253521127" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:328,&quot;width&quot;:923,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:62221,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QFQw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 424w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 848w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1272w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>The Three-Layer Architecture</h2><p>I think of Screech as an orchestra. One conductor who knows the entire score but doesn&#8217;t play every instrument. Backed by specialists who are genuinely brilliant at their specific parts.</p><p><strong>Layer 1: Orchestration.</strong> It does the boring-but-hard parts: understanding intent, pulling context, and deciding whether this is &#8220;handle it now&#8221; or &#8220;call a specialist&#8221;. Three meta-tools carry most of the orchestration weight: discover subagents, invoke one subagent, and fan out to multiple subagents. A task router (Claude Haiku) classifies complexity before any expensive model runs. The runtime, memory management, workflow engine (with suspend/resume), and MCP server integration come from <a href="https://github.com/VoltAgent/voltagent">VoltAgent</a>. 
I didn&#8217;t build any of that infrastructure; I plugged in.</p><p><strong>Layer 2: Specialists.</strong> 128 subagents in 10 categories, including core development, language specialists (TypeScript, Python, Rust, plus 19 more), testing &amp; quality, and meta-orchestration. More on why that number and why these categories in a bit.</p><p><strong>Layer 3: Knowledge. </strong>Hybrid retrieval combining vector search + knowledge graph traversal + keyword matching, all backed by SurrealDB. Plus a temporal layer (Graphiti-style) so the system knows when it learned something, not only what.</p><p>Here&#8217;s the key decision: Screech is a full agent with its own tools and retrieval, not a dumb router. It handles 60&#8211;70% of requests directly. Subagents only kick in when you need deep specialization. That keeps latency and cost sane for the common case.</p><p><strong>If you&#8217;ve used Claude Code, this pattern will feel familiar, but there is a key difference:</strong> Claude Code is one agent plus injected context (<a href="http://CLAUDE.md">CLAUDE.md</a>, conventions, slash commands). When you give it a task, the same agent handles everything, and it just gets extra context injected from your skill files. It&#8217;s the &#8220;enhanced single-agent&#8221; end of the spectrum: one brain, augmented with domain knowledge. Screech pushes further along that spectrum. Instead of injecting domain knowledge into one agent&#8217;s prompt, each specialist <em>is its own agent</em> with a dedicated system prompt, model tier, and tool set. The orchestrator doesn&#8217;t just get &#8220;React knowledge&#8221; injected; it delegates to a <code>react-specialist</code> agent that was born and bred to think in components, hooks, and JSX. 
The difference matters when domains actively conflict: a security auditor&#8217;s &#8220;assume everything is dangerous&#8221; mindset would poison a code generator&#8217;s &#8220;keep it simple&#8221; prompt if they shared the same context. Separate agents, separate prompts, no cross-contamination. Think of it as: Claude Code = one chef who reads different recipe books depending on the dish. Screech = a head chef who delegates to a pastry specialist, a sushi chef, and a grill master: each with their own kitchen and knives.</p><p>The diagram below shows how these layers connect. Here&#8217;s the flow from top to bottom:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H09h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H09h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 424w, https://substackcdn.com/image/fetch/$s_!H09h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 848w, https://substackcdn.com/image/fetch/$s_!H09h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1272w, https://substackcdn.com/image/fetch/$s_!H09h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!H09h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif" width="728" height="637.6066666666667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1051,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:2239121,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H09h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 424w, https://substackcdn.com/image/fetch/$s_!H09h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 848w, https://substackcdn.com/image/fetch/$s_!H09h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!H09h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The <strong>User Interfaces</strong> layer exposes three entry points: a web UI (React), an MCP server (for IDE integrations such as Cursor), and a CLI. All three hit the same orchestration layer.</p><p>The <strong>Screech Web UI</strong> is a React-based interface for the Screech personal knowledge agent. 
It connects to the Screech backend and provides five main views: <strong>Sources</strong> (ingest and manage documents, view chunk/entity/pattern/insight stats), <strong>Notes</strong>, <strong>Chat</strong> (conversation with the agent), <strong>Search</strong> (query over your knowledge), and <strong>Knowledge Graph</strong> (browse entities and relationships). It also shows connection status and supports running synthesis. Its main use is to browse, chat, search, and explore your knowledge graph in one place.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cpGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cpGB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 424w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 848w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1272w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cpGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png" width="728" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:824,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1411086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cpGB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 424w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 848w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cpGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In the <strong>Orchestration Layer</strong>, the Screech master agent (Claude Sonnet 4) sits at the center and handles 60&#8211;70% of requests directly. 
Three supporting components surround it: the <strong>Task Router</strong> (Haiku, $0.0025/classification), the <strong>Event Bus</strong> (in-process pub/sub), and <strong>Persistent Memory</strong> (conversation history, user context).</p><p><strong>The 128 Subagents</strong> are arranged by category: Core Dev (11), Language Specialists (22), DevOps (15), Testing &amp; Quality (13), Domain-Specific (27), Business (12), Research (6), Dev Experience (12), and Meta-Orchestration (10). The orchestrator delegates to these when deep specialization is needed.</p><p><strong>Hybrid Retrieval</strong> sits between the agents and the database: <code>0.6 vector + 0.2 graph + 0.2 keyword</code>, merging three signals before final relevance scoring.</p><p><strong>SurrealDB</strong> acts as the persistence layer, split into three logical stores in one database: the <strong>Vector Store</strong> (MTREE index, 3072-dim embeddings, cosine similarity), the <strong>Knowledge Graph</strong> (entities, relationships, BFS traversal), and the <strong>Temporal Graph</strong> (Graphiti-inspired episodes, facts, time-range queries).</p><p>At a code level, Screech is just an <code>Agent</code> instance with four things wired in: a <strong>model</strong>, a <strong>hybrid retriever</strong>, a <strong>tool set</strong>, and <strong>persistent memory</strong>. This &#8220;agent factory&#8221; is the single place where the orchestration decisions become concrete.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// The Screech agent factory
const agent = new Agent({
  name: "Screech",
  purpose: "Unified personal agent for side projects, knowledge synthesis, " +
           "development, documentation, and orchestration of specialist subagents.",
  model: anthropic("claude-sonnet-4-20250514"),
  retriever,  // Hybrid RAG (vector + graph + keyword)
  tools: screechTools,  // Deduplicated from 3 domains
  memory,  // LibSQL persistent memory
});</code></pre></div><h2>11 Tables, One Database: The SurrealDB Model</h2><p>One of the strongest arguments for SurrealDB: documents, embeddings, knowledge graph, temporal events, and conversation memory all live in 11 tables in one database. No Postgres + Neo4j + Redis dance.</p><h3>Documents and Chunks</h3><p>Ingesting a document creates a <code>document</code> record (metadata plus a content hash for dedup), then splits the content into <code>chunk</code> records. Each chunk gets a 3072-dim embedding (OpenAI <code>text-embedding-3-large</code>). SurrealDB&#8217;s MTREE index handles cosine-similarity search over those embeddings. MTREE is a tree index for high-dimensional vectors; it serves the same purpose as pgvector&#8217;s HNSW or IVFFlat indexes, letting SurrealDB find the nearest embeddings without brute-force scanning every row. Chunks are multimodal: they store <code>image_data</code> (base64) and <code>mime_type</code> alongside text. This comes straight from Paul Iusztin&#8217;s insight: stop forcing PDFs through OCR and treat them as images.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Multimodal chunk structure
interface Chunk {
  document_id: string;  // Parent document link
  content: string;      // Text or image description
  embedding: number[];  // 3072-dim vector (MTREE indexed)
  mime_type?: string;   // "text/plain", "image/png", "application/pdf"
  image_data?: string;  // Base64 for vision-processed pages
  page_number?: number; // PDF page tracking
}</code></pre></div><h3>Entities and the Graph (Your Ontology)</h3><p>Here&#8217;s where Screech diverges from typical RAG: I extract a structured knowledge graph. Claude identifies entities and relationships from each document. SurrealDB&#8217;s <code>RELATION</code> table type makes this straightforward: a <code>relates_to</code> table declared with <code>TYPE RELATION IN entity OUT entity</code> stores the edges between <code>entity</code> records. No separate graph DB.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- SurrealDB graph relationships (native support)
DEFINE TABLE relates_to SCHEMAFULL TYPE RELATION IN entity OUT entity;

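-- Assumed field definitions (a hypothetical sketch, not from the original schema).
-- A SCHEMAFULL table only accepts fields that are explicitly defined, so the
-- properties set by the RELATE statement below would need declarations like these.
DEFINE FIELD relation_type ON TABLE relates_to TYPE string;
DEFINE FIELD confidence ON TABLE relates_to TYPE number;
DEFINE FIELD description ON TABLE relates_to TYPE option&lt;string&gt;;
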
RELATE entity:nextjs-&gt;relates_to-&gt;entity:react CONTENT {
  relation_type: "EXTENDS",
  confidence: 0.9,
  description: "Next.js extends React with SSR and routing"
};</code></pre></div><p>I picked 11 entity types on purpose. They form the system&#8217;s <strong>ontology</strong>, the vocabulary it uses to classify everything it learns: <code>concept</code>, <code>person</code>, <code>organization</code>, <code>tool</code>, <code>technology</code>, <code>pattern</code>, <code>best_practice</code>, <code>principle</code>, <code>process</code>, <code>document</code>, <code>topic</code>. Each type has its own extraction prompt (e.g., <code>person</code> focuses on roles and affiliations, <code>technology</code> on use cases and ecosystem). Relationship types include <code>IMPLEMENTS</code>, <code>USES</code>, <code>DEPENDS_ON</code>, <code>PART_OF</code>, <code>EXTENDS</code>, <code>SIMILAR_TO</code>. The ontology is deliberately small: enough granularity for useful graph queries without turning into a taxonomy nightmare. Bigger ontologies mean more edge cases and more &#8220;is this a tool or a technology?&#8221; ambiguity. Eleven types cover 95%+ of what a personal knowledge agent encounters.</p><h3>Episodes and Facts: Temporal Layer (It&#8217;s a Log)</h3><p>This is the <a href="https://github.com/getzep/graphiti">Graphiti</a>-inspired layer that most RAG systems completely skip. Every ingestion creates an episode. Think of episodes as an <strong>append-only log of everything the system has ever learned</strong>. Each episode is timestamped and immutable. Episodes link to entities via <code>source_episode_ids</code>. Ingest a PDF, and you get a new episode. Process a paper, and you get a new episode. Episodes are never updated or overwritten, and old ones are not deleted when new ones arrive; they stay in the timeline with their original timestamps. You can ask &#8220;what did I know about X six months ago?&#8221; and get an accurate answer.</p><p>Facts are triples (<code>subject</code>, <code>predicate</code>, <code>object</code>), each with a <code>source_episode_id</code>. They capture the structured knowledge extracted alongside each episode. 
When two facts conflict, e.g., &#8220;Bun is experimental&#8221; (June) and &#8220;Bun is production-ready&#8221; (the following January), the agent can prefer the more recent one.</p><p>Why does this matter? Knowledge changes. Without temporal tracking, both facts coexist in your knowledge base with equal weight, and the agent might confidently cite the stale one. Graphiti calls this &#8220;bi-temporal awareness&#8221;: tracking both when a fact was true in the world <em>and</em> when the system learned it.</p><p>Behind the scenes, temporal queries run in three steps: entity search (match query terms), then episode retrieval in a time range (filter out chat episodes, keep knowledge episodes), then relevance filtering and linking back to entities. The result is a time-ordered context. The <code>context</code> field returned is prompt-ready: entity descriptions and relationship sentences in plain language.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Time-aware queries
const recent = await queryTemporalGraph("recent learnings", {
  includeTemporal: true,
  timeRange: { start: oneMonthAgo, end: now },
});
// Returns: episodes + linked entities + relationships, time-ordered</code></pre></div><h3>Patterns and Insights: The &#8220;So What?&#8221; Chain</h3><p>Beyond storage, Screech runs a synthesis pipeline that detects patterns across your knowledge base and generates actionable insights. The <code>pattern</code> table stores detected patterns (workflows, successes, failures, optimizations) with embeddings for searchability. The <code>insight</code> table stores generated insights, each linked back to source patterns with relevance scores.</p><p>Documents become chunks, which get embedded. Chunks become entities through extraction. Entities become patterns once the synthesis pipeline starts noticing recurring signals. Patterns become insights (actionable takeaways with provenance). Each stage feeds the next. Every layer is searchable. A vague question goes to vector search over chunks. A relationship question goes to graph traversal over entities. &#8220;What should I do?&#8221; goes to insights.</p><h3>Conversation Memory</h3><p>The <code>user</code>, <code>thread</code>, and <code>message</code> tables handle conversation memory: Zep-style user summaries, conversation threading, and message history. This gives persistent context across sessions, kept separate from the knowledge base.</p><p>The diagram below shows how data flows through the system. Think of it as five layers stacked on top of each other, each feeding the next:</p><ol><li><p><strong>Document layer</strong> (top): <code>document</code> and <code>chunk</code>. Raw material comes in here. A PDF becomes a <code>document</code> record; its content gets split into <code>chunk</code> records, each with a 3072-dim embedding. This is the foundation everything else builds on.</p></li><li><p><strong>Temporal layer</strong>: <code>episode</code> and <code>community</code>. Every ingestion event creates an <code>episode</code> timestamped to when the system learned it. Episodes link back to chunks (what was ingested) and forward to entities (what was extracted). 
This is the Graphiti-inspired time dimension: the system knows <em>when</em> it learned something, not just <em>what</em>.</p></li><li><p><strong>Knowledge graph layer</strong>: <code>entity</code>, <code>relates_to</code>, and <code>fact</code>. Entities extracted from chunks (concepts, technologies, people) live here, connected by typed <code>relates_to</code> edges. The diamond shape in the diagram represents the relationship table, which is a SurrealDB <code>RELATION</code> type that sits <em>between</em> entity nodes. <code>fact</code> triples (subject, predicate, object) capture the structured knowledge extracted alongside entities.</p></li><li><p><strong>Synthesis layer</strong>: <code>pattern</code> and <code>insight</code>. This layer holds the patterns detected across your knowledge base (recurring workflows, success/failure signals, optimization opportunities) and the actionable insights generated from them. Each links back to the entities and episodes that sourced it.</p></li><li><p><strong>Conversation layer</strong> (bottom): <code>user</code>, <code>thread</code>, <code>message</code>. Conversation memory, separate from knowledge. Threads reference the user; messages reference threads. The agent can query conversation history independently of the knowledge base.</p></li></ol><p>The arrows in the diagram show the key relationships. Chunks link to their parent document. Episodes link to chunks and entities. Entities connect via <code>relates_to</code>. Patterns and insights link back to entities and episodes for provenance. 
Each layer is independently searchable via the hybrid retrieval pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N9Lq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N9Lq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 424w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 848w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1272w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif" width="1200" height="2188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2188,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2513810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N9Lq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 424w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 848w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1272w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>When to Use This vs. Alternatives</h3><p>Unified SurrealDB works when you want graph + vector + relational without three databases, your dataset is moderate (thousands to tens of thousands of docs), and you value dev velocity over ecosystem maturity.</p><p>For production SLAs, PostgreSQL plus pgvector is the safer bet. If your graph is only 2&#8211;3 hops (like Screech&#8217;s BFS), Postgres handles it with recursive CTEs or JOINs, even at millions of rows. Neo4j earns its place when you need deep traversals or heavy graph queries. Graphiti uses Neo4j for that. For my 2-hop, few-thousand-entity case, Postgres + pgvector could do it all. I chose SurrealDB to prototype faster with one schema and one connection. Right call for me. One schema file. One connection. One query language. 
I would not blindly recommend it for a team with compliance needs.</p><h2>Subagent System: Factory, Registry, Profiles</h2><p>Every subagent is a factory <code>(memory?) =&gt; Agent</code>. This keeps instantiation lazy (no subagent is created until needed) and lets agents share memory (agents in the same workflow see the same conversation history).</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Subagent factory pattern
export type SubagentFactory = (memory?: Memory) =&gt; Agent;

// Registry metadata for discovery
export interface SubagentDefinition {
  name: string;
  description: string;
  category: SubagentCategory;  // 10 categories
  tags: string[];
  modelTier?: ModelTier;       // fast | standard | reasoning
  toolProfile?: ToolProfile;   // core | dev | security | full
  capabilities?: SubagentCapabilities;
  factory: SubagentFactory;
}</code></pre></div><p>Three decisions that actually matter:</p><p><strong>Model tiers control cost.</strong> Not every agent needs Sonnet. Simple formatting &#8594; Haiku (~90% cheaper). Security audits &#8594; reasoning tier. The default is standard (Sonnet 4), and the router can override it.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const MODEL_MAP = {
  fast: anthropic("claude-3-5-haiku-20241022"),     // ~10x cheaper
  standard: anthropic("claude-sonnet-4-20250514"),   // Balanced
  reasoning: anthropic("claude-sonnet-4-20250514"),  // Same model, deeper prompts
};</code></pre></div><p><strong>Tool profiles prevent token waste.</strong> A research analyst doesn&#8217;t need git tools. A code reviewer doesn&#8217;t need security scanning tools. Four profiles:</p><ul><li><p><code>core</code> (7 tools): knowledge/RAG + file ops + workflow discovery</p></li><li><p><code>dev</code> (15 tools): core + git + code analysis + testing</p></li><li><p><code>security</code> (17 tools): dev + security scanning + dependency audit</p></li><li><p><code>full</code> (18 tools): everything (backwards-compatible default)</p></li></ul><p>Each subagent can add domain tools on top of its profile.</p><p><strong>Capability declarations.</strong> With 128 agents, a query like &#8220;find agents tagged typescript&#8221; returns a dozen candidates, so the orchestrator needs to know what each agent is actually good at. Each subagent declares what it can do, its expected input and output, and its latency tier. Semantic matching sends &#8220;TypeScript conditional types&#8221; to the agent whose <code>canDo</code> includes &#8220;conditional types&#8221; and &#8220;type system design&#8221;, not to any agent that merely has TypeScript in its tags. Two agents can share the same <code>languages</code> field and still differ in <code>canDo</code>, e.g., <code>typescript-pro</code> vs. <code>react-specialist</code>.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">capabilities: {
  canDo: ["type system design", "generics", "conditional types"],
  languages: ["typescript", "javascript"],
  inputSchema: "code snippet or type problem description",
  outputSchema: "typed solution with explanation",
  latencyTier: "medium",
}</code></pre></div><p><strong>Why 10 categories?</strong> Flat list worked until about 40 agents. Then discovery got noisy. Categories are a coarse filter: the orchestrator picks a category, then finds the right specialist within it. The split follows prompt conflicts: security paranoia vs. code-gen creativity, test-engineer adversarial vs. technical-writer explanatory. Separate categories, separate prompts.</p><p>At the same time, <strong>agents in the same category share a tool profile but differ in expertise.</strong> All language specialists get the <code>dev</code> tool profile (git, testing, code analysis). All testing-quality agents share <code>dev</code> tools, too, but their prompts are tuned for finding problems, not writing code. Security agents get <code>security</code> tools. Research agents only need <code>core</code> tools (knowledge/RAG). The table below summarizes counts and examples. The diagram repeats it visually.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T_gi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T_gi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 424w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 848w, 
https://substackcdn.com/image/fetch/$s_!T_gi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1272w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T_gi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png" width="728" height="420.6047516198704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:535,&quot;width&quot;:926,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:162791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T_gi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 424w, 
https://substackcdn.com/image/fetch/$s_!T_gi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 848w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1272w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N9Lo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N9Lo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:158842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N9Lo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>How VoltAgent Fits Into the Picture</h3><p>VoltAgent is the runtime layer that makes the orchestration practical: it provides workflow primitives, tool execution, memory management, and suspend/resume so the orchestrator and subagents can run as a coordinated system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5a7I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5a7I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 424w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 848w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5a7I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png" width="728" height="406" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:469712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5a7I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 424w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 848w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The $0.0025 Routing Layer</h2><p>This is what makes the economics work. <strong>Step 1: Classification.</strong> Before anything expensive runs, I do a tiny classification call (Haiku) to label the request. Complexity. Domain. Suggested tier. Up to three candidate specialists. The system prompt gives Haiku the full category list. Rules are explicit. &#8220;Security&#8221; or &#8220;vulnerability&#8221; always goes to reasoning. Simple questions go to fast. Code gen goes to at least standard.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const classificationSchema = z.object({
  complexity: z.enum(["simple", "moderate", "complex"]),
  domain: z.enum(["lookup", "formatting", "code-generation",
                  "code-review", "architecture", "security",
                  "debugging", "research", "orchestration", "other"]),
  reasoning: z.string(),
  suggestedTier: z.enum(["fast", "standard", "reasoning"]),
  suggestedSubagents: z.array(z.string()).max(3),
});</code></pre></div><p><strong>Step 2: Domain overrides.</strong> <code>resolveRoutedTier()</code> takes the complexity-based tier and domain overrides and picks the <em><strong>higher</strong></em> of the two. So a &#8220;simple&#8221; security question still goes to reasoning. Security that looks simple often is not. The override is a safety net for Haiku&#8217;s optimism.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const DOMAIN_TIER_OVERRIDES = {
  security: "reasoning",     // No shortcuts
  architecture: "reasoning", // No shortcuts
  debugging: "reasoning",    // No shortcuts
  "code-review": "standard", // At least standard
};</code></pre></div><p><strong>Step 3: Final routing.</strong> Resolved tier + suggested subagents + rationale &#8594; orchestrator picks model and specialists.</p><p><strong>Fallback.</strong> If classification fails (network error, timeout), we default to <code>{ complexity: "moderate", tier: "standard" }</code>. Fail to the middle. Not the cheapest (might undershoot). Not the most expensive (waste on every failure). Safest with zero information.</p><p><strong>Cost:</strong> ~$0.0025 per classification (~$0.25/M input tokens on Haiku). Route 1,000 tasks, spend $2.50. If even 30% land on Haiku instead of Sonnet, you save on the order of $8-10 per 1,000 tasks. The router pays for itself quickly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tePE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tePE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 424w, https://substackcdn.com/image/fetch/$s_!tePE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 848w, https://substackcdn.com/image/fetch/$s_!tePE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!tePE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tePE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif" width="1200" height="1485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1485,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2084570,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tePE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 424w, https://substackcdn.com/image/fetch/$s_!tePE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 848w, 
https://substackcdn.com/image/fetch/$s_!tePE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1272w, https://substackcdn.com/image/fetch/$s_!tePE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Hybrid Retrieval: Three Signals Beat One</h2><p>I started with pure vector search. It worked until it didn&#8217;t. 
Three failure modes:</p><ol><li><p><strong>Structural queries fell flat.</strong> &#8220;What tools does the API designer use?&#8221; The answer is in the relationship structure. Vector search gave me chunks that mentioned the API designer, not the ones describing its tool config. I needed graph traversal.</p></li><li><p><strong>Exact-match queries got paraphrased away.</strong> &#8220;What is the error for SQLITE_BUSY?&#8221; Embeddings map that into the &#8220;database locking&#8221; neighborhood and miss the chunk with the actual error code. I needed keyword search.</p></li><li><p><strong>Long-document questions needed reasoning, not similarity.</strong> &#8220;What are the conclusions?&#8221; The conclusion section often is not the most similar to the word &#8220;conclusions&#8221;. The intro restating the thesis can score higher. I needed the model to reason over document structure (e.g., a table of contents), not only similarity.</p></li></ol><p>Instead of trying to make one approach handle everything, I split retrieval into three paths:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Ru_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Ru_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 424w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 848w, 
https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1272w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png" width="923" height="280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:280,&quot;width&quot;:923,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Ru_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 424w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 848w, 
https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1272w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The agent picks which tool to call: <code>document_search</code> / <code>get_context</code> &#8594; hybrid (vector + graph + keyword, 
then rerank). <code>search_within_document</code> &#8594; same pipeline, one document. <code>answer_from_document_deep</code> &#8594; build a section tree from chunks, LLM picks sections, fetch those chunks only. No vectors on path 3.</p><h3>Why Three Signals Beat One (Paths 1 &amp; 2)</h3><p>The hybrid pipeline runs three queries <strong>in parallel</strong>, then merges with configurable weights:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const DEFAULT_WEIGHTS = {
  vectorWeight: 0.6,   // Semantic similarity (primary)
  graphWeight: 0.2,    // Relationship traversal
  keywordWeight: 0.2,  // Exact matching
};</code></pre></div><p>Why these weights? <strong>Vector search</strong> gets 0.6 because most knowledge base questions are conceptual (&#8220;explain X,&#8221; &#8220;how does Y work?&#8221;), and embeddings handle these well. It uses OpenAI&#8217;s <code>text-embedding-3-large</code> (3072 dims) with SurrealDB&#8217;s native MTREE index and cosine similarity. The threshold is intentionally low (0.35), but you can tweak that value according to your use case. Similarity scores compress into a narrow range, so 0.35 is more selective than it sounds.</p><p><strong>Graph search</strong> gets 0.2 because structural queries are the minority but high-value. It works in two stages. First, it finds entities that match the query by name or description, basically a simple text search over the <code>entity</code> table (concepts, technologies, people, organizations). Then it <em>expands</em> outward from those matches using breadth-first search (BFS): for each matched entity, it queries the <code>relates_to</code> edges in SurrealDB to discover neighbors, scoring them at 70% of the parent&#8217;s relevance. A configurable traversal depth (default: 2 hops) controls how far the expansion goes. It should be deep enough to find meaningful connections, but shallow enough to avoid pulling in the entire graph.</p><p>So: vector search finds a chunk mentioning &#8220;React&#8221;? Graph search starts at the &#8220;React&#8221; entity node, walks its edges, and pulls in &#8220;hooks&#8221;, &#8220;server components&#8221;, or &#8220;Next.js&#8221;, without needing those terms in the original query. Need the path between two concepts? A separate BFS finds the shortest connection: <code>React &#8594; EXTENDS &#8594; JavaScript &#8594; USES &#8594; V8</code>, each hop following typed relationships (<code>IMPLEMENTS</code>, <code>USES</code>, <code>DEPENDS_ON</code>, <code>EXTENDS</code>, <code>SIMILAR_TO</code>). 
This is the signal that vector search fundamentally <em>cannot</em> provide, because relationships are structural, not semantic.</p><p><strong>Keyword search</strong> gets 0.2 because sometimes you just need to find the exact string. Ask a pure vector system &#8220;what version of React does project X use?&#8221; and it&#8217;ll confidently return chunks about React 17, React 18.2, and React 19, because to an embedding model they&#8217;re all basically &#8220;React with a number.&#8221; Helpful if you&#8217;re writing an essay. Useless if you need the actual version pinned in your <code>package.json</code>. Keyword search is the boring friend who actually reads the label. Full-text matching with term coverage scoring. No AI magic, just string comparison. And for error codes, version numbers, and config keys, that&#8217;s exactly what you want.</p><p>After merging, results go through <strong>reranking</strong>, and this is where the quality jump happens. The weighted merge gets you close, but reranking catches cases where a high-scoring vector result is semantically related but does not actually <em>answer the question</em>.</p><p>The reranker supports three methods, selectable per query:</p><ul><li><p><strong>Embedding reranking</strong> (fast, cheap): recalculates cosine similarity between the query embedding and each result&#8217;s embedding, then blends it 50/50 with the original merge score. This catches results that scored well on the graph or keyword signal but are semantically distant from the actual query. Fast because it&#8217;s just math. You don&#8217;t need an LLM call.</p></li><li><p><strong>LLM reranking</strong> (slower, more accurate): sends the query + top 20 candidate passages to Claude Sonnet 4, which scores each on a 0&#8211;1 relevance scale. The LLM understands <em>intent</em>: it knows that &#8220;how do I fix CORS errors?&#8221; is asking for a solution, not a definition. 
It sits behind an LRU cache (128 entries, 5-minute TTL) to avoid redundant calls for similar queries.</p></li><li><p><strong>Hybrid reranking</strong> (two-pass): embedding reranking first to narrow the candidate set, then LLM reranking on the survivors. Best quality, highest latency.</p></li></ul><p>On top of any reranking method, there&#8217;s an optional <strong>diversity-aware mode</strong> using MMR (Maximal Marginal Relevance). It iteratively selects results that maximize relevance while penalizing similarity to already-selected results, so it avoids returning five chunks from the same paragraph. There is also a <strong>source-type preference</strong> layer that weights chunks, entities, patterns, and insights differently depending on the query type.</p><h3>Why Reasoning Beats Similarity for Long Documents (Path 3)</h3><p>This is the insight that took me the longest to internalize. For a 300-page PDF, when someone asks &#8220;what are the conclusions?&#8221; the <em>location</em> of the answer is a function of document <em>structure</em>, not content similarity. A chunk from the introduction that restates the thesis will often score higher on cosine similarity to &#8220;conclusions&#8221; than the actual conclusion section. More embedding dimensions won&#8217;t fix this. Better chunking strategies help, but don&#8217;t solve it.</p><p>Path 3 skips vector search entirely. The pipeline has three steps:</p><p><strong>Step 1: Build the section tree.</strong> Take the document&#8217;s chunks (already stored from ingestion) and group them into sections. If chunks have page numbers (PDFs), group by page. Otherwise, group into fixed-size windows (default: 4 chunks per section). Each section node gets an ID, a title (&#8220;Page 12&#8221; or &#8220;Section 5&#8221;), and a short summary (first ~220 characters of the first chunk). 
The result is a flat list of <code>TreeNode</code> objects, essentially a reconstructed table of contents.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">interface TreeNode {
  node_id: string;        // "s0", "s1", "s2"...
  title: string;          // "Page 12" or "Section 5"
  summary: string;        // First ~220 chars of first chunk
  startChunkIndex: number;
  endChunkIndex: number;
  pageRange?: string;     // "pp. 12&#8211;14"
}</code></pre></div><p><strong>Step 2: LLM selects relevant sections.</strong> The tree outline (node IDs + titles + summaries) is sent to Claude in a single prompt. The key instruction: <em>use reasoning, not keyword matching</em>. The prompt explicitly tells the LLM to think structurally, e.g., &#8220;conclusions are usually in the final section&#8221; or &#8220;&#8216;see Appendix G&#8217; means look for an appendix section.&#8221; The LLM returns a JSON object with its reasoning and a list of selected node IDs.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// What the LLM sees (abbreviated)
// - s0: Page 1 &#8212; "Chapter 1: Introduction. This paper presents..."
// - s1: Page 2 &#8212; "Related work in retrieval-augmented generation..."
// - ...
// - s14: Page 28 &#8212; "7. Conclusions and Future Work. We have shown..."

// What the LLM returns
{
  "thinking": "Conclusions are in the final sections. s14 title mentions Conclusions.",
  "node_list": ["s14"]
}</code></pre></div><p><strong>Step 3: Fetch and return.</strong> Map the selected node IDs back to chunk index ranges, fetch those chunks, and concatenate their content. That&#8217;s your retrieval context. No embedding comparison anywhere in the pipeline.</p><p>The fallback is important: if the LLM returns invalid JSON or no valid node IDs, the system defaults to the first 2&#8211;3 sections. Better to return <em>something</em> than nothing, and introductory sections are a reasonable default for most questions.</p><p>The design is directly inspired by <a href="https://github.com/VectifyAI/PageIndex">PageIndex</a>&#8217;s thesis: similarity &#8800; relevance, and reasoning over document structure often beats embedding search for professional long-form content. It won&#8217;t help with vague conceptual questions; that&#8217;s what Path 1 is for. But for &#8220;where in this document does X live?&#8221; or &#8220;what does chapter 7 say about Y?&#8221;, it&#8217;s dramatically better because the LLM can reason about document organization the way a human reader would: by scanning the table of contents first.</p><p><strong>Document-scoped retrieval (Path 2)</strong> simply narrows the same hybrid pipeline to one document via a <code>documentIds</code> filter. 
Same three signals, same reranker&#8212;just scoped.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j0uI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j0uI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 424w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 848w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1272w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j0uI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif" width="1200" height="1556" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1556,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3325409,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j0uI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 424w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 848w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1272w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The temporal knowledge graph is directly inspired by <a href="https://github.com/getzep/graphiti">Graphiti</a>, Zep&#8217;s framework for building real-time knowledge graphs. Their core insight: knowledge isn&#8217;t static. Tracking <em>when</em> facts were learned matters as much as the facts themselves. Perfect for a personal agent that continuously ingests new content.</p><p>Every ingested piece creates an &#8220;episode&#8221;: a timestamped event that links to the extracted entities and fact triples (subject-predicate-object). That enables time-aware queries such as &#8220;What technologies have I been reading about this month?&#8221; or &#8220;How has my understanding of RAG changed?&#8221;</p><h2>Four Production Patterns That Actually Saved Me</h2><p>These emerged from running this thing in the wild. 
I&#8217;d recommend all four to anyone building multi-agent systems.</p><h3>1. LLM Resilience with Tiered Timeouts</h3><p>Every LLM call goes through <code>withLLMResilience()</code>, a wrapper that adds per-attempt <code>AbortController</code> timeouts and exponential backoff with jitter, and retries only on rate limits, 5xx responses, and network errors. It never retries other 4xx errors (your fault, not theirs). Timeouts differ per use case: classification gets 60s (if it takes longer than that, fail), synthesis gets 300s. I learned this the hard way: one stuck call should not hold everything up.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">export const LLM_TIMEOUTS = {
  fast: { timeoutMs: 60_000, maxRetries: 3 },      // Classification, reranking
  standard: { timeoutMs: 120_000, maxRetries: 2 }, // Agent generation
  long: { timeoutMs: 300_000, maxRetries: 1 },     // Synthesis, deep analysis
};
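
// A minimal sketch of what a wrapper like withLLMResilience() could look like.
// Assumptions, not the original implementation: the `baseDelayMs` option and its
// 500 ms default, the `isRetryable` helper, and that API errors carry a numeric
// HTTP `status` while network errors and timeouts do not.
function isRetryable(err: any): boolean {
  const status = err?.status;
  if (typeof status !== "number") return true; // network failure or our own timeout
  return status === 429 || status >= 500;      // rate limit or 5xx; other 4xx fail fast
}

async function withLLMResilience(
  fn: (signal: AbortSignal) => any,
  opts: { timeoutMs: number; maxRetries: number; label: string; baseDelayMs?: number },
) {
  for (let attempt = 0; ; attempt++) {
    const controller = new AbortController();
    // Per-attempt timeout; it only helps if `fn` passes the signal to its request.
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      return await fn(controller.signal);
    } catch (err) {
      if (attempt >= opts.maxRetries || !isRetryable(err)) throw err;
      // Exponential backoff with full jitter: random wait in [0, base * 2^attempt).
      const delayMs = Math.random() * (opts.baseDelayMs ?? 500) * 2 ** attempt;
      console.warn(`[${opts.label}] attempt ${attempt + 1} failed, retrying`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    } finally {
      clearTimeout(timer);
    }
  }
}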

// Usage
const result = await withLLMResilience(
  (signal) =&gt; anthropic.messages.create({ ... }, { signal }),
  { ...LLM_TIMEOUTS.fast, label: "task-classification" }
);</code></pre></div><p>Key insight: different operations need different timeout budgets. Classification call taking 60 seconds? Failed. Synthesis operation taking 60 seconds? Just warming up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CuIu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CuIu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 424w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 848w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CuIu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png" width="1456" height="847" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:803494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CuIu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 424w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 848w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>2. Findings Cache for Review Chains</h3><p>When you run multiple reviewers on the same code (code-reviewer &#8594; security-auditor &#8594; design-analyst), each reviewer produces findings that downstream reviewers need. Without sharing, every reviewer re-parses the same files, re-discovers the same structure, and wastes tokens on duplicate analysis.</p><p>The <code>FindingsCache</code> is a singleton in-memory cache keyed by chain ID. 
It stores two things: <strong>structural analysis</strong> (file structure, dependencies, symbols, complexity metrics) produced by the first reviewer, and <strong>accumulated findings</strong> from every reviewer in the chain&#8212;each typed with category, severity, source, and location.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Each finding is typed and traceable
interface ReviewFinding {
  source: string;     // Which reviewer produced it
  category: "structural" | "quality" | "security" | "design" | "performance";
  severity: "info" | "low" | "medium" | "high" | "critical";
  summary: string;    // Human-readable
  data?: Record&lt;string, unknown&gt;;  // Structured data per reviewer type
  location?: string;  // File/line reference
}</code></pre></div><p>The first reviewer in the chain caches the expensive structural work. Later reviewers call <code>getChainContextForReviewer()</code>, which returns the previous findings plus the structural cache as a prompt-ready string. Entries expire after 10 minutes, and the cache is capped at 50 chains. This cuts chain latency by about 40%, because the expensive part is parsing and context building, not the LLM. Pattern credit: <a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a>. Traceable decisions are also a performance win.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// First reviewer caches structure
findingsCache.setStructuralCache(chainId, {
  fileStructure: "src/api/users.ts - 245 lines, 3 exports",
  dependencies: ["express", "zod", "prisma"],
  symbols: ["createUser", "validateInput", "UserSchema"],
  metrics: { cyclomaticComplexity: 12, loc: 245 },
});

// Later reviewers get pre-built context
const context = findingsCache.getChainContextForReviewer(chainId, "security-auditor");
// Returns previous findings + cached structure, prompt-ready</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!25rx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!25rx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 424w, https://substackcdn.com/image/fetch/$s_!25rx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 848w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1272w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!25rx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif" width="1200" height="2126" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2126,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4663699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!25rx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 424w, https://substackcdn.com/image/fetch/$s_!25rx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 848w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1272w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h4>What the Workflow Looks Like in Practice</h4><p>The screenshot below is a real run of a multi-step workflow, showing the chain of specialist calls and the event-style logging that makes the system debuggable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EHL2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!EHL2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 424w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 848w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1272w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EHL2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png" width="1456" height="875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:754816,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EHL2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 424w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 848w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1272w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>3. In-Process Event Bus</h3><p>Agents need to share findings without tight coupling. The solution is a singleton in-memory pub/sub with typed events, well-known topics (<code>VULNERABILITY_FOUND</code>, <code>CODE_REVIEW_COMPLETE</code>, etc.), a source agent ID plus a correlation ID for tracing across a review chain, and a typed payload. When the security auditor finds a vulnerability, it publishes to <code>vulnerability_found</code>; the code reviewer subscribes and incorporates the finding.</p><p>The implementation is a single <code>AgentEventBus</code> class with no external dependencies.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Typed event structure
interface AgentEvent&lt;T = unknown&gt; {
  id: string;           // Auto-generated: evt_&lt;timestamp&gt;_&lt;counter&gt;
  topic: string;        // Well-known topic (e.g., "vulnerability_found")
  source: string;       // Publishing agent name
  data: T;              // Typed payload
  timestamp: string;    // ISO timestamp
  correlationId?: string; // Chain/session tracing
}</code></pre></div><p>The key design choice: <strong>fire-and-forget delivery</strong>. When an agent publishes, subscribers are notified via <code>Promise.allSettled()</code>. Slow or failing subscribers never block the publisher. Handler errors are caught and logged, never thrown.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Publisher (security-auditor)
await eventBus.publish("vulnerability_found", "security-auditor", {
  severity: "critical",
  type: "sql-injection",
  location: "src/api/users.ts:42",
});
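
// Sketch of the fan-out that publish() is described as doing. `notifySubscribers` is a
// hypothetical helper for illustration, not the actual AgentEventBus internals.
async function notifySubscribers(
  handlers: ((event: unknown) => unknown)[],
  event: unknown,
) {
  // The async wrapper turns synchronous throws into rejections too.
  const settled = await Promise.allSettled(handlers.map(async (handler) => handler(event)));
  const failures = settled.filter((result) => result.status === "rejected");
  for (const failure of failures) console.error("subscriber failed:", failure);
  return failures.length; // handler errors are logged and counted, never rethrown
}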

// Subscriber (code-reviewer) registered at startup
eventBus.subscribe("vulnerability_found", async (event) =&gt; {
  // Incorporate into review findings
});</code></pre></div><p>Late-joining subscribers can replay event history (last 50 per topic). Events auto-expire via a 5-minute TTL, with cleanup triggered every 100 events, which keeps memory bounded without needing a background timer. Source filtering lets subscribers receive events only from specific agents.</p><h3>4. Live Evaluation with Sampling</h3><p>Production traffic gets evaluated by moderation and relevance scorers at configurable sampling rates. Moderation runs on 20% of requests (cheap); relevance scoring runs on 10% (LLM judge, expensive). Both run asynchronously and never block the user-facing response.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sZJp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sZJp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 424w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 848w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1272w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!sZJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png" width="1456" height="799" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:508794,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sZJp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 424w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 848w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1272w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>What I&#8217;d Change: Real Talk</h2><p><strong>The in-memory event bus doesn&#8217;t survive restarts.</strong> Fine for a personal side-project agent. Terrible for a production system serving a team. Durable workflow engines like <a href="https://www.prefect.io/">Prefect</a>, <a href="https://temporal.io/">Temporal</a>, or <a href="https://www.dbos.dev/">DBOS</a> solve this really well with less infrastructure overhead than rolling your own durability with Redis Streams or NATS. Current design optimizes for simplicity, not resilience. 
I made that trade knowingly.</p><p><strong>128 subagents is ridiculous.</strong> Pareto wins: ~20 agents do 80%+ of the work. The long tail exists because adding a subagent costs basically nothing (factory + registry entry). I should prune the ones that never get used. Future me will regret not doing that earlier.</p><p><strong>SurrealDB as a unified store is elegant but young.</strong> Graph + vector + relational in one database? Architecturally clean. But doc gaps cost me time. For strict SLAs, I would use Postgres plus pgvector. For graph queries of 2 to 3 hops, Postgres is enough. Use Neo4j when you need deep graph traversals. I chose SurrealDB to move fast with one DB. Wouldn&#8217;t push it on a team with compliance requirements.</p><p><strong>Haiku routing adds ~500ms latency.</strong> Noticeable in interactive chat. Negligible in background workflows. For latency-critical paths, consider static routing rules (if the tool is <code>security_scan</code>, always use the reasoning tier) and only invoke the dynamic router for ambiguous tasks.</p><p><strong>Workflow suspend/resume is powerful but adds state complexity.</strong> The 70+ workflows support human-in-the-loop via suspend/resume: the workflow pauses, waits for human input, and continues. Great for approval flows (expense reports, code reviews). Terrible for state management. Every suspended workflow is a piece of state that can go stale. I&#8217;ve had workflows suspended for weeks because I forgot about them. Take that as you will.</p><h3>The Elephant in the Room: Multi-Agent Is Hard for Everyone</h3><p>This applies to Screech and everything else: Claude Code with Skills, custom orchestrators, the lot.</p><p><strong>Overconfidence</strong>. One wrong assumption in step 2 of a 10-step workflow and you get a confidently wrong result. Isolated specialists with focused prompts help, but don&#8217;t remove it. 
I still see invented APIs and wrong architectural assumptions.</p><p><strong>More agents do not mean better output.</strong> Coordination overhead, conflicting findings, more for the human to reconcile. The findings cache and event bus make communication explicit and traceable, but someone still has to review. Synthesis is an LLM summarizing other LLMs. The chain can be long.</p><p><strong>Oversight tax</strong>. You spend more time reviewing and redirecting than writing. PR review times go up in high-adoption teams (e.g., +91%). Comprehension debt: the more you delegate, the less you understand your codebase. Review becomes rubber-stamping. Screech does not fix that.</p><p><strong>Token bloat</strong>. Tool schemas, prompts, skills. You can blow past 50k tokens before the agent does useful work. I keep tool profiles tight (7&#8211;18 per agent). Complex runs still burn tokens.</p><p><strong>Credentials</strong>. For a side project, it&#8217;s manageable. For production with real APIs and DBs, auth and secrets become a project of their own. A lot of agent efforts reportedly fail to scale there. Not the AI. The plumbing. (My therapist has asked me not to elaborate.)</p><p>I&#8217;m building Screech knowing these limits. The design mitigates some of them. It does not remove them. Multi-agent amplifies both capability and failure modes. Build guardrails.</p><h2>Standing on Shoulders</h2><p><a href="https://github.com/VoltAgent/voltagent">VoltAgent</a>. TypeScript agent framework. Runtime, memory, workflows, MCP, observability. Saved me months.</p><p><a href="https://github.com/getzep/graphiti">Graphiti</a> (Zep). Temporal knowledge graph. Episodes, bi-temporal awareness. &#8220;Knowledge changes&#8221; shaped my RAG thinking.</p><p><a href="https://www.decodingai.com/p/stop-converting-documents-to-text">Decoding AI, AI Agents Foundations</a> (Paul Iusztin). Treat docs as images, not OCR. 
ReAct, tools, memory.</p><p><a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a> (Justin Narracott). Traceable, auditable decisions as a performance pattern. Findings cache and review chains owe a lot here.</p><p><a href="https://github.com/VectifyAI/PageIndex">PageIndex</a> (Vectify AI). Reasoning over document structure instead of pure similarity. Path 3 (tree-search) is inspired by this.</p><h2>Three Patterns Worth Stealing</h2><p>You don&#8217;t need 128 subagents or a temporal knowledge graph. Here are three ideas that transfer to any multi-agent system:</p><ol><li><p><strong>Route cheap before routing expensive.</strong> A $0.0025 classification call that routes 30% of tasks to a model that is 90% cheaper? Pays for itself on the first batch. Even without subagents, using a small model to decide whether a task needs your large model is almost always worth it.</p></li><li><p><strong>Not every agent needs every tool.</strong> Tool profiles cut token usage, improve tool selection accuracy, and make prompts focused. A research analyst with 7 tools outperforms the same analyst drowning in 18 tools they&#8217;ll never use.</p></li><li><p><strong>Hybrid retrieval beats any single method.</strong> Vector search handles 70% of queries. Graph traversal and keyword matching cover the other 30% (structural queries, exact-match lookups, relationship questions that embeddings silently botch).</p></li></ol><p>The multi-agent pattern isn&#8217;t inherently better. It&#8217;s a trade: quality per task versus orchestration complexity. Start with a single capable agent. When quality degrades across diverse tasks, reach for these patterns. The hard part isn&#8217;t the agents. It&#8217;s the routing, the retrieval, the resilience.</p><p><em>Screech runs on <a href="https://github.com/VoltAgent/voltagent">VoltAgent</a> (agent framework), SurrealDB (multi-model database), and Anthropic Claude (LLM). 
Architectural inspiration from <a href="https://github.com/getzep/graphiti">Graphiti</a>, <a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a>, <a href="https://github.com/VectifyAI/PageIndex">PageIndex</a>, and the <a href="https://www.decodingai.com/">Decoding AI</a> community. Built for personal side-project workloads. Adapt the patterns to your scale.</em></p><p>&#8216;Till next time</p><p><a href="https://substack.com/@lucianlature">Lucian Lature</a> | <a href="https://www.linkedin.com/in/lucianlature/">LinkedIn</a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. 
Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. 
</strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[How to Design Evaluators That Catch What Actually Breaks]]></title><description><![CDATA[The practical guide to code-based checks, LLM judges, and rubrics for real-world AI apps]]></description><link>https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures</link><guid isPermaLink="false">https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures</guid><dc:creator><![CDATA[Paolo Perrone]]></dc:creator><pubDate>Tue, 03 Mar 2026 12:02:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!a1uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a><strong> </strong></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a> </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals </a></p></li><li><p><strong>How to Design Evaluators</strong> &#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator </a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h2>How to Design Evaluators</h2><p>You have a dataset. You&#8217;ve manually labeled examples. You&#8217;ve fixed the obvious bugs. Now you need evaluators that can run automatically and catch problems before users do.</p><p>But here&#8217;s what trips up most teams: they build evaluators that check for things nobody cares about, or they use off-the-shelf metrics that sound impressive but don&#8217;t match their actual use case.</p><p>Three months ago, I spent a weekend building what I thought was a comprehensive evaluation suite for an AI agent that drafted replies to customer support tickets. 
I had ROUGE scores, BLEU scores, semantic similarity metrics, the works. Everything from the NLP textbook.</p><p>Then I ran it on production traces. The evaluators gave perfect scores to replies that were factually wrong, missed the customer&#8217;s actual question, and used the wrong tone for frustrated users. Meanwhile, they penalized perfectly good replies for using &#8220;different words than the reference answer.&#8221;</p><p>That&#8217;s when I realized: generic metrics optimize for academic benchmarks, not business outcomes. (And no, I&#8217;m not saying academic metrics are useless. They&#8217;re just solving a different problem than &#8220;did this agent do what my users needed?&#8221;)</p><p>The solution is to design evaluators that match your specific success criteria. Not what worked for someone else&#8217;s summarization task. Not what scored well on SQuAD. What actually matters for your users in your use case.</p><p><strong>In this article, we will cover:</strong></p><ul><li><p>The evaluation harness: infrastructure that runs evals end-to-end</p></li><li><p>Dataset and metric types: direct scoring vs. pairwise vs. reference-based</p></li><li><p>Model evaluation vs. app evaluation (and why benchmarks lie)</p></li><li><p>Components of an evaluator: reference examples, metrics, rubrics</p></li><li><p>When to use code-based checks vs. 
LLM judges</p></li><li><p>Common mistakes (and how to avoid them)</p></li><li><p>Advanced metric designs for multi-turn conversations and agentic workflows</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a1uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Designing evaluators for AI applications: from code-based checks to LLM judges.&quot;,&quot;title&quot;:&quot;Designing evaluators for AI applications: from code-based checks to LLM judges.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Designing evaluators for AI applications: from code-based checks to LLM judges." title="Designing evaluators for AI applications: from code-based checks to LLM judges." srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption"><em>Image 1: Designing evaluators for AI applications: from code-based checks to LLM judges.</em></figcaption></figure></div><p><em>Before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. 
</p><p>We use Opik daily across our courses and AI products. Not just for observability, but to design and run the exact evaluators this article teaches: custom LLM judges, code-based checks, and experiments. All from the same platform.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oSDm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p>This article shows you how to design evaluators. Opik gives you the harness to run them at scale. 
Here is how we use it:</p><ul><li><p><strong>Custom LLM judges with rubrics</strong>: Build the evaluators this article describes: define your criteria, add few-shot examples, and run them across hundreds of traces automatically.</p></li><li><p><strong>Run experiments, compare results</strong>: Test different prompts, models, or configurations side by side. Opik scores each variant with your evaluators and shows you which one wins.</p></li><li><p><strong>Plug evaluators into production</strong>: The same LLM judges you design for testing run on live traces too. Set up alerts for when scores drop below your threshold, so you catch regressions before users do.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully <strong>open-source</strong> and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier):</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><h2>Understanding the Evaluation Harness</h2><p>You can&#8217;t manually run 500 test cases. You need automation.</p><p>The infrastructure that runs evals end-to-end is called an <strong>evaluation harness</strong> [1]. 
It loads your dataset, executes your agent on each test case, captures all the outputs and traces, runs your graders, and aggregates the scores into something you can actually use.</p><p>Think of it like pytest for AI apps. Except instead of checking if a function returns the right type, you&#8217;re checking if an LLM generated text that accomplishes a business goal.</p><p>Here&#8217;s what a harness does:</p><ol><li><p><strong>Loads tasks</strong> from your evaluation dataset</p></li><li><p><strong>Provides instructions and tools</strong> to the agent (system prompts, available functions, etc.)</p></li><li><p><strong>Runs tasks</strong> (often in parallel across multiple trials because LLM outputs vary)</p></li><li><p><strong>Records everything</strong>: inputs, outputs, tool calls, reasoning traces, intermediate states</p></li><li><p><strong>Runs graders</strong> on the results (your evaluators)</p></li><li><p><strong>Aggregates scores</strong> across trials and tasks</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tu0r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tu0r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!tu0r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tu0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics." title="The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics." 
srcset="https://substackcdn.com/image/fetch/$s_!tu0r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics.</em></figcaption></figure></div><p>Without a harness, you&#8217;re manually running your agent on test cases and eyeballing the output. With a harness, you run 500 test cases overnight and wake up to a report showing exactly which failure categories spiked [1].</p><p>The harness is separate from your evaluators. The evaluators decide what &#8220;good&#8221; means. The harness handles the boring work of running everything at scale and collecting results.</p><p>Popular harness options include <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> (what we use), Braintrust, LangSmith, and open-source frameworks like Promptfoo. But honestly, you can build a minimal harness in ~100 lines of Python if you need custom logic [1]. The hard part isn&#8217;t the infrastructure; it&#8217;s assembling the right context (system prompts, conversation history, retrieved docs, tools) for each task. Whichever option you pick, the key is having a harness at all. Don&#8217;t run evals manually.</p><p>Now let&#8217;s talk about what those evaluators actually check.</p><h2>Dataset and Metric Types: Three Ways to Grade</h2><p>When designing an evaluator, you need to pick a grading strategy. 
There are three main approaches, each suited for different situations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2d8b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2d8b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2d8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation." title="Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation." srcset="https://substackcdn.com/image/fetch/$s_!2d8b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation.</em></figcaption></figure></div><h3>1. Direct Scoring (Pointwise Evaluation)</h3><p>The evaluator looks at a single output and scores it in isolation. No comparison to anything else.</p><p><strong>Example:</strong></p><ul><li><p>Input: &#8220;Refund my order #12345&#8221;</p></li><li><p>Output: &#8220;I&#8217;ve processed your refund for order #12345. 
You&#8217;ll see the credit in 3-5 business days.&#8221;</p></li><li><p>Score: Pass (correctly identified the task, provided timeline, professional tone)</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>You have clear, absolute quality criteria (was it helpful? was it safe? did it call the right tool?)</p></li><li><p>You want to track performance over time on the same dataset</p></li><li><p>Your baseline is &#8220;good enough,&#8221; not &#8220;better than X&#8221;</p></li></ul><p><strong>Metrics:</strong></p><ul><li><p>Binary pass/fail</p></li><li><p>0-1 scores (where 1 = perfect)</p></li><li><p>Classification labels (Helpful/Neutral/Harmful)</p></li></ul><h3>2. Pairwise Comparison</h3><p>The evaluator compares two outputs and picks which one is better.</p><p><strong>Example:</strong></p><ul><li><p>Input: &#8220;Refund my order #12345&#8221;</p></li><li><p>Output A: &#8220;Refund processed.&#8221;</p></li><li><p>Output B: &#8220;I&#8217;ve processed your refund for order #12345. You&#8217;ll see the credit in 3-5 business days.&#8221;</p></li><li><p>Winner: Output B (more informative, sets expectations)</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>Comparing two model versions (baseline vs. candidate)</p></li><li><p>A/B testing different prompts</p></li><li><p>You need a relative judgment: LLMs are better at ranking than at absolute scoring</p></li></ul><p><strong>Watch out for biases [2]:</strong></p><ul><li><p><strong>Position bias</strong>: LLMs favor the first or last response shown</p></li><li><p><strong>Verbosity bias</strong>: LLMs prefer longer answers even when they&#8217;re not better</p></li><li><p><strong>Self-enhancement bias</strong>: LLMs favor their own outputs over those of other models</p></li></ul><p>You can mitigate these by randomizing the response order and running multiple trials.</p><h3>3. 
Reference-Based Evaluation</h3><p>The evaluator compares the output to a known &#8220;gold standard&#8221; answer.</p><p><strong>Example 1 (Exact match):</strong></p><ul><li><p>Input: &#8220;What&#8217;s the capital of France?&#8221;</p></li><li><p>Output: &#8220;Paris&#8221;</p></li><li><p>Reference: &#8220;Paris&#8221;</p></li><li><p>Score: Exact match (Pass)</p></li></ul><p><strong>Example 2 (Semantic equivalence):</strong></p><ul><li><p>Input: &#8220;Summarize the refund policy&#8221;</p></li><li><p>Output: &#8220;Customers can return items within 30 days for a full refund if unused.&#8221;</p></li><li><p>Reference: &#8220;Full refunds are available for unused products returned within 30 days of purchase.&#8221;</p></li><li><p>Score: Pass (different wording, same meaning)</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>You have ground truth answers (FAQs, knowledge bases, structured tasks)</p></li><li><p>The task has a single correct answer or a small set of acceptable answers</p></li><li><p>You&#8217;re testing retrieval accuracy or factual correctness</p></li></ul><p><strong>How to measure:</strong></p><ul><li><p><strong>Exact match</strong>: For structured outputs (dates, product IDs, categorical values)</p></li><li><p><strong>Semantic similarity / LLM judges:</strong> For natural language, where multiple phrasings are valid (summaries, explanations, instructions)</p></li></ul><p><strong>Common metrics [3]:</strong></p><ul><li><p>Exact match</p></li><li><p>ROUGE (recall-oriented, good for summarization)</p></li><li><p>BLEU (precision-oriented, originally for translation)</p></li><li><p>BERTScore (semantic similarity using embeddings)</p></li><li><p>LLM judges (for nuanced semantic equivalence)</p></li></ul><p><strong>The trap:</strong> Exact match metrics penalize valid variations. If your reference says &#8220;The meeting is on Friday&#8221; and your agent says &#8220;The meeting is scheduled for this Friday,&#8221; exact match fails. 
This is where semantic similarity metrics (BERTScore) or LLM judges become powerful: they can recognize that different phrasings convey the same meaning.</p><h2>Model Evaluation vs. App Evaluation (Why Benchmarks Lie)</h2><p>Here&#8217;s a distinction that matters more than people realize:</p><p><strong>Model evaluation</strong> measures the LLM itself, in isolation, on generic tasks. This is what benchmarks like MMLU, HumanEval, and Chatbot Arena do.</p><p><strong>App evaluation</strong> measures your entire application (LLM + prompts + tools + retrieval + business logic) on your specific use case.</p><p>A high MMLU score doesn&#8217;t mean the model handles your refund policy correctly. Benchmarks test general capability. You need to test your specific use case.</p><h3>Model Evaluation (Benchmarks)</h3><p>Tests: &#8220;Can this LLM answer random trivia, write code snippets, or score high on standardized tests?&#8221;</p><p><strong>Useful for:</strong></p><ul><li><p>Comparing foundation models across the board</p></li><li><p>Understanding general capabilities</p></li><li><p>Academic research</p></li></ul><p><strong>Useless for:</strong></p><ul><li><p>Predicting whether the model will handle your refund policy correctly</p></li><li><p>Knowing if it will escalate frustrated customers at the right time</p></li><li><p>Determining if it respects your company&#8217;s tone of voice</p></li></ul><h3>App Evaluation (What You Actually Need)</h3><p>Tests: &#8220;Does my customer support agent correctly process refunds, handle escalations, and follow our policies?&#8221;</p><p><strong>This is what matters</strong> because your users don&#8217;t care if GPT-5 scored 95% on MMLU. They care if it solved their problem.</p><p>Your evaluators must be grounded in your business use case, not generic academic benchmarks. 
This means:</p><ul><li><p>Testing against your actual policies, not Wikipedia facts</p></li><li><p>Using your real user queries, not synthetic textbook questions</p></li><li><p>Measuring outcomes that impact revenue, retention, or safety</p></li></ul><p>Benchmarks tell you which LLM is &#8220;generally smarter.&#8221; App evals tell you which version of your system works better for your users.</p><p>Don&#8217;t mistake one for the other.</p><h2>Components of an Evaluator</h2><p>Now that you know the types, let&#8217;s build one. Every evaluator has three components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b_Ph!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b_Ph!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three components of every evaluator: reference examples, metrics, and rubrics.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three components of every evaluator: reference examples, metrics, and rubrics." title="The three components of every evaluator: reference examples, metrics, and rubrics." 
srcset="https://substackcdn.com/image/fetch/$s_!b_Ph!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: The three components of every evaluator: reference examples, metrics, and rubrics.</em></figcaption></figure></div><h3>1. Reference Examples (Few-Shot Prompts)</h3><p>These are the labeled examples from your dataset. They show the evaluator what &#8220;good&#8221; and &#8220;bad&#8221; look like for your specific task.</p><p>Remember from Article 2: the real power isn&#8217;t in the system prompt, it&#8217;s in these few-shot examples. They encode your domain expert&#8217;s judgment.</p><p><strong>Example:</strong></p><p><strong>Example 1 - PASS</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Input: &#8220;I need a refund for order #12345&#8221;
Output: &#8220;I&#8217;ve processed your refund. You&#8217;ll see the credit in 3-5 business days.&#8221;
Reason: Confirms action, sets timeline, professional tone.
</code></pre></div><p><strong>Example 2 - FAIL</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Input: &#8220;Can you waive the late fee on my account?&#8221;
Output: &#8220;I can help with that!&#8221;
Reason: Didn&#8217;t actually take action or explain next steps. Empty promise.
</code></pre></div><h3>2. Metrics</h3><p>The quantifiable measurement of quality. This can be:</p><ul><li><p><strong>Objective</strong>: Did it call the right tool? Is the JSON valid? Is the response under 200 words?</p></li><li><p><strong>Subjective</strong>: Was it helpful? Was the tone appropriate? Did it follow the conversation flow?</p></li></ul><p>For objective metrics, use code-based checks (fast, cheap, deterministic).</p><p>For subjective metrics, use LLM judges or human evaluation.</p><h3>3. Rubrics</h3><p>For subjective metrics, you need a rubric: explicit criteria that define what you&#8217;re measuring.</p><p><strong>Bad rubric:</strong><br><em>&#8220;Was the response helpful?&#8221;</em></p><p>(Too vague. Helpful how? To whom? Compared to what?)</p><p><strong>Good rubric:</strong><br><em>&#8220;Did the response: (1) correctly identify the user&#8217;s request, (2) provide a specific action or next step, (3) include a timeline or expectation, and (4) maintain professional tone?&#8221;</em></p><p>Rubrics force precision. They make subjective judgments repeatable. These criteria become part of your LLM judge&#8217;s system prompt.</p><h2>Code-Based Evaluators: Fast, Cheap, Objective</h2><p>Some checks are deterministic. Did the agent call <code>refund_order()</code>? Is the output valid JSON? Does it include a required disclaimer?</p><p>Use code for these. 
It&#8217;s faster, cheaper, and never gives you a different answer on the same input.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-byG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-byG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!-byG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-byG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content." title="Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content." srcset="https://substackcdn.com/image/fetch/$s_!-byG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!-byG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content.</em></figcaption></figure></div><p><strong>Use code-based evaluators for:</strong></p><ul><li><p><strong>Tool calls</strong>: Did it call <code>refund_order()</code> with the right parameters?</p></li><li><p><strong>Format checks</strong>: Is the output valid JSON? Is it under the character limit?</p></li><li><p><strong>Required elements</strong>: Does it include a disclaimer? 
Does it have a timestamp?</p></li><li><p><strong>Prohibited content</strong>: Does it contain banned phrases or leaked data?</p></li></ul><p><strong>Example (pseudocode):</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def evaluate_refund_agent(trace):
  # Check if the right tool was called
  if "refund_order" not in trace.tool_calls:
    return {"pass": False, "reason": "Didn't call refund_order"}

  # Check if order_id parameter was provided
  params = trace.tool_calls["refund_order"].parameters
  if "order_id" not in params:
    return {"pass": False, "reason": "Missing order_id parameter"}

  # Check if response includes timeline
  if not any(word in trace.output.lower() for word in ["days", "week", "timeline"]):
    return {"pass": False, "reason": "No timeline provided to customer"}

  return {"pass": True, "reason": "All checks passed"}</code></pre></div><p>Code-based evaluators are:</p><ul><li><p><strong>Fast</strong>: Milliseconds per check</p></li><li><p><strong>Cheap</strong>: No API costs</p></li><li><p><strong>Reproducible</strong>: The same input always gives the same result</p></li><li><p><strong>Easy to debug</strong>: When they fail, you know exactly what broke</p></li></ul><p>But they can&#8217;t handle nuance. They can&#8217;t judge tone, helpfulness, or conversational flow. For that, you need LLM judges.</p><p>Code-based evaluators work like the classic unit tests you&#8217;re already familiar with. That&#8217;s why you should always try to implement code-based checks first: if you can check it with code, do that, and reach for LLM judges only when code can&#8217;t capture what you need to measure.</p><h2>LLM Judges: Flexible, Scalable, Nuanced</h2><p>An <strong>LLM judge</strong> is an LLM that grades another LLM&#8217;s output. 
You give it the task, the output, and the evaluation criteria, and it returns a score with reasoning</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L6EX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L6EX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L6EX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;LLM judge flow: input context and criteria produce a score with reasoning.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LLM judge flow: input context and criteria produce a score with reasoning." title="LLM judge flow: input context and criteria produce a score with reasoning." srcset="https://substackcdn.com/image/fetch/$s_!L6EX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 6: LLM judge flow: input context and criteria produce a score with reasoning.</em></figcaption></figure></div><p>LLM judges work in two modes: evaluating outputs against absolute criteria (is it helpful? professional? accurate?) or comparing outputs to reference answers when you have ground truth but need semantic understanding rather than exact string matching.</p><p><strong>Use LLM judges for:</strong></p><ul><li><p><strong>Tone</strong>: Was it empathetic? Professional? 
Not condescending?</p></li><li><p><strong>Helpfulness</strong>: Did it actually answer the question or deflect?</p></li><li><p><strong>Conversation flow</strong>: Did it maintain context across turns?</p></li><li><p><strong>Reasoning quality</strong>: Did the agent&#8217;s plan make sense?</p></li></ul><p><strong>How it works:</strong></p><ol><li><p>You provide:</p><ul><li><p>The input (user query)</p></li><li><p>The output (agent&#8217;s response)</p></li><li><p>The context (system prompt, retrieved docs, conversation history)</p></li><li><p>Evaluation criteria (what you&#8217;re checking for)</p></li><li><p>Few-shot examples (labeled passes and fails)</p></li></ul></li><li><p>The LLM judge outputs:</p><ul><li><p>A score (pass/fail or 0-1 scale)</p></li><li><p>A critique explaining why</p></li></ul></li></ol><p><strong>Example prompt (simplified):</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">You are evaluating customer support responses. For each trace, output Pass or Fail 
with reasoning.

Evaluation criteria:
1. Did the response correctly identify the customer&#8217;s request?
2. Did it provide a specific action or next step?
3. Did it include a timeline or expectation?
4. Did it maintain a professional tone?

Here are examples of how a domain expert judged similar cases:

[Few-shot examples from your labeled dataset]

Now evaluate this trace:
Input: [customer query]
Output: [agent response]
Context: [system prompt, policies]</code></pre></div><p>The judge generates: </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">FAIL

The response correctly identified the refund request (criterion 1: pass) and 
maintained professional tone (criterion 4: pass). However, it didn&#8217;t specify a next 
step beyond &#8220;we&#8217;ll look into this&#8221; (criterion 2: fail) and provided no timeline 
(criterion 3: fail). Customer is left waiting with no expectations set.</code></pre></div><h3>Strengths of LLM Judges</h3><ul><li><p><strong>Flexible</strong>: Handle open-ended tasks that code-based checks can&#8217;t</p></li><li><p><strong>Scalable</strong>: Grade thousands of traces automatically</p></li><li><p><strong>Explainable</strong>: Critiques show reasoning, helping debug failures</p></li></ul><h3>Weaknesses of LLM Judges</h3><ul><li><p><strong>Non-deterministic</strong>: The same input can get different scores across runs</p></li><li><p><strong>Expensive</strong>: Every evaluation is an API call</p></li><li><p><strong>Needs calibration</strong>: Must align with human judgment (we cover this in Article 5)</p></li></ul><h3>Making LLM Judges More Stable</h3><ol><li><p><strong>Use the most capable model</strong> (e.g., Claude Opus, GPT-4o) (4)</p></li><li><p><strong>Add chain-of-thought reasoning</strong> before scoring (&#8220;Let&#8217;s think step-by-step...&#8221;)</p></li><li><p><strong>Control for verbosity bias</strong> (normalize response lengths)</p></li><li><p><strong>Run multiple trials</strong> and average scores for critical evals</p></li><li><p><strong>Increase dataset size</strong> to at least 50-100 samples (reduces noise)</p></li></ol><h2>Common Mistakes (And How to Avoid Them)</h2><h3>Mistake 1: Not Providing Critiques</h3><p><strong>Wrong:</strong><br>Score: 1</p><p><strong>Right:</strong><br>Score: 1</p><p>Critique: <em>&#8220;Response correctly identified the refund request but didn&#8217;t provide a timeline. Customer left without expectations.&#8221;</em></p><p>Critiques are not optional. They&#8217;re how you debug failures and train better evaluators.</p><h3>Mistake 2: Overly Terse Critiques</h3><p><strong>Wrong:</strong><br>&#8220;Bad tone&#8221;</p><p><strong>Right:</strong><br><em>&#8220;Response used dismissive language (&#8216;just wait&#8217;) when customer expressed frustration about a delayed order. 
Should have acknowledged frustration and provided specific next steps.&#8221;</em></p><p>The critique should be detailed enough to serve as a few-shot example later.</p><h3>Mistake 3: Missing Context</h3><p>Don&#8217;t evaluate the output in isolation. Give the evaluator everything a human would see:</p><ul><li><p>The full conversation history (for multi-turn tasks)</p></li><li><p>Retrieved documents (for RAG)</p></li><li><p>System prompts (for understanding constraints)</p></li><li><p>Tool call results (for agentic workflows)</p></li></ul><p>If a human needs it to judge quality, the evaluator needs it too.</p><h3>Mistake 4: Not Providing Diverse Examples</h3><p>If all your few-shot examples are &#8220;customer angry, agent apologizes,&#8221; the judge won&#8217;t know how to handle &#8220;customer confused, needs technical explanation.&#8221;</p><p>Cover the failure modes you actually see in production.</p><h3>Mistake 5: Using Ready-Made Metrics Without Validation</h3><p>ROUGE, BLEU, BERTScore, etc. sound professional, but they might not correlate with your actual goal.</p><p>Before using any metric, validate it against human judgment on your specific task. If high ROUGE doesn&#8217;t mean &#8220;users are happy,&#8221; don&#8217;t optimize for ROUGE.</p><h3>Mistake 6: Using 1-5 Scales Instead of Binary Pass/Fail</h3><p><strong>Wrong:</strong><br>Score: 3.2 out of 5</p><p><strong>Right:</strong><br>Score: 0 (Fail)<br>Critique: <em>&#8220;Response didn&#8217;t provide a timeline or next steps.&#8221;</em></p><p>Why it matters: A score of 3.2 is ambiguous. Is that good enough to ship? Should you fix it? Binary forces clarity. Either it passes your quality bar or it doesn&#8217;t. 
Scoring on a float scale (0.0-1.0) has the same problem: it leaves room for interpretation instead of forcing a clear decision.</p><h2>When Should I Use Similarity Metrics (BERTScore, ROUGE, etc.)?</h2><p>Short answer: <strong>Only for specific, narrow tasks where overlap with a reference answer actually matters.</strong></p><h3>When They Work</h3><p><strong>Summarization:</strong> ROUGE measures n-gram overlap between the generated summary and a reference summary, with an emphasis on recall. If your task is &#8220;don&#8217;t miss key facts&#8221; and your reference summaries contain those facts, ROUGE helps.</p><p><strong>Translation:</strong> BLEU checks n-gram overlap with reference translations. It works when there&#8217;s a narrow acceptable output space.</p><p><strong>Retrieval accuracy:</strong> BERTScore compares retrieved chunks with expected documents using contextual embeddings rather than exact word overlap.</p><h3>When They Fail</h3><p><strong>Open-ended generation:</strong> Your AI agent says &#8220;I&#8217;ve refunded order #12345. You&#8217;ll see the credit in 3-5 days.&#8221; Reference says &#8220;Refund processed for order #12345, expect 3-5 business days.&#8221; Different words, same meaning. 
ROUGE, which counts overlapping words, penalizes the correct answer.</p><p><strong>Tone and helpfulness:</strong> Similarity metrics don&#8217;t measure whether the tone was appropriate or whether the response actually helped the user.</p><p><strong>Business outcomes:</strong> High similarity doesn&#8217;t mean the customer is satisfied, the sale closed, or the task completed.</p><h3>The Rule</h3><p>If your success criterion is &#8220;output should be semantically similar to the reference answer,&#8221; use similarity metrics.</p><p>If your success criterion is <em>&#8220;user achieved their goal,&#8221;</em> use app-level evaluators grounded in outcomes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ovy3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ovy3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Decision tree for choosing the right evaluator type.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Decision tree for choosing the right evaluator type." title="Decision tree for choosing the right evaluator type." 
srcset="https://substackcdn.com/image/fetch/$s_!Ovy3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: Decision tree for choosing the right evaluator type.</em></figcaption></figure></div><h2>Advanced Metric Designs</h2><p>Now let&#8217;s handle the hard cases: multi-turn conversations, complex workflows, and agentic systems.</p><h3>Evaluating Multi-Turn Conversation Traces</h3><p>A single-turn eval checks one input and one output. Multi-turn evals check entire conversations.</p><p><strong>Challenges:</strong></p><ul><li><p>Context must carry across turns</p></li><li><p>Errors compound (one bad response derails the rest)</p></li><li><p>You need to catch the <strong>first upstream failure</strong>, not downstream symptoms</p></li></ul><p><strong>Strategy:</strong></p><ol><li><p><strong>End-to-end task success</strong>: Did the agent accomplish the user&#8217;s goal by the end?</p></li><li><p><strong>Turn-by-turn checks</strong>: Evaluate each exchange individually</p><ul><li><p>Did turn 3 maintain context from turn 1?</p></li><li><p>Did turn 5 escalate when the user got frustrated?</p></li></ul></li><li><p><strong>Failure attribution</strong>: When something breaks, find the first turn where it went wrong</p></li></ol><p><strong>Example (customer support conversation):</strong></p><p><strong>Turn 1:</strong></p><p>User: <em>&#8220;I need to return order #12345&#8221;</em></p><p>Agent: <em>&#8220;Sure, I can help with that. What&#8217;s the reason for the return?&#8221;</em></p><p>Eval: Pass (acknowledged request, asked clarifying question)</p><p><strong>Turn 2:</strong></p><p>User: <em>&#8220;It arrived damaged&#8221;</em></p><p>Agent: <em>&#8220;I&#8217;ll process a refund. 
Expect 3-5 business days.&#8221;</em></p><p>Eval: Fail (skipped required step: didn&#8217;t offer a replacement or ask for photos of the damage)</p><p><strong>Turn 3:</strong></p><p>User: <em>&#8220;Do I need to ship it back?&#8221;</em></p><p>Agent: <em>&#8220;No, keep it.&#8221;</em></p><p>Eval: Pass (the answer is reasonable on its own; the workflow was already broken in Turn 2)</p><p>The <strong>first upstream failure</strong> is Turn 2. Everything after is a consequence.</p><p><strong>Important</strong>: When evaluating any turn, provide all previous turns as context. Evaluating Turn 2? Include Turn 1. Evaluating Turn 3? Include Turns 1 and 2. The evaluator needs the full conversation history to judge whether context was properly maintained.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F9xg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F9xg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F9xg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F9xg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-turn conversation evaluation with first upstream failure attribution.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-turn conversation evaluation with first upstream failure attribution." title="Multi-turn conversation evaluation with first upstream failure attribution." 
srcset="https://substackcdn.com/image/fetch/$s_!F9xg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 8: Multi-turn conversation evaluation with first upstream failure attribution.</em></figcaption></figure></div><h3>Evaluating Complex Multi-Step Workflows</h3><p>Workflows have dependencies. Step 3 can&#8217;t succeed if Step 1 failed. Your evaluator needs to know this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xiHl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xiHl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xiHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evaluating complex multi-step workflows with dependency-aware sequencing.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evaluating complex multi-step workflows with dependency-aware sequencing." title="Evaluating complex multi-step workflows with dependency-aware sequencing." 
srcset="https://substackcdn.com/image/fetch/$s_!xiHl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 9: Evaluating complex multi-step workflows with dependency-aware sequencing.</em></figcaption></figure></div><p><strong>Example (flight booking agent):</strong></p><p>Required sequence:</p><ol><li><p>Search flights</p></li><li><p>Validate availability</p></li><li><p>Confirm payment</p></li><li><p>Book reservation</p></li></ol><p><strong>Bad eval:</strong> Check if all steps ran (yes/no)</p><p><strong>Good eval:</strong> Check if steps ran in the right order, with correct dependencies</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2pwF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2pwF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 424w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 848w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2pwF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2pwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png" width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!2pwF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 424w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 848w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2pwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Evaluating Agentic Workflows</h3><p>Agents don&#8217;t follow fixed scripts. They plan, reason, and adapt. 
This makes evaluation harder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TV0w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TV0w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TV0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics." title="Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics." srcset="https://substackcdn.com/image/fetch/$s_!TV0w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption"><em>Image 10: Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics.</em></figcaption></figure></div><p><strong>Two-phase approach</strong> (from Hamel Husain) (5):</p><h3>Phase 1: End-to-End Task Success</h3><p>Treat the agent as a black box. Did it meet the user&#8217;s goal?</p><p><strong>Define precise success rules per task:</strong></p><ul><li><p>Exact answer match (for factual tasks)</p></li><li><p>Correct side-effect (database updated, email sent, file created)</p></li><li><p>User satisfaction (thumbs up, complaint rate, retry rate)</p></li></ul><p>Use human judges or well-aligned LLM judges. 
<strong>Focus on the first upstream failure</strong> in each trace during error analysis.</p><h3>Phase 2: Step-Level Diagnostics</h3><p>Once you know which workflows fail, diagnose why.</p><p>Assuming you&#8217;ve instrumented your system to log tool calls and responses, score:</p><ol><li><p><strong>Tool choice</strong>: Was the selected tool appropriate?</p></li><li><p><strong>Parameter extraction</strong>: Were inputs complete and well-formed?</p></li><li><p><strong>Error handling</strong>: Did the agent recover from empty results or API failures?</p></li><li><p><strong>Context retention</strong>: Did the agent preserve earlier constraints?</p></li><li><p><strong>Plan quality</strong>: Does the agent&#8217;s plan match the task requirements?</p></li></ol><p><strong>Transition matrix analysis</strong> (Bryan Bischof&#8217;s approach):</p><p>Count failures per state transition to see which transitions break most often.</p><p>Example (text-to-SQL agent):</p><ul><li><p>GenSQL &#8594; ExecSQL: 12 failures</p></li><li><p>DecideTool &#8594; PlanCal: 2 failures</p></li></ul><p>This data-driven view shows where to focus debugging.</p><p><strong>Session-level metrics:</strong></p><ul><li><p>Task completion rate</p></li><li><p>Step completion (did it finish the required steps?)</p></li><li><p>Trajectory quality (did it avoid loops?)</p></li><li><p>Self-aware failures (did it acknowledge limitations?)</p></li></ul><p><strong>Node-level metrics (per tool call):</strong></p><ul><li><p>Tool correctness (right tool with right parameters?)</p></li><li><p>Tool call accuracy (did the tool run without errors?)</p></li><li><p>Output correctness (did the tool return valid results?)</p></li></ul><p><strong>System efficiency metrics:</strong></p><ul><li><p>Latency (time to complete task)</p></li><li><p>Token usage (cost per task)</p></li><li><p>Tool calls per task (efficiency of plan)</p></li></ul><p>These metrics layer on top of each other (6). System efficiency metrics track scalability. Session-level metrics validate goal achievement. 
Node-level metrics pinpoint root causes.</p><h2>Bringing It All Together</h2><p>Pick evaluators based on what you&#8217;re actually trying to measure, not what sounds impressive. Here&#8217;s how to decide which evaluator to use:</p><p><strong>Can you check it with code?</strong></p><p>Yes &#8594; Use code-based evaluators (tool calls, format checks, required elements)</p><p>No &#8594; Move to the next question</p><p><strong>Is there a single correct answer or narrow acceptable range?</strong></p><p>Yes &#8594; Use reference-based evaluation (exact match, ROUGE, BLEU)</p><p>No &#8594; Move to the next question</p><p><strong>Are you comparing two versions?</strong></p><p>Yes &#8594; Use pairwise comparison</p><p>No &#8594; Use direct scoring</p><p><strong>Is the task subjective (tone, helpfulness, flow)?</strong></p><p>Yes &#8594; Use LLM judges with rubrics and few-shot examples</p><p>No &#8594; Rethink your criteria (you might have missed a code-based check)</p><p><strong>Is it a multi-turn or agentic workflow?</strong></p><p>Yes &#8594; Use the two-phase approach (end-to-end task success + step-level diagnostics)</p><p>No &#8594; Use single-turn direct scoring</p><p>And remember: <strong>your evaluators are only as good as your dataset and few-shot examples</strong>. The system prompt matters less than you think. The examples matter more than you think.</p><h2>Next Steps</h2><p>You now know how to design evaluators that match your use case. You know when to use code, when to use LLMs, and when to combine both.</p><p>But here&#8217;s the critical question we haven&#8217;t answered: <strong>How do you know if your evaluator is actually working?</strong></p><p>An evaluator that says everything is great when it isn&#8217;t is worse than no evaluator at all. 
You need to validate that your automated judges align with human judgment before you trust them.</p><p>That&#8217;s what we&#8217;ll cover in <a href="https://www.decodingai.com/how-to-evaluate-the-evaluator-validate-llm-judge">Article 5: How to Evaluate the Effectiveness of the Evaluator</a>.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. <strong>Here&#8217;s what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a> </p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a>  </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a>  </p></li><li><p><strong>How to Design Evaluators</strong> &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://substack.com/@paoloap">Paolo Perrone</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures/comments"><span>Leave a comment</span></a></p><div><hr></div><h3>Most AI newsletters give you news. The AI Engineer gives you understanding.</h3><p>One concept per week, explained from first principles: when to fine-tune vs. prompt vs. RAG, which vector database fits your workload, and how companies like DoorDash ship AI at scale.</p><p><em>Written for senior engineers and tech leads who build with AI, not just read about it.</em></p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:6800638,&quot;name&quot;:&quot;The AI Engineer&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!sXyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F598ebb57-14dc-4faa-9dd1-08d4f2499564_512x512.png&quot;,&quot;base_url&quot;:&quot;https://theaiengineer.substack.com&quot;,&quot;hero_text&quot;:&quot;Where software engineers become dangerously good AI engineers.\n\n&quot;,&quot;author_name&quot;:&quot;Paolo Perrone&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#ffffff&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://theaiengineer.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" 
src="https://substackcdn.com/image/fetch/$s_!sXyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F598ebb57-14dc-4faa-9dd1-08d4f2499564_512x512.png" width="56" height="56" style="background-color: rgb(255, 255, 255);"><span class="embedded-publication-name">The AI Engineer</span><div class="embedded-publication-hero-text">Where software engineers become dangerously good AI engineers.

</div><div class="embedded-publication-author-name">By Paolo Perrone</div></a><form class="embedded-publication-subscribe" method="GET" action="https://theaiengineer.substack.com/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for 
free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Anthropic. (n.d.). Demystifying evals for AI agents. <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li><li><p>Evidently AI. (n.d.). LLM-as-a-judge: a complete guide. <a href="https://www.evidentlyai.com/llm-guide/llm-as-a-judge">https://www.evidentlyai.com/llm-guide/llm-as-a-judge</a></p></li><li><p>Evidently AI. (n.d.). LLM evaluation metrics and methods. <a href="https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics">https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics</a></p></li><li><p>OpenAI. (n.d.). Evaluation best practices. <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">https://developers.openai.com/api/docs/guides/evaluation-best-practices</a></p></li><li><p>Husain, H. 
(n.d.). How do I evaluate agentic workflows? <a href="https://hamelhusain.substack.com/p/how-do-i-evaluate-agentic-workflows">https://hamelhusain.substack.com/p/how-do-i-evaluate-agentic-workflows</a></p></li><li><p>Maxim. (n.d.). Evaluating agentic workflows: The essential metrics that matter. <a href="https://www.getmaxim.ai/articles/evaluating-agentic-workflows-the-essential-metrics-that-matter">https://www.getmaxim.ai/articles/evaluating-agentic-workflows-the-essential-metrics-that-matter</a></p></li><li><p>Confident AI. (n.d.). LLM evaluation metrics: Everything you need for LLM evaluation. <a href="https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation">https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Start Here: Your Map to Decoding AI]]></title><description><![CDATA[The AI Engineering Command Center]]></description><link>https://www.decodingai.com/p/ai-engineering-roadmaps-courses-and-books</link><guid isPermaLink="false">https://www.decodingai.com/p/ai-engineering-roadmaps-courses-and-books</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Sat, 28 Feb 2026 09:33:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a2dbd01-1caf-40f0-b912-52c22d96c533_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many of you have mentioned that as the magazine grows, finding the right architectural deep dive is becoming harder than the engineering itself. I want you building, not digging through archives.</p><p>This page is your <strong>Command Center</strong>. 
A clear map to the blueprints you need to move past "fancy demos" and ship production-grade AI.</p><p><em>Here&#8217;s how to find what you need</em> &#8595;</p><div><hr></div><div><hr></div><h1>&#128205;Step 1: The Decoding AI Roadmaps</h1><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oMuA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oMuA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 424w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 848w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1272w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oMuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png" width="1200" height="230" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:230,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117183,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/189115362?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oMuA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 424w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 848w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1272w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you&#8217;re here for the specific architectures and mental models, start by exploring my past work. 
To keep things tidy, I&#8217;ve moved the full master index to a dedicated page where you can navigate the magazine at your own pace.</p><p>You can filter the entire archive by:</p><ul><li><p><strong>Level:</strong> Beginner, Intermediate, Advanced.</p></li><li><p><strong>Collections:</strong> Foundations, Case Studies, Projects.</p></li><li><p><strong>Series:</strong> Per Topic End-to-End Blueprints.</p></li></ul><p><strong><a href="https://www.decodingai.com/p/ai-engineering-roadmaps">Explore the Roadmaps &#8594;</a></strong> </p><h1>&#128205;Step 2:  The Resource Library</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MOKD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MOKD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 424w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 848w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1272w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MOKD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png" width="1200" height="628" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:378302,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/189115362?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MOKD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 424w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 848w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1272w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you&#8217;re looking to go deeper with more structured guides, here&#8217;s where to look next. 
While the weekly content is great for staying sharp, if you're ready to build a complete system from scratch without piecing together different articles, I&#8217;ve compiled the best of what I know into a few digital products.</p><p>Unlike the weekly posts, these include <strong>full codebases, video walkthroughs, and Q&amp;A support</strong> to help you go from a blank IDE to a deployed system.</p><ul><li><p><strong><a href="https://www.amazon.com/LLM-Engineers-Handbook-engineering-production/dp/1836200072/">The LLM Engineer&#8217;s Handbook:</a></strong> A framework for building LLM and RAG apps. </p></li><li><p><strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering Course:</a> </strong>The end-to-end blueprint for designing, testing, and deploying autonomous agents.</p></li><li><p><strong><a href="https://www.pauliusztin.ai/courses">Full Course Catalog:</a> </strong>Write real code. Ship AI that actually works.</p></li></ul><p><strong>Not sure what to pick? </strong>I also have a <strong><a href="https://email-course.towardsai.net/?ref=b3ab31">6-day free email course</a></strong> on the critical design mistakes that silently break agentic systems. 
It boils down 2+ years of production experience into a simple mental model for building reliable agents that actually scale.</p><h1>&#128233; Every Tuesday</h1><p>You&#8217;ll get one actionable project, case study, or concept deep dive focused on the reality of shipping AI.</p><ul><li><p><strong>Real-world:</strong> No bedtime stories, just hands-on content.</p></li><li><p><strong>Time-efficient:</strong> One free actionable tip in less than 8 minutes.</p></li><li><p><strong>Future-proof:</strong> Skills that will thrive in a future dominated by AI coding tools.</p></li></ul><p><strong><a href="https://www.decodingai.com/">Check the latest insights &#8594; </a></strong></p><div><hr></div><h1>&#128172; Keep in Touch</h1><p>I&#8217;m building and sharing what works, and what doesn&#8217;t, every week. If you want to see the <em>"work in progress"</em> or the journey behind these systems, I also post here:</p><p><a href="http://linkedin.com/company/decodingai-magazine">LinkedIn</a> <strong>|</strong> <a href="https://x.com/pauliusztin_">X</a> <strong>|</strong> <a href="https://github.com/decodingai-magazine">GitHub</a> <strong>| </strong><a href="https://www.pauliusztin.ai/">pauliusztin.ai</a></p><p>Happy learning, <br><a href="https://substack.com/@pauliusztin">Paul Iusztin</a></p><div><hr></div><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[I Spent 9 Months Building an Agentic AI Engineering Course]]></title><description><![CDATA[Google is already recommending it alongside Coursera, DeepLearning.AI and Oxford.]]></description><link>https://www.decodingai.com/p/agentic-ai-engineering-course</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-ai-engineering-course</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 26 Feb 2026 12:00:27 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!6Qcm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most AI agent courses teach you toy examples. Build a chatbot, call an API, done. But when you try to build something real, something that handles research, generates structured content, orchestrates multiple tools, and actually works in production, you realize those tutorials left out everything that matters. Agentic AI is an engineering discipline, not a prompting exercise.</p><p>That gap is exactly why I spent the last 9 months building an <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a></strong> with Towards AI. And here is what makes it different: we didn&#8217;t just teach how to build agents. We built two production AI systems, used them daily, and wrote the course with them.</p><p>Google and Gemini are already recommending it alongside courses from Coursera and DeepLearning.AI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RBtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RBtZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!RBtZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg" width="598" height="608.8974943052392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:878,&quot;resizeWidth&quot;:598,&quot;bytes&quot;:172075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6c56e9-a4a4-4a38-9396-c157482464fd_878x894.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!RBtZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<h2>How This Course Was Built</h2><p>Back in January 2025, Louis-Fran&#231;ois Bouchard (Co-Founder at Towards AI) reached out to me about creating a course on Agentic AI Engineering. I deeply respected Louis&#8217;s work in the AI space. So I said yes.</p><p>By April 2025, we had a team of five and one non-negotiable rule: we would only teach something we actually use ourselves. No toy examples. No throwaway demos.</p><p>We settled on an ambitious idea: a deep research agent and a writing workflow specialized in generating high-quality lessons and articles with text, code, images, diagrams, and references. We called them Nova and Brown.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sbHi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sbHi!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 848w, 
https://substackcdn.com/image/fetch/$s_!sbHi!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sbHi!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sbHi!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!sbHi!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1456w" sizes="100vw"></picture></div></a></figure></div><p>The twist: we used Nova and Brown to write the course itself. Every lesson went through the same AI system we were teaching students to build. If something broke, we fixed it. Not for a demo, but because we needed it to work. 
That pressure forced us to build something production-ready, not just classroom-ready.</p><p>Nova and Brown are two MCP servers that can be orchestrated within a multi-agent system through Cursor, Claude Code, or any custom orchestrator. We created an AI system that writes about itself.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><h2>What You Get</h2><p>34 lessons that take you from foundations to deploying your own agent through articles, videos, and hands-on Notebooks. You will learn tool calling, ReAct loops, context engineering, structured generation, memory systems, RAG, planning and reasoning architectures, human-in-the-loop feedback, and CI/CD deployment:</p><ul><li><p><strong>Self-paced with monthly live kick-off sessions</strong> so you can go at your own speed without losing momentum.</p></li><li><p><strong>4 parts:</strong> Foundations (multiple smaller projects), two end-to-end complex projects, LLMOps (evaluation, observability, auth, deployment), and a final capstone project you implement yourself.</p></li><li><p><strong>Real code, not notebook-only demos.</strong> The teaching happens through Notebooks, but the code is structured as two Python modules (Nova and Brown). You import from the modules into Notebooks for a structured learning experience.</p></li><li><p><strong>Fundamentals over frameworks.</strong> We wrote as much as possible from scratch because tools change constantly. The course focuses on design principles and patterns you can replicate in any tool. 
Key tools used: LangGraph, LangChain, Gemini, FastMCP, Cursor/Claude Code, Opik, Perplexity, and GCP.</p></li><li><p><strong>Discord community</strong> with Q&amp;A support and a completion certificate.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" width="721" height="405.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:721,&quot;bytes&quot;:190048,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>Who Is This For?</h3><p>Engineers who want to go deep on AI agents, not skim the surface. If you are a software engineer, ML engineer, or data scientist who has played with LLMs but never built a multi-step agent that actually works in production, this is for you.</p><p>You should be comfortable with Python and have basic familiarity with LLMs, Docker, and cloud services. 
And above all: a builder mindset.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Early-bird pricing: <strong>$449 for lifetime access</strong>, limited to the first 100 seats!</em></p><p><strong>&#128161;</strong><em><strong> Not sure yet?</strong> We <a href="https://github.com/towardsai/agentic-ai-engineering-course/tree/main">open-sourced the code on GitHub</a> and made the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">first 6 lessons free</a>.</em></p><h2>What Students Are Saying</h2><p>We sold 150 pre-release slots to build the course with a real audience. The result: 25 five-star reviews. These come from students who went through the material, not from our own biased impression. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RrcJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RrcJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 424w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 848w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1272w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png" width="1314" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79774,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RrcJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 424w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 848w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1272w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>As one reviewer put it, the course &#8220;goes far beyond theory, providing deep, practical experience&#8221; with real-world constraints rather than flashy demos.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AVPK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AVPK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 424w, 
https://substackcdn.com/image/fetch/$s_!AVPK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 848w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AVPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png" width="721" height="738.3574074074074" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1106,&quot;width&quot;:1080,&quot;resizeWidth&quot;:721,&quot;bytes&quot;:761690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c248ba-e497-4c1e-b0ca-f260707c043d_1080x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!AVPK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 424w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 848w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Sean Myers, Principal Analyst at Columbia, already earned the first completion certificate:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6MwH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6MwH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 424w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 848w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1272w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6MwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png" width="725" height="512.3798076923077" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:1297320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6MwH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 424w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 848w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6MwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here you can learn more:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><h2>Paid Subscribers</h2><p>Paid subscribers get <strong>20% off.</strong> For the discount code, DM me on Substack or comment on this post.</p><p>We will soon create a paid subscribers&#8217; perks page with more offers. But for now, let&#8217;s keep it simple.</p><p>Looking forward to your feedback on the course and to seeing you next Tuesday!</p><p>Paul</p>]]></content:encoded></item></channel></rss>