<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Decoding AI Magazine]]></title><description><![CDATA[Join for content on designing, building, and shipping AI software. Learn AI engineering, end-to-end, from idea to production. Every Tuesday.]]></description><link>https://www.decodingai.com</link><image><url>https://substackcdn.com/image/fetch/$s_!k2ig!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png</url><title>Decoding AI Magazine</title><link>https://www.decodingai.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 12 Apr 2026 05:43:43 GMT</lastBuildDate><atom:link href="https://www.decodingai.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Paul Iusztin]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[decodingai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[decodingai@substack.com]]></itunes:email><itunes:name><![CDATA[Paul Iusztin]]></itunes:name></itunes:owner><itunes:author><![CDATA[Paul Iusztin]]></itunes:author><googleplay:owner><![CDATA[decodingai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[decodingai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Paul Iusztin]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Your RAG Pipeline Is Overkill]]></title><description><![CDATA[The pattern that lets your model write code to explore its context instead of retrieving it.]]></description><link>https://www.decodingai.com/p/recursive-language-models</link><guid 
isPermaLink="false">https://www.decodingai.com/p/recursive-language-models</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 07 Apr 2026 11:03:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jJY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We constantly fight a battle against the context window limit. You either compress your data until it loses meaning, or you build a massive infrastructure project just to read a few documents. Today, we look at a third option. We explore a pattern that allows models to read millions of tokens by treating data as an environment rather than an input.</p><p>In most AI projects, such as the financial assistant I am working on, there is a constant battle between Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Should you implement a heavy RAG architecture up front that might not even work, or does CAG get the job done? For example, in our financial assistant system, we ultimately decided to use RAG only when we really HAVE to, because it introduces zigzag retrieval patterns that require dozens of queries per operation, increasing latency.</p><p>Also, while building Brown, my writing agent, I hit another wall. Brown needs to ingest massive amounts of research to anchor its writing process. At 180,000 input tokens, the Gemini API became entirely unreliable.</p><p>I faced constant timeouts, disconnections, and infrastructure breakdowns. Huge context windows suffer from API reliability and infrastructure stability issues, as well as performance degradation. But the thing is, I didn&#8217;t want to overcomplicate my solution with a RAG layer, so I started looking around for other solutions.</p><p>Most engineers face this painful tradeoff when working with large documents. 
You can stuff everything into the context window, but performance degrades quickly. This is known as context rot: attention degrades over long contexts, and earlier information loses its influence <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">[1]</a>, <a href="https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/">[2]</a>.</p><p>Alternatively, you can build a RAG pipeline. But that requires maintaining vector databases, chunking strategies, and retrieval evaluation infrastructure.</p><p>Even the tools we use daily, like Claude Code or Cursor, rely on summarization-based context compression that can lose critical information. I just wanted to dump my research into one file and get good answers without the infrastructure breaking. Recursive Language Models (RLMs) solve this exact problem <a href="https://arxiv.org/abs/2512.24601">[3]</a>.</p><p>RLMs use an inference-time pattern that treats your input as an external environment the model interacts with programmatically. You do not need chunking infrastructure or embedding pipelines. 
The model writes code to explore, filter, and recursively process your data on demand.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jJY1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jJY1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jJY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png" width="1400" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three approaches to processing large documents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three approaches to processing large documents" title="The three approaches to processing large documents" srcset="https://substackcdn.com/image/fetch/$s_!jJY1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 424w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 848w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!jJY1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08afe0b8-6aa1-41ce-8bb3-cae88284181f_1400x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a>
<figcaption class="image-caption"><em>Image 1: The three approaches to processing large documents. RAG adds infrastructure complexity. Context stuffing causes degradation. RLMs treat the input as an external environment the model programs against.</em></figcaption></figure></div><p>This approach scales the effective input and output lengths of LLMs. Researchers tested RLMs up to 10 million tokens across GPT-5 and Qwen3-Coder, showing they easily outperform base models <a href="https://arxiv.org/abs/2512.24601">[3]</a>. Base model performance degrades as a function of input length and task complexity, while RLM performance scales with less degradation.</p><p>RLMs are also a model-agnostic inference strategy, meaning they work with any model you choose.</p><p>However, this architecture has honest downsides you must consider. 
The inference cost has high variance due to differences in trajectory lengths. The system suffers from code fragility, meaning that if the model writes buggy code, the entire reasoning chain fails.</p><p>Errors in sub-calls can compound through the recursive tree, propagating hallucinations. Sequential sub-calls also create latency bottlenecks. This makes RLMs best suited for deep thinking applications rather than real-time chat.</p><p>To understand how we bypass these infrastructure limits, we need to examine the specific programming trick that keeps the model&#8217;s memory clean.</p><p>Here is what you will learn about this pattern:</p><ul><li><p>The mechanism that keeps massive documents outside the context window.</p></li><li><p>The orchestration loop that drives programmatic data exploration.</p></li><li><p>The specific use cases where this pattern outperforms retrieval systems.</p></li><li><p>A practical method to approximate this behavior using Claude Code.</p></li></ul><div><hr></div>
<h2>The REPL Trick That Keeps Your Context Window Clean</h2><p>RLMs introduce a simple core idea. Do not feed the document into the model&#8217;s context window. Instead, load it as a variable in a persistent programming environment and let the model write code to interact with it <a href="https://www.primeintellect.ai/blog/rlm">[4]</a>.</p><p>The model never sees your 10-million-token document directly. In a traditional agent, the prompt goes into the model, completely blowing up your context window. 
In an RLM, the context stays outside as an external variable, and the model receives only a symbolic handle to it.</p><p>The system initializes a Read-Eval-Print Loop (REPL), which is a persistent interactive programming environment where variables and state persist across iterations <a href="https://arxiv.org/abs/2512.24601">[3]</a>.</p><p>The root model receives only metadata, such as the total character count and data structure. It also receives instructions on how to access the REPL. The model then writes code to peek into, filter with regex, chunk, or summarize the data.</p><p>When the model identifies a sub-task, it uses a specific primitive such as <code>llm_query(prompt, chunk)</code> to spawn a fresh, isolated worker sub-model <a href="https://arxiv.org/abs/2512.24601">[3]</a>. The system pauses, executes this sub-call, and returns the result to the root model&#8217;s REPL.</p><p>Variables persist across these REPL turns. The model aggregates findings into a buffer, building the response progressively across iterations. 
Once confident, it calls <code>FINAL(answer)</code> to stop the recursive loop and return the response <a href="https://dextralabs.com/blog/recursive-language-models-rlm/">[5]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i4L_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i4L_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i4L_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The RLM REPL mechanism&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The RLM REPL mechanism" title="The RLM REPL mechanism" srcset="https://substackcdn.com/image/fetch/$s_!i4L_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!i4L_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc48c578c-3a1e-4fbf-9c88-a08a748ee2bb_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 2: The RLM mechanism. The document stays outside the context window as a REPL variable. The model writes code to explore, decompose, and recursively process it.</em></figcaption></figure></div><p>RLMs essentially perform context engineering on autopilot. Traditional context engineering requires you to carefully curate what goes into the context window through retrieval and compression <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">[1]</a>. RLMs automate this by letting the model itself decide what to extract, filter, and process.</p><p>Costs and performance stay intact because the model filters the input context without explicitly seeing it. By writing Python scripts, the model processes only the relevant portions through sub-calls. 
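</p><p>To make the loop concrete, here is a minimal Python sketch of the mechanism described so far. It is an illustration under stated assumptions, not the reference implementation from the paper: <code>run_rlm</code>, <code>FinalAnswer</code>, <code>fake_worker</code>, and <code>scripted_root</code> are hypothetical names, the root model is replaced by a scripted list of code cells, and the worker is a stub that counts a keyword instead of calling a real model. A real harness would also sandbox the <code>exec</code> call.</p><pre><code>
```python
import contextlib
import io


class FinalAnswer(Exception):
    """Raised by FINAL() to stop the loop and carry the answer out."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def run_rlm(context, root_turns, llm_query, max_stdout=200):
    """Drive a toy RLM loop.

    context    -- the large document; it lives only inside the REPL namespace.
    root_turns -- list of code strings, standing in for the root model.
    llm_query  -- callable(prompt, chunk), standing in for a worker sub-model.
    """

    def final(answer):
        raise FinalAnswer(answer)

    # Persistent namespace: variables survive from one turn to the next.
    env = {"context": context, "llm_query": llm_query, "FINAL": final}
    # The root history starts with metadata only, never the text itself.
    history = [f"context: str with {len(context)} characters"]

    for code in root_turns:
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, env)  # assumption: trusted code, no sandbox here
        except FinalAnswer as done:
            return done.answer, history
        # Only a truncated view of the printed output reaches the history.
        history.append(stdout.getvalue()[:max_stdout])
    return None, history


def fake_worker(prompt, chunk):
    # Hypothetical worker: counts a keyword instead of calling a model.
    return str(chunk.count("ERROR"))


document = "\n".join(
    f"line {i}: {'ERROR' if i % 1000 == 0 else 'ok'}" for i in range(5000)
)

scripted_root = [
    "print(len(context))",                                   # peek at the size
    "hits = [l for l in context.splitlines() if 'ERROR' in l]\nprint(len(hits))",
    "FINAL(llm_query('How many errors?', ' '.join(hits)))",  # delegate, then stop
]

answer, history = run_rlm(document, scripted_root, fake_worker)
print(answer)
```
</code></pre><p>In this toy run, the 5,000-line document never enters the history. The root only ever sees the metadata line and the short printed outputs, while <code>hits</code> persists between turns as a REPL variable.</p><p>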
Only constant-size metadata about execution results is appended to the root model&#8217;s history, keeping its context window small and clean.</p><p>Understanding this mechanical loop allows us to map the pattern directly to production harness engineering.</p><h2>Turn Any Agent Into a Plan-Execute-Validate Machine</h2><p>RLMs are an inference-time orchestration pattern that maps directly to production harness engineering. If you have built agent systems, you already know the components: a planning loop, tool execution and validation <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[7]</a>. RLMs formalize this into a programmable, recursive architecture.</p><p>A robust RLM harness uses a multi-tiered architecture. The root controller is a frontier model that acts as the project manager. It plans the reasoning process, writes code, and coordinates execution, but never directly interacts with tools or the full document <a href="https://www.anthropic.com/engineering/building-effective-agents">[8]</a>.</p><p>Worker sub-models are cheaper, faster models spawned via an operation such as <code>llm_query()</code> to handle specific, localized sub-tasks. This reduces overall costs while maintaining high quality. The aggregation layer is the REPL environment that combines recursive step results into a final structured response via persistent variables.</p><p>This setup naturally follows the plan-execute-validate mapping. In the plan phase, the root controller reviews the query, creates a reasoning plan, and decides how to decompose the problem. It might plan to regex-filter a codebase, chunk a document, or batch sub-calls for parallel analysis.</p><p>In the execute phase, the model translates the plan into code. It writes Python scripts, issues <code>llm_query()</code> calls, and spawns worker sub-models for parallel execution in isolated REPL environments. 
External tools, like web search, are provided ONLY to worker sub-models, keeping the root model&#8217;s context perfectly clean.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OWkF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OWkF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OWkF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The plan-execute-validate orchestration loop&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The plan-execute-validate orchestration loop" title="The plan-execute-validate orchestration loop" srcset="https://substackcdn.com/image/fetch/$s_!OWkF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!OWkF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec670a03-124b-4954-93e7-745b5cf1a5d3_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Image 3: The plan-execute-validate loop. The root controller plans, worker sub-models execute, the system validates, and the cycle repeats until FINAL().</em></figcaption></figure></div><p>After execution, the system enters the validation phase, where results feed back as observations. The root model assesses accuracy, launches verification sub-calls, and handles errors by dynamically adjusting its plan. If the Python code fails, the error traceback is yielded back to the model as an event.</p><p>This allows the model to adapt and fix its code on the next turn. The cycle repeats until the model calls <code>FINAL(answer)</code>.</p><p>Deploying this in the real world requires strict production guardrails. You must configure <code>maxIterations</code> to cap the number of REPL turns, typically between 10 and 50. 
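</p><p>The traceback feedback and the iteration cap can be sketched in a few lines of Python. This is a hedged illustration, not any specific library&#8217;s API: <code>run_cell</code>, <code>run_with_guardrails</code>, and <code>scripted_model</code> are hypothetical names, <code>max_iterations</code> mirrors <code>maxIterations</code>, <code>max_stdout_length</code> is an assumed truncation limit, and the root model is replaced by a scripted function that reacts to the previous observation.</p><pre><code>
```python
import contextlib
import io
import traceback


def run_cell(code, env, max_stdout_length=500):
    """Execute one model-written cell and return a truncated observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env)  # assumption: a real harness would sandbox this
        observation = buffer.getvalue()
    except Exception:
        # Feed the traceback back instead of crashing the whole run.
        observation = traceback.format_exc()
    # Keep the tail: the last lines of a traceback name the actual error.
    return observation[-max_stdout_length:]


def run_with_guardrails(next_code, env, max_iterations=10):
    """Call next_code(observation) until it returns None or the cap is hit."""
    observation = ""
    for turn in range(max_iterations):
        code = next_code(observation)
        if code is None:
            return turn, observation
        observation = run_cell(code, env)
    return max_iterations, observation


def scripted_model(observation):
    # Hypothetical root model: reacts to the previous observation.
    if observation == "":
        return "print(totl)"    # first attempt has a misspelled variable
    if "NameError" in observation:
        return "print(total)"   # fix after reading the traceback
    return None                 # satisfied, stop


turns, last = run_with_guardrails(scripted_model, {"total": 42})
print(turns, repr(last))
```
</code></pre><p>The first cell fails, the traceback comes back as the observation, and the second cell succeeds. Without the cap, a model that keeps producing broken code would loop forever.</p><p>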
You need <code>maxDepth</code> to limit the recursive stack depth, where a depth of 1 is usually sufficient.</p><p>You also need <code>maxStdoutLength</code> to truncate REPL output returned to the model to prevent context overflow. Finally, permission gating is required to provide sandboxed execution with explicit approval for sensitive operations.</p><p>Neither Claude Code nor OpenAI Codex uses true RLM patterns. They rely on summarization-based context compression, file-system state tracking and progressive disclosure techniques <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">[9]</a>. This creates a succession of agents connected by prompts and file state, rather than maintaining a persistent REPL environment with programmatic sub-calls.</p><p>With this architecture in place, we can identify the specific real-world scenarios where this pattern outperforms traditional data processing.</p><h2>Four Scenarios Where RLMs Beat Traditional Approaches</h2><p>RLMs are best suited for deep thinking applications that require accuracy, multi-step reasoning, and reliability over massive contexts. They are not suited for real-time, low-latency chat applications.</p><p>The <strong>first scenario</strong> is parsing large files without building retrieval infrastructure. Instead of building a hybrid index with vector and graph search, you keep everything in one file or directory and use an RLM agent to extract information on demand.</p><p>We can view the relationship between RAG and RLMs as a spectrum. For simple cases, RLMs replace RAG entirely, removing the need for chunking and embeddings. For advanced scenarios, RLMs complement retrieval beautifully.</p><p>You use semantic search to find your first pool of candidates, write the results to disk as cached short-term memory, and use an RLM to query that refined dataset on demand.</p><p>The retrieval narrows the haystack, and the RLM reasons deeply over what is left. 
I use this exact workflow for my research, dumping everything into a massive text file and using an RLM to extract relevant information.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K9A3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K9A3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 424w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 848w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K9A3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png" width="1400" height="1208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1208,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1040009,&quot;alt&quot;:&quot;RLM replacing RAG for large file parsing&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RLM replacing RAG for large file parsing" title="RLM replacing RAG for large file parsing" srcset="https://substackcdn.com/image/fetch/$s_!K9A3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 424w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 848w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!K9A3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41df1265-00a8-4373-8862-dd260870cd6c_1400x1208.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: RLM replaces the entire RAG pipeline for large file parsing. One file, one agent, no retrieval infrastructure.</em></figcaption></figure></div><p>The <strong>second scenario</strong> is complex software engineering and codebase comprehension. RLMs ingest massive codebases containing millions of tokens to answer questions about architecture, map dependencies, and perform reviews.</p><p>The RLM paper tested this on LongBench-v2 CodeQA using Qwen3-Coder with a Python REPL. 
The model writes code to break down the codebase, launches sub-queries to smaller language models, and aggregates findings <a href="https://arxiv.org/abs/2512.24601">[3]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HwsN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HwsN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HwsN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png" width="1400" height="1400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1400,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RLM decomposing a codebase through recursive sub-queries&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="RLM decomposing a codebase through recursive sub-queries" title="RLM decomposing a codebase through recursive sub-queries" srcset="https://substackcdn.com/image/fetch/$s_!HwsN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 424w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 848w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!HwsN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe733bf97-fdcf-4e10-9d2c-4d242b1baf1d_1400x1400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 5: An RLM decomposes a codebase question into parallel sub-queries, each handled by a worker sub-model, then aggregates the results.</em></figcaption></figure></div><p>The <strong>third scenario</strong> is enterprise legal and financial analysis. RLMs provide consistent interpretation across thousands of contracts, case files, and policies that would overwhelm a standard context window. They also excel at financial audits and due diligence by tracing, validating, and reasoning through massive financial datasets.</p><p>The <strong>fourth scenario</strong> is deep research and information synthesis. RLMs synthesize research across thousands of files by programmatically filtering, chunking, and summarizing. 
They enable knowledge graph exploration and multi-hop reasoning over large document dumps.</p><p>At scale, RLMs become both more accurate and cheaper than standard long-context approaches. They avoid paying for n-squared attention over massive contexts by having the model process only relevant slices via sub-calls. In all these scenarios, the RLM pattern succeeds because it treats the LLM as a project manager that decides what to look at and delegates sub-tasks to workers.</p><p>With these use cases in mind, we can approximate the pattern using tools you likely already have installed.</p><h2>Build a Naive RLM SKILL in Claude Code</h2><p>Claude Code does not natively use the RLM pattern. It relies on summarization-based context compression, file-system state tracking, and progressive disclosure. However, you can approximate RLM behavior using Claude Code&#8217;s existing harness features to build a naive RLM SKILL.</p><p>First, you set up the environment by having the SKILL load the target file or directory as a reference. Instead of feeding it into the context window, it writes the file path and metadata to a prompt for the root agent.</p><p>Second, the root Claude Code agent receives only this metadata and a set of instructions for how to interact with it. It uses its Explore subagent to examine the data structure, identify relevant sections, and plan its approach.</p><p>Third, the SKILL uses Claude Code&#8217;s Agent tool to spawn subagents. Each subagent receives a focused prompt to read specific lines and extract the relevant findings, returning a condensed summary of a few thousand tokens. This mirrors the RLM pattern of spawning isolated sub-calls that process slices of the input.</p><p>Finally, the root agent collects these subagent results. 
It aggregates them into a coherent answer and decides whether more exploration is needed or whether to finalize the output.</p><p>Here is what this naive RLM SKILL looks like as a <em>SKILL.md</em> file:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">---
name: rlm-research-analyzer
description: "Analyze large research files by treating
  them as an external environment. Instead of stuffing
  content into context, the model explores, decomposes,
  and recursively processes the data through subagents."
---

# Analyze Large Research Files Using the RLM Pattern

## Step 1 &#8212; Initialize the environment

Accept the target file path and the user query as
arguments. Do NOT read the file into context. Instead,
run these Bash commands to collect metadata:

wc -l &lt;file_path&gt;   # total lines
wc -c &lt;file_path&gt;   # total bytes
head -5 &lt;file_path&gt;  # short prefix

Write the metadata and file path to a temporary prompt
file at &lt;working_dir&gt;/rlm_prompt.md. The root agent
receives ONLY this metadata, never the full content.

## Step 2 &#8212; Plan the exploration

Read rlm_prompt.md. Based on the metadata and prefix,
decide how to decompose the file. Use an Explore
subagent to scan the file structure:

- Identify section boundaries, headings, or delimiters
- Estimate which regions are relevant to the query
- Produce a ranked list of target ranges to process

## Step 3 &#8212; Delegate to worker subagents

For each target range, spawn an Agent subagent with a
focused prompt:

"Read lines {start}-{end} of {file_path}. Extract all
findings related to {query}. Return a summary under
2000 tokens."

Launch multiple subagents in parallel when ranges are
independent. Write each subagent's output to
&lt;working_dir&gt;/slice_{n}.md.

## Step 4 &#8212; Aggregate and finalize

Read all slice files. Synthesize the findings into a
single coherent answer. If gaps remain, return to
Step 3 with new target ranges. Otherwise, write the
final output to &lt;working_dir&gt;/answer.md and present
it to the user.</code></pre></div><p>Notice how the four steps map directly to RLM primitives. Step 1 mirrors REPL initialization, where the data becomes an external variable rather than context input. Step 2 mirrors the root model&#8217;s planning turn, where it inspects the data before delegating any work. Step 3 replaces the theoretical <code>llm_query()</code> operation with Claude Code&#8217;s Agent tool. Step 4 mirrors the <code>FINAL()</code> call that terminates the recursive loop.</p><p>This naive approximation lacks several critical features. It has no true REPL persistence, as Claude Code subagents do not share a persistent variable space. The filesystem serves as a proxy for REPL state, but it is slower and less elegant.</p><p>It also lacks sandboxing, as Claude Code runs directly in your environment. You also miss out on configurable guardrails like <code>max_iterations</code> and <code>max_output_chars</code>, so you have to enforce limits manually. You get the idea.</p><p>Still, I&#8217;ve been using a similar technique in all my current projects: instead of stuffing the research into a single file, I dump everything into a directory and link it all together in an <code>index.yaml</code> file that contains URIs to all the files, plus metadata such as the title and a 1-2 sentence summary of each source. This way, Claude Code can use the <code>index.yaml</code> file to navigate the whole research dump in a token-efficient way through progressive disclosure.</p><p>My structure looks something like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">research/
&#9500;&#9472;&#9472; index.yaml
&#9500;&#9472;&#9472; file_1.md
&#9500;&#9472;&#9472; file_2.md
&#9500;&#9472;&#9472; ...
&#9492;&#9472;&#9472; file_N.md</code></pre></div><p>Also, the only out-of-the-box implementation I found is within the <a href="https://dspy.ai/api/modules/RLM/">DSPy framework</a>.</p><p>The naive SKILL is a useful thought exercise and a practical first step. For production use, you should reference the DSPy framework&#8217;s <code>dspy.RLM</code> module.</p><h2>What&#8217;s Next</h2><p>RLMs represent a fundamental shift in how we process large inputs. We are moving from asking how to fit data in the context window to asking how we let the model interact with it programmatically. This is a great thought exercise on integrating specialized inference-time functionality into your harness.</p><p>As models get better at writing code and REPL environments become more sophisticated, the boundary between the model and its infrastructure will blur. The model does not just use tools, it writes the tools on the fly to solve the specific problem in front of it.</p><p>Your next practical step is to experiment with our SKILL or with the DSPy framework&#8217;s <code>dspy.RLM</code> module on a real problem. Point it at a large codebase you need to understand or a research corpus you need to synthesize. Start with something you have been using RAG or context stuffing on, and see whether the RLM approach is more effective.</p><p><em>But here is what I&#8217;m wondering: </em></p><p><em><strong>How have you been passing large files, such as deep research results or books, to your agents so far? RAG, CAG or other creative techniques?</strong></em></p><p><em>Click the button below and tell me. 
I read every response.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/recursive-language-models/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/recursive-language-models/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to restack this for your readers. </em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/recursive-language-models?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/recursive-language-models?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h4>Whenever you&#8217;re ready, here is how I can help you</h4><p>If you want to go from zero to shipping production-grade AI agents, check out my <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a></strong>, built with Towards AI.</p><p>34 lessons. Three end-to-end portfolio projects. A certificate. And a Discord community with direct access to industry experts and me.   </p><p><em>Rated 5/5</em> by 300+ students. 
The first 6 lessons are free:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start here</span></a></p><p><em>Not ready to commit?</em> Start with our <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free Agentic AI Engineering Guide</a></strong>, a 6-day email course on the mistakes that silently break AI agents in production.</p><div><hr></div><h2>References</h2><ol><li><p>(n.d.). Effective Context Engineering for AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a></p></li><li><p>(n.d.). MIT&#8217;s new &#8216;recursive&#8217; framework lets LLMs process 10 million tokens without context rot. VentureBeat. <a href="https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/">https://venturebeat.com/orchestration/mits-new-recursive-framework-lets-llms-process-10-million-tokens-without-context-rot/</a></p></li><li><p>Zhang, A. L., Kraska, T., &amp; Khattab, O. (2025). Recursive Language Models. arXiv. <a href="https://arxiv.org/abs/2512.24601">https://arxiv.org/abs/2512.24601</a></p></li><li><p>(n.d.). Recursive Language Models: the paradigm of 2026. Prime Intellect. 
<a href="https://www.primeintellect.ai/blog/rlm">https://www.primeintellect.ai/blog/rlm</a></p></li><li><p>(n.d.). Why Recursive Language Models (RLMs) Beat Long-Context LLMs. Dextra Labs. <a href="https://dextralabs.com/blog/recursive-language-models-rlm/">https://dextralabs.com/blog/recursive-language-models-rlm/</a></p></li><li><p>Mansurova, M. (2026, March 30). Going Beyond the Context Window: Recursive Language Models in Action. Towards Data Science. <a href="https://towardsdatascience.com/going-beyond-the-context-window-recursive-language-models-in-action/">https://towardsdatascience.com/going-beyond-the-context-window-recursive-language-models-in-action/</a></p></li><li><p>(2026, March 21). The Anatomy of an Agent Harness. LangChain Blog. <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">https://blog.langchain.com/the-anatomy-of-an-agent-harness/</a></p></li><li><p>(2025, December 24). Building Effective AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/building-effective-agents">https://www.anthropic.com/engineering/building-effective-agents</a></p></li><li><p>(2026, March 25). Effective Harnesses for Long-Running Agents. Anthropic. 
<a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Agentic Harness Engineering]]></title><description><![CDATA[Building systems that transform the LLM into the new operating system]]></description><link>https://www.decodingai.com/p/agentic-harness-engineering</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-harness-engineering</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 31 Mar 2026 11:03:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!imx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the AI start-up I&#8217;ve been working at, building a financial personal assistant, we implemented LlamaIndex, added the Model Context Protocol (MCP), and built complex Retrieval-Augmented Generation (RAG) pipelines. Each piece added complexity without adding direct business value.</p><p>When we stripped everything back to plain Python, simple API calls, and a custom ReAct engine, things finally worked. What we accidentally built was a harness featuring specialized financial tools, domain-specific guardrails, and purpose-built context engineering.</p><p>We did not know the term yet, but the lesson was clear. The model was never the problem. The system and infrastructure around it were.</p><p>Most engineering teams obsess over which model to use. They debate GPT-4o versus Claude Opus versus Gemini. They chase LLM benchmark scores and swap models, hoping for better results.</p><p>But the model is only half the equation. 
The system and infrastructure around it determine whether your agent actually works in production.</p><p>TerminalBench 2.0 shows this. Changing only the harness moved LangChain&#8217;s DeepAgent from outside the top 30 into the top 5 <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!imx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!imx1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!imx1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!imx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Agent = Model + Harness&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agent = Model + Harness" title="Agent = Model + Harness" srcset="https://substackcdn.com/image/fetch/$s_!imx1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!imx1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!imx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7112216b-b547-4d35-a6f4-b4ad5d7324d1_1456x1048.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><em>Image 1: Agent = Model + Harness. The harness is everything that isn&#8217;t the model.</em></figcaption></figure></div><p>This is what usually happens. You have a powerful model. You gave it tools and a prompt. It works in demos.</p><p>But shipping it to production means solving problems the model cannot solve alone. You must bridge context windows, recover from failures, serve multiple interfaces, and manage state across sessions.</p><p>The solution is harness engineering. This is the discipline of building the infrastructure around the model so it can do useful work reliably. 
As Mitchell Hashimoto noted, harness engineering is the practice of engineering a solution every time an agent makes a mistake, ensuring it never makes that specific mistake again <a href="https://mitchellh.com/writing/my-ai-adoption-journey">[2]</a>.</p><p>By the end of this article, you will learn:</p><ul><li><p>What an agent harness actually is.</p></li><li><p>The core components powering production AI systems.</p></li><li><p>How the planning loop dictates agent actions.</p></li><li><p>The design principles behind an effective toolset.</p></li><li><p>How to manage memory using the filesystem.</p></li></ul><p>Before we look at all its components and how they fit together, we must first define what a harness actually is.</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Your Path to Agentic AI Engineering for Production (Product)</a></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 
1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most engineers know the theory behind agents, context engineering, and RAG. What they lack is the confidence to architect, evaluate, and deploy these systems in production. 
The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>, closes that gap across 34 lessons (articles, videos, and a lot of code).</p><p>By the end, you will have gone from <em>&#8220;I built a demo&#8221;</em> to <em>&#8220;I shipped a production-grade multi-agent system with evals, observability, and CI/CD.&#8221;</em> You also get three portfolio projects, a certificate to back them up in interviews, and a Discord community with direct access to industry experts.</p><p><strong>Rated 5/5</strong> &#11088;&#65039; by 300+ early students, who say <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and call it <em>&#8220;An excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p><em>Start learning today. The first 6 lessons are free:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Enroll here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Enroll here</span></a></p><div><hr></div><h2>So... What the Heck Is a Harness?</h2><p>When I talked with Jonathan Gennick from Manning, he said that the first time he heard the term &#8220;harness&#8221; was in the context of horses. Let me explain. 
A horse is powerful on its own, but useless for farming without a harness. The straps and reins let you direct its strength toward useful work. The same applies to LLMs.</p><p>The model has intelligence. But without tools, memory, state, guardrails, and orchestration, you cannot put it to work reliably.</p><p>LangChain offers the clearest definition. <strong>An agent equals a model plus a harness.</strong> The harness is every piece of code, configuration, and execution logic that is not the model itself <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><p>A basic agent, as we know it so far, is just a model, a prompt, tools, and a planning loop. A harness extends this by adding memory systems, guardrails, advanced orchestration, context engineering, and multi-agent coordination.</p><p>Usually, it also includes a serving layer that connects the agent to various user interfaces, such as terminal apps, web dashboards, IDE plugins, and messaging apps like Telegram.</p><p>Ultimately, &#8220;harness&#8221; is the term for the real software application you build around an LLM or other model, with the model acting as the operating system. Applications like Claude Code, OpenCode, OpenClaw, and Codex are all harnesses. 
You could swap the model inside them, but the real engineering value lives in the harness itself.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oe0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Oe0K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:757851,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/192391298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oe0K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Oe0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a23aea-fa54-4dad-a0c4-5b5bcabc096b_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 2: The three levels of engineering: Prompt engineering is crafting instructions, context engineering is managing what the model sees, and harness engineering is the full infrastructure.</figcaption></figure></div><p>This introduces three distinct levels of engineering. Prompt engineering crafts the instructions. Context engineering dictates what goes into the context window and when.</p><p>Harness engineering is the full application and infrastructure. It controls when context loads, which tools are available, which actions are allowed, and how failures are handled. 
Each level encompasses the previous one <a href="https://youtube.com/watch?v=zYerCzIexCg">[3]</a>.</p><p>Now that you understand what a harness is, the next step is to explore the internal architecture and see how these pieces connect.</p><h2>The Anatomy of a Harness</h2><p>A complete harness consists of the LLM, tools, a planning loop, context engineering, a sandbox, memory, an orchestration layer, and a serving layer. In other words, everything that has been hovering within the AI space is finally falling into one beautiful system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T0f9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T0f9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!T0f9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Full harness architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Full harness architecture" title="Full harness architecture" srcset="https://substackcdn.com/image/fetch/$s_!T0f9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!T0f9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bcf6c6-4ab2-4dc4-937e-8b4ae2456620_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 3: The full harness architecture, from the model at the center to the serving layer at the edge.</em></figcaption></figure></div><p>One of the most distinctive features of modern harnesses is the multi-surface architecture. OpenClaw serves the same agent across a terminal user interface (TUI), a web UI, desktop apps, Slack, and Telegram/WhatsApp through a centralized Gateway using a typed WebSocket protocol.</p><p>Codex evolved from a simple terminal tool to an App Server using JSON-RPC over standard input and output. 
OpenCode uses a Bun JS HTTP server where any client connects via HTTP, utilizing an Event Bus to broadcast results in real time <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-1-control">[4]</a>, <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>, <a href="https://blog.bytebytego.com/p/how-openai-codex-works">[6]</a>.</p><p>This architecture introduces challenges. Multiple messages arrive in parallel from different clients. Users ask questions while the model is still processing.</p><p>To solve this, systems use priority queues and message buses. OpenClaw uses a lane-aware FIFO queue to ensure only one active run per session while allowing parallelism across different sessions.</p><p>At the core of all this infrastructure, the filesystem is king. As the most foundational harness primitive, it enables durable storage, workspace management, multi-agent collaboration, and versioning.</p><p>You heard me right: there is no fancy vector database in place. With AI, we are going back to basics, and nothing is purer than the filesystem itself.</p><p>Every production harness uses the filesystem as its primary state mechanism <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><p>You might wonder if this is just traditional orchestration like Airflow. It is different in three key ways. The agent loop is non-deterministic, context management is a first-class concern, and the programmer inside the loop is the LLM itself. It is common to add durability to the harness using tools such as Prefect, Temporal, or DBOS that natively support dynamic pipelines rather than predefined, rigid DAGs.</p><p>Let us zoom in on the first and most fundamental component: the planning loop.</p><h2>How the Agent Decides What to Do Next</h2><p>The most common pattern for the planning loop is ReAct, which stands for Reasoning and Acting. 
The model receives the current state, reasons about what to do next, takes an action via a tool call, and observes the result. This cycle repeats continuously until a strict stopping condition is met <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>.</p><p>Consider a concrete example. A user asks the agent to fix a failing test. First, the model reads the test output, reasons that the import path is wrong, and edits the file through a tool.</p><p>Second, it re-runs the tests, sees a new type mismatch error, and fixes it. Third, it runs the tests again.</p><p>They pass, the model reasons the job is done, and it stops. The harness orchestrates this loop, while the model reasons and picks actions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sIaN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sIaN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sIaN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sIaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;ReAct loop and orchestrator-worker pattern&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="ReAct loop and orchestrator-worker pattern" title="ReAct loop and orchestrator-worker pattern" srcset="https://substackcdn.com/image/fetch/$s_!sIaN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!sIaN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sIaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41090c36-9a2e-4157-8f5a-df8cc2d9bb67_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 4: The ReAct loop drives every agent action. For complex tasks, an orchestrator delegates to specialized workers, each with its own context window.</em></figcaption></figure></div><p>When tasks are too complex for a single agent, harnesses use orchestrator-worker patterns. 
The orchestrator decomposes a task, delegates subtasks to specialized workers, and aggregates the results.</p><p>In OpenCode, a dedicated <em>task</em> tool spawns subagents. Each subagent gets its own session, context window, and restricted tool set <a href="https://www.anthropic.com/research/building-effective-agents">[7]</a>.</p><p>For tasks that span multiple context windows, Claude Code implements <em>Ralph Loops</em>. This harness mechanism intercepts the model&#8217;s attempt to exit via a hook. It reinjects the original prompt in a clean context window, forcing the agent to continue against a completion goal using the state persisted on the filesystem <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">[1]</a>.</p><p>While automating my business with agents, I learned a hard lesson about orchestration. I initially built five specialized agents, each handling one step.</p><p>I eventually found that a single agent with memory and smart context engineering outperformed the whole swarm. Always start with one well-harnessed agent before reaching for multi-agent complexity.</p><blockquote><p><em>Here is a deep dive into planning:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b381af34-4084-454a-8a68-0de07e50251c&quot;,&quot;caption&quot;:&quot;Welcome to the AI Agents Foundations series: A 9-part journey from Python developer to AI Engineer. Made by busy people. 
For busy people.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Does Memory for AI Agents Work?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:110559689,&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;bio&quot;:&quot;Senior AI Engineer &#8226; Founder @ Decoding AI &#8226; Author @ LLM Engineer&#8217;s Handbook I ship AI products and teach you about the process.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-12-02T12:03:49.149Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!G5CM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6f2d58-f21f-4b49-b4f0-fb553fc28e36_1200x1200.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.decodingai.com/p/how-does-memory-for-ai-agents-work&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:180239220,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:76,&quot;comment_count&quot;:7,&quot;publication_id&quot;:1526003,&quot;publication_name&quot;:&quot;Decoding AI Magazine&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!k2ig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>While the planning loop decides the next step, the agent still needs a way to interact with its environment.</p><h2>The Tools That Let Agents Act</h2><p>This interaction happens 
through a specific toolkit designed for autonomous execution.</p><p>First, <em>Bash</em> is a general-purpose tool. The agent can run any shell command to execute tests, linters, or builds. This gives the model code execution capabilities so it can design its own tools on the fly rather than being constrained by fixed options.</p><p>For example, the agent can execute inline Python through <code>python -c "..."</code>, generate a script and run it through <code>python main.py</code>, or run your code as <code>python -m my_module.main</code>.</p><p>Second, specialized filesystem tools handle common operations like reading, writing, editing, and searching. Doing file operations via Bash is slow and error-prone.</p><p>Specialized tools include safety checks. For instance, a read tool enforces absolute paths and line limits, while an edit tool validates the uniqueness of replacement strings.</p><p>Third, state management tools track session-scoped tasks. These give the agent working memory within a single session. 
For example, OpenCode has <code>ToDoAdd</code> and <code>ToDoRead</code> tools that add/read tasks from a queue to keep track of the plan it has to execute.</p><p>Finally, orchestration tools launch subagents with their own isolated prompts and context windows, such as OpenCode&#8217;s <code>task</code> tool or Claude Code&#8217;s <code>agent</code> tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kMuj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kMuj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kMuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Standard harness toolkit&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Standard harness toolkit" title="Standard harness toolkit" srcset="https://substackcdn.com/image/fetch/$s_!kMuj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!kMuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3f1927-4655-4c2b-902a-ac42d37d9a13_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption"><em>Image 5: The standard harness toolkit organized by design principle &#8212; from general-purpose bash to specialized filesystem tools to orchestration.</em></figcaption></figure></div><p>Feedback loops are the most important principle around tooling. Boris Cherny, the creator of Claude Code, noted that giving the model a way to verify its work improves quality by two to three times. For example, OpenCode integrates the Language Server Protocol (LSP) to get real-time code definitions and diagnostics.</p><p>Undefined variables and type errors are fed back to the LLM for immediate correction. These tools do not act on the world. They feed vital information back to the planning loop.</p><p>Harnesses also enforce tool access control. In OpenCode, the planning agent cannot call edit tools. 
This prevents exploratory agents from accidentally modifying your code <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>.</p><blockquote><p><em>Here is a deep dive into tool calling:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8cbbec02-7a2c-4e79-a765-07ca9904f17e&quot;,&quot;caption&quot;:&quot;Welcome to the AI Agents Foundations series&#8212;a 9-part journey from Python developer to AI Engineer. Made by busy people. For busy people.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Tool Calling From Scratch to Production: The Complete Guide&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:110559689,&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;bio&quot;:&quot;Senior AI Engineer &#8226; Founder @ Decoding AI &#8226; Author @ LLM Engineer&#8217;s Handbook I ship AI products and teach you about the process.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-10-28T08:00:55.938Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cv8k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F362128ff-7821-482e-b08a-8252d0faab99_1200x1200.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.decodingai.com/p/tool-calling-from-scratch-to-production&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:176436971,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:50,&quot;comment_count&quot;:5,&quot;publication_id&quot;:1526003,&quot;publication_name&quot;:&quot;Decoding AI 
Magazine&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!k2ig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>Once the agent has its tools, it needs a secure place to use them. In production, this requires strict isolation.</p><h2>Where Agents Run</h2><p>Agents execute code, and that code can fail, crash, or delete all your files. I know I want my precious notes protected. Sandboxes isolate agent execution so failures do not affect the host system or other agents. The cherry on top is that they also enable horizontal scaling across parallel environments.</p><p>There is a strict tradeoff between security and capability. Not every harness uses the same approach. Codex uses a hard sandbox.</p><p>Each task runs in an isolated cloud container preloaded with the repository. This provides maximum safety, but the agent cannot access the host filesystem <a href="https://blog.bytebytego.com/p/how-openai-codex-works">[6]</a>.</p><p>Conversely, OpenClaw uses a soft sandbox. The workspace is the default working directory. This grants maximum capability but introduces more risk.</p><p>OpenClaw deliberately avoids hard sandboxing to preserve full filesystem access. Most production harnesses sit somewhere between these extremes, depending on the trust model.</p><p>When you submit a task to Codex, the harness spins up a fresh cloud container. The agent works inside this container to read files, run tests, and install packages.</p><p>It cannot touch your local machine. When the job finishes, the results are extracted, and the container is destroyed.</p><p>Along with security, a major benefit of cloud sandbox environments is that they give the agent access to powerful computing resources. 
For example, if you want to train a model on a GPU, you can ask the agent to implement and run a training pipeline inside a GPU-backed sandbox.</p><p>This is similar to SSHing into different VMs and running the code there by hand. Following the same principle, you can spin up multiple cloud sandboxes and run your agents in parallel.</p><p>On the other side of the spectrum, you can also run sandbox environments locally through Docker containers or isolated processes, similar to what Cursor does. This is super useful when you want to try something out and give the agent full permissions without having to supervise it.</p><p>While sandboxes provide a safe space for execution, they are ephemeral by design.</p><h2>Memory Is Just the Filesystem</h2><p>To survive across sessions and context windows, every harness manages state across three distinct memory layers. The first layer is the filesystem. This is the long-term memory.</p><p>It is durable and persistent, surviving across sessions. This is where progress files, git history, and session transcripts live.</p><p>The second layer is RAM. This is the short-term memory, also known as the working memory. It holds the conversation history and tool results during an active session. It is fast but volatile.</p><p>The third layer is the context window. This is what the model actually sees. 
It is the strictest constraint, as everything the model knows about the current task must fit here.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jici!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jici!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!jici!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!jici!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!jici!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jici!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:681800,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/192391298?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jici!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!jici!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!jici!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!jici!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e1df83-93f4-4c49-93c5-d11a372173d1_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Image 6: The three-layer memory dynamics &#8212; filesystem as long-term state, RAM as working memory, context window as what the model sees. The cycle repeats: load &#8594; process &#8594; flush.</figcaption></figure></div><p>The harness orchestrates the dynamics between these layers. On the read path, the harness selectively loads relevant state from the disk into the RAM.</p><p>It then assembles the context window using context engineering techniques such as compaction, progressive disclosure, and just-in-time retrieval. On the write path, the harness persists important state back to the disk after processing.</p><p>OpenClaw enforces a strict invariant that memory is always flushed to disk before being discarded from context. 
Rehydration is treated as a tool-shaped action, where the agent searches and then retrieves specific data, rather than dumping everything into the context window <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">[8]</a>.</p><p>Context engineering makes this possible. When the token count exceeds ninety percent of the context limit, OpenCode automatically summarizes the conversation. Codex assembles prompts from multiple sources and exploits prompt caching.</p><p>Anthropic recommends using structured note-taking files and sub-agent architectures to isolate context <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">[5]</a>, <a href="https://blog.bytebytego.com/p/how-openai-codex-works">[6]</a>, <a href="https://www.anthropic.com/engineering/effective-context-engineering">[9]</a>.</p><p>In Anthropic&#8217;s long-running agent pattern, an initializer agent creates a script, a progress file, and a feature list. The coding agent reads the git logs and the progress file at the start of each session and updates the progress file as it works.</p><p>The beauty? There is no database or vector store. It is just the filesystem <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">[10]</a>.</p><blockquote><p><em>Here is a deep dive into memory:</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;df376f07-24cf-47ea-b865-72794d074c9d&quot;,&quot;caption&quot;:&quot;Welcome to the AI Agents Foundations series: A 9-part journey from Python developer to AI Engineer. Made by busy people. 
For busy people.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How Does Memory for AI Agents Work?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:110559689,&quot;name&quot;:&quot;Paul Iusztin&quot;,&quot;bio&quot;:&quot;Senior AI Engineer &#8226; Founder @ Decoding AI &#8226; Author @ LLM Engineer&#8217;s Handbook I ship AI products and teach you about the process.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0714d360-396c-4b41-a676-1b58dc1dc5f3_1470x1470.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-12-02T12:03:49.149Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!G5CM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c6f2d58-f21f-4b49-b4f0-fb553fc28e36_1200x1200.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.decodingai.com/p/how-does-memory-for-ai-agents-work&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:180239220,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:76,&quot;comment_count&quot;:7,&quot;publication_id&quot;:1526003,&quot;publication_name&quot;:&quot;Decoding AI Magazine&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!k2ig!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00bc74e0-3601-49ce-8ab9-4c7b499ce597_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>Now that you have seen all the pieces, from planning and tools to sandboxes and memory, the question is what this means for how you build software.</p><h2>What&#8217;s Next</h2><p>We 
are witnessing a new way of building software. Instead of software engineers building traditional frontend and backend applications, the next generation of production software will be harnesses. Harness engineering is merging software engineering with AI, moving it one level up <a href="https://youtube.com/watch?v=zYerCzIexCg">[3]</a>.</p><p>Popular tools like Claude Code are just the beginning. In the long run, no company will want to depend entirely on proprietary harnesses. Even open-source solutions like OpenCode will not cover every specific use case.</p><p>Companies will inevitably build their own. As we experienced at ZTRON, custom systems and infrastructure are what finally make an agent work in production.</p><p>However, we must be honest about current limitations. Memory still breaks across long sessions. Validation loops still miss edge cases. Furthermore, orchestrating hundreds of parallel agents on shared codebases remains an open research problem.</p><p>Harness engineering is real engineering. Your harness becomes its own product with its own bugs, its own drift, and its own maintenance burden.</p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students saying <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><h2>References</h2><ol><li><p>LangChain. (2026, March 21). The Anatomy of an Agent Harness. LangChain Blog. <a href="https://blog.langchain.com/the-anatomy-of-an-agent-harness/">https://blog.langchain.com/the-anatomy-of-an-agent-harness/</a></p></li><li><p>Hashimoto, M. (2026, March 25). My AI Adoption Journey. Mitchell Hashimoto. <a href="https://mitchellh.com/writing/my-ai-adoption-journey">https://mitchellh.com/writing/my-ai-adoption-journey</a></p></li><li><p>Bouchard, L. (2026, March 25). What Harness Engineering Actually Means. What&#8217;s AI by Louis-Fran&#231;ois Bouchard. <a href="https://youtube.com/watch?v=zYerCzIexCg">https://youtube.com/watch?v=zYerCzIexCg</a></p></li><li><p>Govindarajan, V. (2026, March 21). OpenClaw Architecture Part 1 - The Agent Stack. The Agent Stack. 
<a href="https://theagentstack.substack.com/p/openclaw-architecture-part-1-control">https://theagentstack.substack.com/p/openclaw-architecture-part-1-control</a></p></li><li><p>Abboud, M. (2026, March 17). How Coding Agents Actually Work: Inside OpenCode. Moncef Abboud. <a href="https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/">https://cefboud.com/posts/coding-agents-internals-opencode-deepdive/</a></p></li><li><p>ByteByteGo. (2026, March 26). How OpenAI Codex Works. ByteByteGo. <a href="https://blog.bytebytego.com/p/how-openai-codex-works">https://blog.bytebytego.com/p/how-openai-codex-works</a></p></li><li><p>Anthropic. (2025, December 24). Building Effective AI Agents. Anthropic. <a href="https://www.anthropic.com/research/building-effective-agents">https://www.anthropic.com/research/building-effective-agents</a></p></li><li><p>Govindarajan, V. (2026, March 24). OpenClaw Architecture Part 3: Memory and State Ownership. The Agent Stack. <a href="https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory">https://theagentstack.substack.com/p/openclaw-architecture-part-3-memory</a></p></li><li><p>Anthropic. (2025, October 22). Effective Context Engineering for AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/effective-context-engineering">https://www.anthropic.com/engineering/effective-context-engineering</a></p></li><li><p>Anthropic. (2026, March 25). Effective Harnesses for Long-Running Agents. Anthropic. 
<a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[From 12 Agents to 1]]></title><description><![CDATA[The mental model that prevents you from overengineering your next AI system.]]></description><link>https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide</link><guid isPermaLink="false">https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 26 Mar 2026 12:01:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gnrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tlx1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tlx1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 424w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 848w, 
https://substackcdn.com/image/fetch/$s_!tlx1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1272w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tlx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png" width="1456" height="932" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1604167,&quot;alt&quot;:&quot;The complexity spectrum from workflows to single agents to multi-agent systems, with decision triggers between each stage.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The complexity spectrum from workflows to single agents to multi-agent systems, with decision triggers between each stage." title="The complexity spectrum from workflows to single agents to multi-agent systems, with decision triggers between each stage." 
srcset="https://substackcdn.com/image/fetch/$s_!tlx1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 424w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 848w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1272w, https://substackcdn.com/image/fetch/$s_!tlx1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3b68fec-e7dc-49ea-a30a-5394c6ffc2d1_1456x932.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is 2026. People across the industry still mix up words like workflows, agents, tools, and multi-agent systems. Beyond terminology, this confusion has led to massively overengineered solutions.</p><p>Teams jump to multi-agent architectures because they sound impressive and help raise money. In reality, a simple workflow would have been faster to build, cheaper to run, and easier to debug. The result is bloated systems, wasted tokens, and debugging nightmares.</p><p>Our goal is to provide a clear mental model of which architecture to choose for your AI project: workflows vs. single agents vs. multi-agent systems.</p><p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Louis-Fran&#231;ois Bouchard&quot;,&quot;id&quot;:130571458,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f-b9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c5d976-f699-4595-8b6d-6ffa3e42a5e5_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;9e0714fe-b02b-42ac-bcc8-906b6ba10cb9&quot;}" data-component-name="MentionToDOM"></span> from Towards AI has been working on this exact problem with his clients and distilled his decision framework into two YouTube videos: <a href="https://www.youtube.com/watch?v=_rO2fv6tSsQ">Stop Overengineering: Workflows vs AI Agents Explained</a> and <a href="https://www.youtube.com/watch?v=iOpLKJYOvXs">From Workflows to Multi-Agent Systems: How to Choose</a>. He allowed me to take that framework and turn it into this article. 
Kudos to Louis-Fran&#231;ois!</p><p>This decision framework is a spectrum from simple to complex that tells you exactly what to build based on your actual constraints. The goal is to stay as far left on the complexity spectrum as possible while still solving your problem.</p><p>Here is what you will learn:</p><ul><li><p>The fundamental difference between an agent and a workflow.</p></li><li><p>How to use the complexity spectrum to make architecture decisions.</p></li><li><p>When to rely on simple workflows for predictable tasks.</p></li><li><p>Why a single agent with tools is often enough for dynamic problems.</p></li><li><p>The exact breaking points that justify moving to a multi-agent system.</p></li></ul><p>To apply this spectrum effectively, you must first define the terms. Here are the core misconceptions that lead to bad architecture decisions.</p><p><em>Before we continue, a quick word from the Decoding AI team.</em> &#8595;</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Go Deeper: Your Path to Agentic AI for Production</a></h2><p>Most engineers know the theory behind agents, context engineering, and RAG. What they lack is the confidence to architect, evaluate, and deploy these systems in production. 
The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI,</a> closes that gap across 34 lessons (articles, videos, and a lot of code).</p><p>By the end, you will have gone from <em>&#8220;I built a demo&#8221;</em> to <em>&#8220;I shipped a production-grade multi-agent system with evals, observability, and CI/CD.&#8221;</em> Three portfolio projects, a certificate to back them up in interviews, and a Discord community with direct access to industry experts like <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Louis-Fran&#231;ois Bouchard&quot;,&quot;id&quot;:130571458,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f-b9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c5d976-f699-4595-8b6d-6ffa3e42a5e5_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;3ba979eb-b2db-4d6f-85c0-178e50443138&quot;}" data-component-name="MentionToDOM"></span> and me.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>34 lessons from first principles to production. Learn about context engineering, workflows, agents, evals, and the design of AI systems.</em></figcaption></figure></div><p>Rated 4.9/5 &#11088;&#65039; by 300+ early students saying <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and that it is <em>&#8220;an excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Start learning today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Start learning today</span></a></p><div><hr></div><p>&#8595; <em>Now, back to the article.</em></p><h2>Clarifying the Confusion: Not Everything Is an Agent</h2><p>The first major misconception is that every LLM application is an agent. The key difference is autonomy. In a workflow, you control the flow.</p><p>You decide the steps and their order. In an agent, the model controls the flow. It decides what to do next based on the goal you give it.</p><p>If you can write down the exact sequence of steps in advance, you are building a workflow. 
You are not building an agent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5bE2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5bE2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5bE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A side-by-side comparison of a predetermined workflow and an autonomous agent.&quot;,&quot;title&quot;:&quot;A side-by-side comparison of a predetermined workflow and an autonomous agent.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A side-by-side comparison of a predetermined workflow and an autonomous agent." title="A side-by-side comparison of a predetermined workflow and an autonomous agent." srcset="https://substackcdn.com/image/fetch/$s_!5bE2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!5bE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5af291e-e122-48ee-87c8-4ee8283b3ff9_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 1: A side-by-side comparison of a predetermined workflow and an autonomous agent, highlighting who controls the flow.</em></figcaption></figure></div><p>The second misconception is that tools are agents. A tool is a capability. It can be a calculator, a database query, a web browser, a validator, or an API call.</p><p>It can even be another LLM. An agent is the decision maker who chooses which tools to use and when.</p><p>If someone tells you they built a multi-agent system, but it is actually one model calling ten different APIs, that is not multi-agent. 
That is a single agent with ten tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hg-F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hg-F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM.&quot;,&quot;title&quot;:&quot;A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM." title="A central agent connected to multiple tools &#8212; calculator, database, web browser, validator, API, and another LLM." 
srcset="https://substackcdn.com/image/fetch/$s_!Hg-F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Hg-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f18d144-8860-4b4b-baf2-c1df3722fb66_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 2: A visual showing the distinction between tools and agents, with a central agent utilizing various tools.</em></figcaption></figure></div><p>This distinction matters. It defines how you architect, debug, and scale your system. It drives your core architecture choice between a workflow, a single agent with tools, or multiple agents.</p><h2>The Complexity Spectrum: A Mental Model for Architecture Decisions</h2><p>To make this architecture choice easier, we use a complexity spectrum. It is a slider going from the most control to the most autonomy. Your goal is to stay as far left as possible while still solving the problem.</p><p><strong>Level 1</strong> represents workflows. Here, you chain multiple LLM calls together in a predefined sequence. You control every step.</p><p><strong>Level 2</strong> represents a single agent with tools. The model makes decisions about what to do next. You have one decision maker and multiple capabilities.</p><p><strong>Level 3</strong> represents multi-agent systems. 
Here, you have multiple decision makers who need to coordinate with each other.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!idxn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!idxn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!idxn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!idxn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider.&quot;,&quot;title&quot;:&quot;A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider." title="A horizontal spectrum with three levels: Workflows, Single Agent, and Multi-Agent, with a cost and complexity slider." 
srcset="https://substackcdn.com/image/fetch/$s_!idxn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!idxn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!idxn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F294a748b-0308-412d-a1f5-9d75ff5b3bdc_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: A horizontal spectrum showing three levels of autonomy with increasing cost and complexity.</em></figcaption></figure></div><p>The core principle is straightforward. Move right on this spectrum only when you absolutely have to. Each step to the right increases costs, latency, and debugging complexity.</p><p>More LLM calls mean more tokens, more traces to follow, and more places where things can go wrong.</p><p>In practice, start simple and escalate only where things break. Write a prompt first. Test it.</p><p>Implement it with minimal complexity. Measure the results. Add what is missing.</p><p>If the model lacks information, add retrieval. If it needs calculations, add a tool. Only when you genuinely need autonomous decision-making should you reach for an agent.</p><p>Even then, start with one. The best AI systems are the simplest ones that reliably solve the problem. That usually means starting with workflows.</p><h2>When a Workflow Is the Right Answer</h2><p>Workflows are the right answer when your steps are known and stable. If the process is largely the same each time, regardless of input, a workflow is almost always the best choice.</p><p>Workflows win because they are predictable. They are easy to test because you can write unit tests for each step. They are easy to debug because you can trace exactly what happened when something goes wrong.</p><p>They are also cheap because you are not burning tokens while the model figures out what to do next.</p><p>Consider a support ticket system. A ticket comes in. You classify it.</p><p>You route it to the right team. You draft a response from templates and context. 
You validate it against the policy.</p><p>Finally, you send it. Each step might involve an LLM call, but the model does not need to decide whether to classify before routing. That is always the order.</p><p>Building this as an agent adds overhead without adding capability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Za9K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Za9K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 424w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 848w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1272w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Za9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png" width="1200" height="489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:516854,&quot;alt&quot;:&quot;A horizontal flowchart showing six sequential steps of a support ticket workflow.&quot;,&quot;title&quot;:&quot;A horizontal flowchart showing six sequential steps of a support ticket workflow.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A horizontal flowchart showing six sequential steps of a support ticket workflow." title="A horizontal flowchart showing six sequential steps of a support ticket workflow." srcset="https://substackcdn.com/image/fetch/$s_!Za9K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 424w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 848w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1272w, https://substackcdn.com/image/fetch/$s_!Za9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94971ee8-618a-40c5-8687-d0dedfa73bd6_1200x489.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: A horizontal flowchart illustrating the support ticket workflow with six sequential steps.</em></figcaption></figure></div><p>Do not underestimate workflows. They are not limited to simple sequential chains. They can include routing to pick different models based on input.</p><p>They can use parallel execution with majority voting to aggregate answers. They can also use generator-evaluator loops where one LLM generates and another validates until quality criteria are met. They can even leverage tools in designs like the orchestrator-worker. 
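</p><p>To make one of these patterns concrete, here is a minimal sketch of a generator-evaluator loop written as plain control flow. The names <code>generator_evaluator_loop</code>, <code>fake_generate</code>, and <code>fake_evaluate</code> are hypothetical, and the two stubs only stand in for real LLM calls. What matters is that the code fixes the order of steps, not a model.</p><pre><code>from typing import Callable, Tuple


def generator_evaluator_loop(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Generate, then evaluate, repeating until the draft passes or rounds run out."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)    # first LLM call (stubbed below)
        passed, feedback = evaluate(draft)  # second LLM call (stubbed below)
        if passed:
            break
    return draft


# Stubs standing in for real model calls (assumptions for illustration only).
def fake_generate(task: str, feedback: str) -> str:
    return task.upper() if feedback else task


def fake_evaluate(draft: str) -> Tuple[bool, str]:
    return (draft.isupper(), "" if draft.isupper() else "use uppercase")


result = generator_evaluator_loop("reply to ticket", fake_generate, fake_evaluate)
</code></pre><p>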
These patterns handle complex tasks without any agent overhead.</p><p>If you can write down the exact sequence of steps in advance, like a recipe, it is a workflow.</p><h2>When a Single Agent with Tools Wins</h2><p>Sometimes the order of work is not fixed. You genuinely cannot write down the steps in advance. This happens when the path changes depending on what you discover along the way.</p><p>Maybe the first API call fails, and you need to try an alternative. Maybe the retrieved data is incomplete, and you need clarification. This is what agents handle well.</p><p>When is an agent worth the risk? Anthropic offers a useful framework. Agents make sense when the task is complex enough to need autonomous decisions and valuable enough to justify the cost.</p><p>Critically, the cost of errors and the cost of discovering those errors must be low. This is why AI coding agents work so well. A human reviews the code before production, so mistakes are cheap to fix.</p><p>By contrast, a purchasing agent that accidentally buys the wrong hardware makes an expensive error. You must match your architecture to your error tolerance <a href="https://www.anthropic.com/engineering/building-effective-agents">[3]</a>.</p><p>When you do need an agent, the rule is to always start with one. A single agent with tools works best when tasks are tightly coupled and mostly sequential. It works well when global context matters, meaning step one affects step five.</p><p>It is also ideal when you need fewer than twenty tools and face strict budget or latency constraints.</p><p>Take a marketing content platform from Louis-Fran&#231;ois&#8217;s client work at Towards AI. The client wanted AI-assisted content generation for emails, text messages, and promotional materials. Their initial specification called for a multi-agent setup with over a dozen specialized agents.</p><p>They wanted an orchestrator, a request analyzer, a content generator, and many others. 
On paper, it looked clean with specialists doing specialist work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z1td!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z1td!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!z1td!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z1td!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.&quot;,&quot;title&quot;:&quot;Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform." title="Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform." 
srcset="https://substackcdn.com/image/fetch/$s_!z1td!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!z1td!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!z1td!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea9f84f-dc70-4c98-9f72-af3a1dfb68c9_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: Comparison of initial multi-agent setup versus actual single-agent solution for a marketing platform.</em></figcaption></figure></div><p>A single agent was the right call. The tasks were tightly coupled and sequential. The template choice affects the content.</p><p>Personalization depends on both content and contact data. Splitting this across multiple decision makers creates information silos and handoff errors. They did not need parallelism.</p><p>The flow was to plan, generate, validate, and fix if needed.</p><p>The key insight is that tools can be smart. A tool can have its own system prompt and use a different model. The validation tool can use its own LLM with instructions to catch errors.</p><p>The text message tool can treat character limits as deterministic engineering constraints instead of prompting problems. 
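</p><p>As a rough illustration, a text message tool could enforce its length limit in ordinary code, as in the sketch below. The 160-character limit, the name <code>fit_sms</code>, and the truncation strategy are assumptions for illustration, not details from the client project.</p><pre><code>SMS_LIMIT = 160  # assumed limit for a single-segment message


def fit_sms(text: str, limit: int = SMS_LIMIT) -> str:
    """Return text unchanged if it fits, else cut at a word boundary and add an ellipsis."""
    text = " ".join(text.split())  # collapse whitespace deterministically
    if limit >= len(text):
        return text
    cut = text[: limit - 3].rsplit(" ", 1)[0]  # keep room for "..."
    return cut + "..."


short = fit_sms("Sale ends tonight!")
long_message = fit_sms("word " * 100)
</code></pre><p>Because the check is plain code, the same input always produces the same output, and the agent never has to be prompted into counting characters.</p><p>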
You get specialists, but you keep one brain to maintain context and make final decisions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vqMr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vqMr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vqMr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;An agentic loop diagram showing how a single agent plans, executes, and reflects.&quot;,&quot;title&quot;:&quot;An agentic loop diagram showing how a single agent plans, executes, and reflects.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An agentic loop diagram showing how a single agent plans, executes, and reflects." title="An agentic loop diagram showing how a single agent plans, executes, and reflects." srcset="https://substackcdn.com/image/fetch/$s_!vqMr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!vqMr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b7f3d18-2a3d-48a3-8442-704aa5ba8c34_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 6: An agentic loop diagram showing how a single agent plans, executes, and reflects.</em></figcaption></figure></div><p>This results in a system that is faster to build, cheaper to run, and easier to debug. You get the same capabilities without the coordination overhead.</p><h2>The Tool Count Problem: When One Agent Isn&#8217;t Enough</h2><p>As your tool list grows, tool selection gets harder. This is one of the main ways agent systems quietly break down. It is also one of the clearest signals that splitting into multiple agents might be worth it.</p><p>Every tool has a name, description, and schema that the model needs in context to use correctly. 
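</p><p>A quick way to see what those definitions cost is to estimate their size before any task content is added. The sketch below is only an approximation: the tool definitions are made up, and the four-characters-per-token ratio is a common rule of thumb rather than a real tokenizer.</p><pre><code>import json


def estimate_tool_tokens(tools: list, chars_per_token: int = 4) -> int:
    """Approximate the tokens spent on tool definitions using a crude character ratio."""
    serialized = json.dumps(tools)
    return len(serialized) // chars_per_token


def make_tool(index: int) -> dict:
    # Hypothetical tool definition with the three parts mentioned in the text.
    return {
        "name": f"tool_{index}",
        "description": "Sends a templated message to a contact list.",
        "schema": {"type": "object", "properties": {"contact_id": {"type": "string"}}},
    }


ten_tools = estimate_tool_tokens([make_tool(i) for i in range(10)])
twenty_five_tools = estimate_tool_tokens([make_tool(i) for i in range(25)])
</code></pre><p>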
The more tools you add, the more of your context budget you burn before the agent even starts thinking about the actual task. On top of that, you still have to fit system instructions, few-shot examples, retrieved documents, and conversation history.</p><p>A single agent tends to work best with roughly 10 to 20 tools. Past that threshold, tool selection degrades. The agent has to choose among too many options in an already packed context.</p><p>The underlying mechanism is known as context rot. LLM performance measurably degrades as context grows, well before hitting the advertised limit. Two forces drive this issue.</p><p>First, more context means more noise competing for the model&#8217;s attention. Second, models suffer from the &#8220;lost in the middle&#8221; effect. They tend to attend more to the beginning and end of their context, underweighting information in the middle.</p><p>As your tool schemas and instructions pile up, the model gets worse at picking the right tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YqbJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YqbJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The context window budget problem &#8212; comparing 10 tools vs 25 tools.&quot;,&quot;title&quot;:&quot;The context window budget problem &#8212; comparing 10 tools vs 25 tools.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The context window budget problem &#8212; comparing 10 tools vs 25 tools." title="The context window budget problem &#8212; comparing 10 tools vs 25 tools." 
srcset="https://substackcdn.com/image/fetch/$s_!YqbJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!YqbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F584e3d74-3c06-40b9-9ed8-ccee10d8da97_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: The context window budget problem: more tools mean less room for actual task reasoning.</em></figcaption></figure></div><p>Context management can trim history and retrieved content, but not the tool schema load. Those definitions must be present on every call. The only approach that actually reduces how many tool definitions the model sees per call is splitting across agents.</p><p>If one agent sees only email tools and another sees only validation tools, each call stays smaller. Tool selection gets easier. Once you split tools across agents to keep calls small, you enter multi-agent territory.</p><h2>When Multi-Agent Is Actually the Right Call</h2><p>Go multi-agent for specific reasons, not because the architecture sounds impressive. There are four legitimate ones.</p><p>First, you need true parallelism where tasks are genuinely independent and run simultaneously. Second, you face context overload where the accumulated instructions and tools degrade performance. Third, you need modularity to connect with third-party agent systems you do not control. Fourth, you have hard separation requirements like security boundaries or sensitive data handling.</p><p>Consider the professional article generation system that Louis-Fran&#231;ois and I built as one of the projects for our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering course</a>. 
We started with a single agent for research and writing but had to pivot because the two phases have fundamentally different needs.</p><p>The research phase is exploratory and dynamic. It needs flexibility and broad tool access across web search, video transcription, and document processing. The agent searches, reads, pivots based on what it finds, and iterates based on human feedback.</p><p>The writing phase is constrained and deterministic. It needs focused constraints, consistent style enforcement, and iterative refinement against fixed rubrics.</p><p>These agents communicate through explicit artifacts. The research agent produces a structured markdown file that the writer agent consumes as context. There is no complex runtime orchestration.</p><p>It is just a sequential handoff with a clear contract between them. Each agent has its own optimized context without the bloat of carrying the other&#8217;s tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8P9n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8P9n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!8P9n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8P9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The article generation multi-agent system with Research Agent and Writing Agent.&quot;,&quot;title&quot;:&quot;The article generation multi-agent system with Research Agent and Writing Agent.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The article generation multi-agent system with Research Agent and Writing Agent." title="The article generation multi-agent system with Research Agent and Writing Agent." 
srcset="https://substackcdn.com/image/fetch/$s_!8P9n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!8P9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d7b4acc-9771-4b65-a99e-4a48eb2a278b_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 8: The article generation multi-agent system with a Research Agent, a Writing Agent, and an artifact handoff.</em></figcaption></figure></div><p>If you do go multi-agent, we recommend combining plan-and-execute with the orchestrator-worker pattern. You do not want everyone talking to everyone. One orchestrator maintains the main context, delegates specific tasks to worker agents, and then synthesizes the results.</p><p>This prevents the information silos that kill multi-agent systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gnrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gnrt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Orchestrator-Worker pattern with delegation and result arrows.&quot;,&quot;title&quot;:&quot;The Orchestrator-Worker pattern with delegation and result arrows.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Orchestrator-Worker pattern with delegation and result arrows." title="The Orchestrator-Worker pattern with delegation and result arrows." 
srcset="https://substackcdn.com/image/fetch/$s_!Gnrt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Gnrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b250f4b-eb33-479b-9791-4fa8a58ae816_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 9: The Orchestrator-Worker pattern with no direct communication between workers.</em></figcaption></figure></div><p>Multi-agent systems can simplify individual contexts and enable specialization. However, they increase coordination costs. You will face more token usage, added latency, more failure points, and handoff complexity.</p><p>Only accept those costs when you hit a real constraint that simpler architectures cannot solve.</p><h2>To Wrap Up</h2><p>To build reliable AI applications, you must stay as far left on the complexity spectrum as possible while still solving your problem.</p><p>Keep these key takeaways in mind:</p><ul><li><p>Not every LLM application is an agent, and not every tool is an agent.</p></li><li><p>Always start with workflows because they are predictable, cheap, and testable.</p></li><li><p>Use one agent when the path cannot be predetermined, but keep the tool count manageable.</p></li><li><p>Move to multi-agent architectures only when you hit a real constraint like true parallelism or context overload.</p></li></ul><p>Each step right on the spectrum increases cost, latency, and debugging complexity. 
The simplest system that reliably solves the problem is always the best system.</p><blockquote><p>&#128161; If you want <strong>a step-by-step framework to help you decide what architecture to pick for your next project,</strong> Louis-Fran&#231;ois and the Towards AI team put together a <strong><a href="https://academy.towardsai.net/products/digital_downloads/agents-cheatsheet?ref=b3ab31">free cheatsheet</a></strong> that walks you through the decision process from workflows to multi-agent systems.</p></blockquote><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/from-12-agents-to-1-ai-agent-architecture-decision-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students saying <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[The AI Evals Roadmap I Wish I Had]]></title><description><![CDATA[From vibe checking to trusted agents in production]]></description><link>https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had</link><guid isPermaLink="false">https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 24 Mar 2026 12:04:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RTZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RTZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RTZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RTZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191463108?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RTZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!RTZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caacf1c-71bf-48f1-8b03-ff89346e15f8_1200x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</p><p>AI Evals is the topic most AI engineers know they should invest in, but do not know where to start. I remember struggling with this myself.</p><p>I did not know how to properly integrate evals into my app until I understood there are three core layers: optimization during development, regression testing before merging, and production monitoring on live traffic. Once that clicked, everything else fell into place.</p><p>I did not know how to build LLM judges and evaluators that I could actually trust and use. 
Every guide I found either hand-waved the details or dumped a generic &#8220;helpfulness&#8221; metric and moved on. Instead, I needed evaluators grounded in my actual business requirements.</p><p>I did not know how to gather custom datasets without wasting too much time. I tried generating hundreds of synthetic test cases up front, but the real unlock came from learning how to organically grow a high-quality dataset from production data, starting small and letting the error-analysis flywheel do the heavy lifting.</p><p>The information was scattered across blog posts, talks, and vendor docs. Most of it focused on isolated techniques without showing how everything connects. I built this series as the structured, end-to-end guide I wish I had.</p><p>This 7-lesson series breaks it all down from first principles. By the end, you will know how to integrate AI evaluations that actually track and improve your product's performance. No vibe checking required.</p><p>The series follows a natural progression. You start by understanding where evals fit. Then, you build the dataset.</p><p>Next, you design and validate the evaluators. Finally, you handle specialized domains like RAG and see how it all works in production.</p><p>You can read front-to-back for the full journey. Alternatively, jump to the lesson that matches your current pain point. 
Each lesson stands on its own but references the others.</p><p>Without further ado, here are the 7 lessons of the series:<br><em>(Scroll down to learn more about each lesson.)</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p><em>Everything is completely free, without any hidden costs, thanks to our sponsor, Opik</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the open-source LLMOps platform used by Uber, Etsy, Netflix, and more.</p><p>We use Opik daily across our courses and AI products. 
Not just for observability, but as our <strong>end-to-end evaluation harness</strong>, all from the same platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yCWf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yCWf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 424w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 848w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1272w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yCWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png" width="1764" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1764,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191463108?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce72ecb-bb9c-42b8-98eb-e99d51a624d4_1784x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yCWf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 424w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 848w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1272w, https://substackcdn.com/image/fetch/$s_!yCWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe3381bc-dda5-4624-8bb9-a961bb331c6d_1764x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25K spans/month free)</figcaption></figure></div><p>This series teaches you how to build evals from scratch (custom datasets, LLM judges, optimization loops, and production monitoring), while Opik gives you the platform to run everything at scale.</p><p><em>Here is how we use it:</em></p><ul><li><p><strong>Custom LLM judges</strong>: Build evaluators by defining your criteria, adding few-shot examples, and running them across hundreds of traces automatically.</p></li><li><p><strong>Run experiments, compare results</strong>: Test different prompts, models, or parameters from your AI app side by side. 
Opik scores each variant with your evaluators and shows you which one wins.</p></li><li><p><strong>Plug evaluators into production</strong>: The same LLM judges you design for offline testing run on live traces too. Set up alarms when scores drop below your threshold so you catch regressions before users do.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully <strong>open-source</strong> and works with custom code and with every popular AI framework or tool (<em>including OpenClaw</em>). You can also use the managed version for free (with 25K spans/month on their generous free tier):</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><h2>Lesson 1: Integrating AI Evals Into Your AI App</h2><p>To build a reliable system, you first need to know where evaluation fits into the development lifecycle.</p><p>Most teams start by <em>&#8220;vibe checking&#8221;</em> their AI app. They manually test a few inputs and eyeball whether the outputs look right. That works for the first version.</p><p>But the moment you start adding features, onboarding real users, or trying to improve existing capabilities, vibe checking collapses. 
This first article gives you the holistic map of where AI Evals fit, so you never feel lost again.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y_0d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" width="1200" height="1075" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1075,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The three core scenarios where evals matter: optimization during development, regression testing before merging, and production monitoring on live traffic.</p></li><li><p>The difference between guardrails and evaluators. 
Confusing them leads to gaps in your system.</p></li><li><p>The minimum viable tech stack required to start: a custom annotation tool and an LLMOps platform.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app&quot;,&quot;text&quot;:&quot;Go to Lesson 1&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app"><span>Go to Lesson 1</span></a></p><h2>Lesson 2: Build an AI Evals Dataset from Scratch</h2><p>Once you understand where evals fit, the next step is gathering the data required to measure performance.</p><p>You cannot evaluate what you cannot measure. You cannot measure without data. Most teams either skip this step entirely or fire off a generic prompt to create 100 test cases and call it done.</p><p>This article teaches the error analysis framework. It is a practical flywheel that turns 20-50 real production traces into a growing, high-quality evals dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HoRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HoRg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!HoRg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HoRg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!HoRg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The error analysis flywheel: sample traces, label manually, build evaluators iteratively, 
perform error analysis, and create specialized evaluators.</p></li><li><p>Why one &#8220;<em>benevolent dictator&#8221;</em> should own labeling consistency across your team.</p></li><li><p>How to graduate from generic to specialized evaluators as your understanding deepens.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis&quot;,&quot;text&quot;:&quot;Go to Lesson 2&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis"><span>Go to Lesson 2</span></a></p><h2>Lesson 3: Generate Synthetic Datasets for AI Evals</h2><p>Production traces alone have limits. You need traffic to get data, and that traffic rarely covers every scenario. What about before you have users?</p><p>What about rare failure modes you have never seen in production? Yet! Synthetic data solves the cold start problem and fills coverage gaps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FVJv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FVJv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 424w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 848w, 
https://substackcdn.com/image/fetch/$s_!FVJv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png" width="1200" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel" title="Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel" srcset="https://substackcdn.com/image/fetch/$s_!FVJv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 424w, 
https://substackcdn.com/image/fetch/$s_!FVJv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 848w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>Why you should generate only inputs, not outputs, and let your real app produce the outputs.</p></li><li><p>How to think in dimensions like persona, feature, scenario, and input modality to avoid mode collapse.</p></li><li><p>Tester agents for simulating multi-turn conversations.</p></li><li><p>The reverse workflow for RAG: generate questions from your knowledge base, not the other way around.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals&quot;,&quot;text&quot;:&quot;Go to Lesson 3&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals"><span>Go to Lesson 3</span></a></p><h2>Lesson 4: How to Design Evaluators</h2><p>You have the dataset. Now you need evaluators that can actually tell you whether your app is working. This is where most teams make their biggest mistake.</p><p>They grab a generic helpfulness metric off the shelf and call it done. 
This article teaches you how to design evaluators grounded in your actual business requirements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a1uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Designing evaluators for AI applications: from code-based checks to LLM judges.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Designing evaluators for AI applications: from code-based checks to LLM judges." title="Designing evaluators for AI applications: from code-based checks to LLM judges." srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The evaluation harness: the infrastructure that automates running evaluators across your dataset.</p></li><li><p>When to use fast, deterministic code-based evaluators versus flexible, nuanced LLM judges.</p></li><li><p>Common design mistakes.</p></li><li><p>Advanced designs for multi-turn conversations and agentic workflows.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures&quot;,&quot;text&quot;:&quot;Go to Lesson 4&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures"><span>Go to 
Lesson 4</span></a></p><h2>Lesson 5: How to Evaluate the Evaluator</h2><p>You built an evaluator. It says everything is great. But is it?</p><p>An evaluator that validates every output is worse than no evaluator at all. It gives you false confidence. This article teaches you how to validate your evaluator against human judgment and close the gap when they disagree.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1am-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The evaluator validation workflow&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The evaluator validation workflow" title="The evaluator validation workflow" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The iterative refinement loop: measure alignment, diagnose disagreements, adjust few-shot examples, and re-measure.</p></li><li><p>Dealing with non-determinism: why LLM judges give different answers on the same input, and how to stabilize them.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge&quot;,&quot;text&quot;:&quot;Go to Lesson 5&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge"><span>Go to Lesson 5</span></a></p><h2>Lesson 6: RAG Evaluation: The Only 6 Metrics You Need</h2><p>After mastering general evaluators, you can apply these principles to specific architectures like RAG.</p><p>RAG 
evaluation feels overwhelming because everyone proposes different metrics. But it does not have to be complicated. This article proves that there are exactly three variables in any RAG system: Question, Context, and Answer.</p><p>There are exactly six possible relationships between them. That is it. Every RAG metric maps to one of these six relationships.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gtpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." title="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The three RAG variables and six exhaustive relationships.</p></li><li><p>Tier 1: Retrieval metrics. 
If retrieval is broken, nothing else matters.</p></li><li><p>Tier 2: The three core RAG metrics you always need.</p></li><li><p>Tier 3: When core metrics cannot explain the failure.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/rag-evaluation-6-metrics-framework&quot;,&quot;text&quot;:&quot;Go to Lesson 6&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework"><span>Go to Lesson 6</span></a></p><h2>Lesson 7: Lessons from 6 Months of Evals on a Production AI Companion</h2><p>Theory and isolated metrics are useful. But the ultimate test is running this entire system on live user traffic.</p><p>The first six articles teach you how to build the system. This final article shows you what it looks like after six months of running it in production.</p><p>Written as a guest post by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Alejandro Aboy&quot;,&quot;id&quot;:22949723,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de90c745-7f5a-404e-b2d6-eaab9420dd98_881x881.png&quot;,&quot;uuid&quot;:&quot;1ba8f91f-628f-41c3-883d-003ee4b9e225&quot;}" data-component-name="MentionToDOM"></span>, Senior Data Engineer at Workpath, it shares the real lessons. 
It covers what worked, what failed, and what they wish they had known from the start.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0pKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0pKO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 424w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 848w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1272w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" width="616" height="672.2692307692307" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:616,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0pKO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 424w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 848w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1272w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is what you will learn:</p><ul><li><p>The three observability problems most teams hit: falling for generic metrics, skipping manual annotation, and not treating AI agents as data products.</p></li><li><p>How to use Opik&#8217;s architecture, including traces, spans, threads, and prompt versioning, for production monitoring and evals.</p></li><li><p>How to reverse-engineer evaluation criteria from real traces instead of guessing upfront.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/behind-the-scenes-of-ai-observability&quot;,&quot;text&quot;:&quot;Go to Lesson 7&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability"><span>Go to Lesson 7</span></a></p><h2>How to Take the Course?</h2><p>After completing these seven articles, you will have the complete mental model for AI 
Evals. You will understand everything from strategy to production.</p><p>As the course is 100% free, with no hidden costs or registration required, taking it is a no-brainer.</p><p>Each lesson is a free article hosted on the <a href="https://www.decodingai.com/t/ai-evals-and-observability">Decoding AI Magazine</a>.</p><p>Just open the lessons in the order listed below, and you are good to go:</p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>Each lesson will guide you through the required steps.</p><p>Enjoy!</p><h2>Now What?</h2><p>After completing these lessons, if you want the information to stick, you have to put everything into practice by building a cool project!</p><p>I am sorry to say there is no other way to make learning worthwhile. Pick one problem and get your hands dirty with a project.</p><p><strong>&#128161;</strong><em><strong> Want to share your work on my socials with my 140k+ audience?</strong> If you build a project you are excited about, I will be too. Trust me! I love seeing people build cool stuff. 
To share it, you can contact me <a href="https://www.pauliusztin.ai/contact">here</a>.</em></p><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/the-ai-evals-roadmap-i-wish-i-had?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. 
Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. 
</strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oSDm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Agentic AI Engineering Guide]]></title><description><![CDATA[The 6 critical mistakes that silently destroy agentic systems]]></description><link>https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 19 Mar 2026 12:03:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dUK-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have spent the past two years building and breaking AI agents in production. Along the way, I have seen the same patterns destroy systems over and over. This happens not because the models are bad, but because the system design is wrong.</p><p>Most agents fail silently. They work well in demos but drift unpredictably in production. Costs spike with no clear explanation.</p><p>Behavior becomes erratic, and every release feels risky. Ultimately, teams end up stuck in PoC purgatory, unable to ship, debug, or trust their own system.</p><p>The root cause is almost never the model. 
It is subtle system design mistakes that individually look small but compound into production disasters.</p><p>To fix this, together with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Louis-Fran&#231;ois Bouchard&quot;,&quot;id&quot;:130571458,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f-b9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c5d976-f699-4595-8b6d-6ffa3e42a5e5_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;f96a24cf-9c78-4536-a64d-226395c6b6bb&quot;}" data-component-name="MentionToDOM"></span>, we created a <strong>diagnostic framework for six specific mistakes that cause agentic systems to break in production.</strong> Each has a clear problem, a reason why it happens, and a proven fix. Once you know what to look for, you can trace most production failures back to one of these patterns.</p><p>The first and most common failure starts right at the input level, where engineers mishandle the context window.</p><h2>Mistake #1: Treating the Context Window as an Afterthought</h2><p>When something breaks, the instinct is to add more context. Engineers add more rules, more history, more tools, and more examples. The assumption is that if the model sees everything, it will behave better.</p><p>But this turns the context window into a dumping ground instead of a carefully scoped working memory. As the context grows, the model starts to ignore instructions and apply constraints inconsistently. It hallucinates more and drifts across runs.</p><p>Latency spikes and costs compound. The accuracy loss is commonly known as the &#8220;lost in the middle&#8221; problem, where a model under-uses information buried deep in a long context. 
Many teams respond by splitting one giant prompt into dozens of smaller ones.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C1jF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C1jF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 424w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 848w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1272w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C1jF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png" width="1096" height="549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1096,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C1jF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 424w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 848w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1272w, https://substackcdn.com/image/fetch/$s_!C1jF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b1db21-8c96-487b-9fe2-53e071434fc8_1096x549.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>But that introduces its own problems, such as more LLM calls, higher latency, and harder debugging.</p><blockquote><p>&#128161; <em>Treat the context window as a scarce resource.</em></p></blockquote><p>Every LLM call should have one clearly scoped job. You must curate context aggressively by selecting, compressing, and pruning before every call. 
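</p><p>As a minimal sketch of that select, compress, and prune step (an illustration, not the implementation behind this article): <code>ContextItem</code>, <code>estimate_tokens</code>, <code>compress</code>, and <code>curate_context</code> are hypothetical names, the relevance scores are assumed to come from your own ranking step, the four-characters-per-token estimate is a rough assumption, and plain truncation stands in for real summarization.</p>

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    """One candidate piece of context (hypothetical structure)."""
    text: str
    relevance: float  # 0.0 to 1.0, assumed to come from your own ranking step


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def compress(text: str, max_chars: int) -> str:
    """Stand-in for compression: truncate long items.

    A real system would more likely summarize with a cheap model call.
    """
    if len(text) > max_chars:
        return text[:max_chars].rstrip() + "..."
    return text


def curate_context(
    items: list[ContextItem],
    token_budget: int,
    min_relevance: float = 0.5,
    max_chars: int = 200,
) -> list[str]:
    """Select, compress, and prune candidates to fit a token budget."""
    # Select: keep only items relevant enough for the next decision.
    selected = [item for item in items if item.relevance >= min_relevance]
    # Most relevant first, so the budget is spent on what matters most.
    selected.sort(key=lambda item: item.relevance, reverse=True)

    curated: list[str] = []
    used = 0
    for item in selected:
        # Compress: shrink each item before it costs tokens.
        text = compress(item.text, max_chars)
        cost = estimate_tokens(text)
        # Prune: skip anything that would push us over the budget.
        if used + cost > token_budget:
            continue
        curated.append(text)
        used += cost
    return curated


if __name__ == "__main__":
    candidates = [
        ContextItem("User asked to compare Q3 and Q4 revenue.", 0.95),
        ContextItem("Full chat history from three weeks ago. " * 20, 0.2),
        ContextItem("Q3 revenue table row. " * 30, 0.8),
    ]
    for chunk in curate_context(candidates, token_budget=80):
        print(estimate_tokens(chunk), chunk[:60])
```

<p>The budget is enforced in code before every call, so irrelevant or oversized items never reach the prompt. In a real system you would likely swap the truncation for a summarization call and the character heuristic for your model&#8217;s tokenizer.</p><p>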
Move persistence into a memory layer.</p><p>The context window holds only what matters for the next decision, and everything else lives in memory, which you write to and read from continuously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dUK-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dUK-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dUK-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dUK-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!dUK-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff23767fe-eb70-41ea-89c6-3f403021f221_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>As a rule of thumb, start with a single prompt. If it works, stop. If it fails, do not jump to agents.</p><p>Introduce a small number of specialized steps and tune until you find the right balance. Context engineering is about deliberate selection.</p><p>Once the context window is under control, the next trap is overengineering the architecture before the problem demands it.</p><h2>Mistake #2: Starting with Complicated Solutions</h2><p>You have a clear problem, so you immediately reach for multi-agent architectures or heavy frameworks. You build RAG pipelines, hybrid retrieval, and multiple databases, or adopt new protocols like MCP. You do this not because the problem demands it, but because it feels like the right way to build serious AI.</p><p>Every layer adds a hidden tax. 
You get more dependencies, higher latency, higher costs, and harder debugging. Complexity compounds operational pain.</p><p>Teams end up spending months building infrastructure and shipping nothing.</p><p>At our startup, ZTRON, we built a multi-index RAG system. We had OCR pipelines, separate embedding pipelines, hybrid retrieval, and agentic RAG loops.</p><p>It worked, but simple queries took 10 to 15 seconds. Costs climbed, and debugging was a nightmare.</p><p>When we finally asked if we actually needed all this, the answer was no. Our data fit within modern context windows. We replaced agentic RAG with cache-augmented generation (CAG) for most workflows.</p><p>This gave us fewer LLM calls, lower latency, fewer errors, and an easier system to debug.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2ULn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2ULn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 424w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 848w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2ULn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2ULn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png" width="1024" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:314336,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2ULn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 424w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 848w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 
1272w, https://substackcdn.com/image/fetch/$s_!2ULn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27878c4a-d98c-4819-ac87-ad57e8a042c2_1024x796.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Start with the simplest solution that could work. Prove the core task works first. 
Only add memory, tools, retrieval, or multiple agents when the problem demands it.</p><p>Production-grade AI is built by engineers who ship simple systems first and scale complexity intentionally.</p><p>Earning complexity often means realizing that you do not need an agent at all, which brings us to the third mistake.</p><h2>Mistake #3: Building Agents When a Workflow Will Do</h2><p>Predictable tasks like data ingestion, summarization, or report generation need predictable execution. That is a workflow. Open-ended tasks like deep research or dynamic decision-making under uncertainty may need autonomy.</p><p>Agents handle these open-ended scenarios. Yet most teams treat predictable problems as if they need agents. When you use an agent for a structured task, you pay for autonomy you do not need.</p><p>You get unpredictable behavior, variable latency, higher token usage, and inconsistent outputs. The system works 80% of the time and fails when it matters most.</p><p>Workflows and agents are not a binary choice. They sit on a spectrum known as the autonomy slider. 
More autonomy buys flexibility but costs predictability, cost control, and debuggability.</p><p>You must set the slider intentionally.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-tHI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-tHI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 424w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 848w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1272w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-tHI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png" width="1170" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-tHI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 424w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 848w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1272w, https://substackcdn.com/image/fetch/$s_!-tHI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d9a94a6-742b-4f35-9edb-87a0649194cf_1170x816.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Adopt a workflow-first approach. Start with prompt chaining, routing, parallelization, or an orchestrator-worker pattern. Introduce agents only when the system must autonomously plan, explore unknown paths, or recover from failures dynamically.</p><p>For vertical AI agents, use a hybrid approach. Route known patterns to workflows and open-ended requests to agents.</p><p>Whether you use a workflow or an agent, you must handle the data it produces, which exposes a flaw in how engineers process outputs.</p><h2>Mistake #4: Fragile Parsing of LLM Outputs</h2><p>You ask the model for something structured, and it responds with something that looks structured. You parse it with regex, string splitting, or custom logic. It works in staging.</p><p>Then one day, a missing comma or a different bullet style crashes production. LLMs are non-deterministic. 
Even with identical prompts, output can drift due to context changes, model updates, or variations in tool outputs.</p><p>Fragile parsing is a time bomb. Many teams respond by prompting the model to output JSON. That is better than free-form text, but it still is not a contract.</p><p>You still get missing keys, wrong types, and drifting nested fields.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-NP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-NP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 424w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 848w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6-NP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png" width="1175" height="1036" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1036,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233234,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6-NP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 424w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 848w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!6-NP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F825874b4-daca-42ce-8f09-50f116c21a68_1175x1036.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Stop treating LLM outputs like text and start treating them like data. Define a schema, enforce it at generation time, validate at runtime, and fail fast when the output does not match. Use Pydantic as the bridge between probabilistic generation and deterministic code.</p><p>But only use structured outputs when structure is required. If you only need a plain string, accept a string. When you do need a schema, keep it shallow and minimal.</p><p>If you have secured your context, simplified your architecture, chosen the right autonomy, and enforced output schemas, you are ready to build an agent. However, many teams still fail by omitting actual planning from their loops.</p><h2>Mistake #5: Forgetting Agents Need Planning</h2><p>You give a model tools, let it pick one, feed the tool output back, and repeat. At a glance, it looks agentic, but it is just a workflow with randomness. 
The system is reacting to the last tool output, not driving toward a goal.</p><p>Without embedded planning, the loop cannot decompose tasks into meaningful steps. It cannot evaluate progress or choose next actions intentionally. The result is random behavior, unnecessary tool calls, infinite loops, and shallow reasoning.</p><p>Copying ReAct or Plan-and-Execute from blog posts without adapting them to your domain makes it worse.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UaKc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UaKc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 424w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 848w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1272w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UaKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png" width="1175" height="1014" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1014,&quot;width&quot;:1175,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UaKc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 424w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 848w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1272w, https://substackcdn.com/image/fetch/$s_!UaKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95526d4d-1c95-4cc4-9722-bbe9a8955f47_1175x1014.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>You must embed planning into the loop. Before calling a tool, require a reasoning step. Ask what the goal is, what the next best action is, and what evidence you need.</p><p>Add progress checks and stop conditions like max steps, token budgets, and escalation when stuck. Make planning use-case specific, because generic ReAct is not a product. 
Tailor planning to your tools, data, constraints, and failure modes.</p><p>Even a well-planned agent will degrade over time if you do not measure its performance continuously.</p><h2>Mistake #6: Not Starting with AI Evals from Day Zero</h2><p>You build features without tracking how well your AI behaves. You have no tests, no evaluation metrics, and no defined success criteria. Every new feature is a gamble, and teams silently ship regressions.</p><p>AI systems do not fail all at once. They decay. A prompt change, a new tool, or a model upgrade causes subtle behavior shifts.</p><p>Without evals, nobody can answer whether a change made the system better or worse. Teams get stuck relying on vibe evals: manual, gut-feel testing that does not scale. Many teams think they are doing evals, but rely on generic scores like helpfulness or 1-5 star scales.</p><p>A helpfulness score of 3.7 tells you nothing about what to fix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c0La!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c0La!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 424w, https://substackcdn.com/image/fetch/$s_!c0La!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 848w, 
https://substackcdn.com/image/fetch/$s_!c0La!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!c0La!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c0La!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png" width="1060" height="1010" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1010,&quot;width&quot;:1060,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248948,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191159222?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c0La!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 424w, 
https://substackcdn.com/image/fetch/$s_!c0La!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 848w, https://substackcdn.com/image/fetch/$s_!c0La!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!c0La!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7c9395e-8c34-4669-9e5a-4b3e56f2c934_1060x1010.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a></figure></div><p>Use evals as your north star. Define task-specific, binary metrics tied to real system behavior and business requirements from day one. Use evals to drive the optimization flywheel.</p><p>Integrate evals into your development workflow to catch regressions before users do.</p><p>Recognizing these six mistakes is the first step to escaping PoC purgatory.</p><h2>Conclusion</h2><p>These six mistakes are not exotic edge cases. They are the exact patterns that repeatedly break real agentic systems. Individually, they look small, but in production, they compound into disasters.</p><p>Each of these mistakes deserves a deeper breakdown with real examples and production-tested fixes. That is why we turned them into a <strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">free 6-day email course</a></strong>. We cover one mistake per day, with the exact patterns and solutions we use in production.</p><p><strong>&#128161;</strong><em><strong> If you want the complete breakdown, sign up <a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">here</a>.</strong></em></p><p>Otherwise, see you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/agentic-ai-engineering-guide-6-mistakes?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. 
By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students saying <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Why RAG Has Exactly 6 Failure Modes. 
No More, No Less.]]></title><description><![CDATA[A complete guide for evaluating your retrieval-augmented generation systems.]]></description><link>https://www.decodingai.com/p/rag-evaluation-6-metrics-framework</link><guid isPermaLink="false">https://www.decodingai.com/p/rag-evaluation-6-metrics-framework</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 17 Mar 2026 12:03:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gtpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a><strong> </strong></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a> </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals </a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a> </p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><em><strong>RAG Evaluation: The Only 6 Metrics You Need</strong> &#8592; You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><em>Let&#8217;s get started.</em></p><div><hr></div><h2>RAG Evaluation: The Only 6 Metrics You Need</h2><p>In our previous article, we covered how to validate your AI judges. We measured agreement with human judgment and iterated until alignment was high. Thus, you can now deploy with confidence.</p><p>However, a specialized challenge exists that general-purpose grading tools do not fully address. Evaluating RAG systems introduces a third variable, specifically the retrieved context. 
With this new element comes a distinct set of failure modes requiring their own metrics.</p><p>I am currently building a financial personal assistant at the stealth AI startup I work for. The application runs heavily on RAG. It pulls financial data from Postgres and integrates with external services such as email, Customer Relationship Management (CRM) tools, and cloud drives.</p><p>When it came time to evaluate the system, building the dataset proved harder than choosing metrics. Fortunately, we had a domain expert on the team who manually tested the application from the start. Therefore, we translated all of that Quality Assurance (QA) work into our AI evals collection using the error analysis workflow from <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>.</p><p>Evaluating RAG systems introduces a unique difficulty. Each data sample required the correct context to be loaded into the database. We solved this by coupling every test case with a Postgres SQL export.</p><p>This file contained documents, chunks, embeddings, and metadata. We injected it directly into the storage system. This effectively created a cache that bypassed the ingestion pipeline during evals.</p><p>Once the data was in place, implementing the core RAG metrics became straightforward. We used tools like <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> and foundational models like Gemini Pro as the LLM judge. We had the context, the query, and the answer, which is everything you need.</p><p>What surprised me was that not every capability needed this level of dissection. For our report generation feature, we expect an exact format with specific values pulled from the storage. 
Checking the final document against a ground truth served as a better proxy than tracing every retrieval step.</p><p><em>Sometimes assessing the destination matters more than checking the route.</em></p><p>RAG evaluation feels needlessly complex. Vendors have an incentive to make it difficult. Every framework ships with many metrics and a dashboard, making you feel like you need a PhD to know if your system works.</p><p>Underneath all the complexity, <strong>RAG systems have exactly three core components</strong>: the Question (Q), the retrieved Context (C), and the generated Answer (A). Three components form three pairs, and each pair can be read in either direction, so there are <strong>exactly six possible relationships</strong> between them. When your RAG system fails, it breaks along one of these six relationships every single time.</p><p>The beauty of this framework is its exhaustive nature. There are no hidden variables.</p><p>You do not always need to evaluate all six individually. For core conversational features, you need the primary metrics because there are many silent failure modes. 
However, for structured output tasks, an end-to-end check against expected results can be sufficient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gtpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." title="The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer." srcset="https://substackcdn.com/image/fetch/$s_!gtpu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gtpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2378513-3c4f-4119-92d0-1cd651bb2be3_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption 
class="image-caption"><em>Image 1: The six exhaustive relationships between the three RAG variables &#8212; Question, Context, and Answer.</em></figcaption></figure></div><p><strong>Here is what you will learn in this article:</strong></p><ul><li><p>The only six relationships that exist in a RAG system.</p></li><li><p>How to evaluate your retrieval step before looking at generation.</p></li><li><p>The three core metrics every RAG application needs.</p></li><li><p>Advanced metrics for diagnosing subtle hallucinations.</p></li><li><p>How to match evaluation frequency and strictness to your domain.</p></li><li><p>How to collect and prepare the data your evaluators need.</p></li></ul><p><em>Before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. </p><p>We use Opik daily across our courses and AI products. Not just for observability, but to design and run the exact RAG evaluators this article teaches. 
All from the same platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y-VS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y-VS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 424w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 848w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1272w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png" width="869" height="450" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ad35092-8407-4336-9b50-972f57252a3d_869x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:869,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92542,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/191141901?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y-VS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 424w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 848w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1272w, https://substackcdn.com/image/fetch/$s_!Y-VS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ad35092-8407-4336-9b50-972f57252a3d_869x450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month 
free)</figcaption></figure></div><p>This article shows you how to evaluate RAG systems. Opik gives you the harness to run those evaluations at scale. Here is how we use it:</p><ul><li><p><strong>Custom LLM judges with rubrics</strong> &#8212; Build the evaluators this article describes: define your criteria, add few-shot examples, and run them across hundreds of traces automatically.</p></li><li><p><strong>Run experiments, compare results</strong> &#8212; Test different prompts, models, or configurations side by side. Opik scores each variant with your evaluators and shows you which one wins.</p></li><li><p><strong>Plug evaluators into production</strong> &#8212; The same LLM judges you design for testing run on live traces too. Set up alarms when scores drop below your threshold so you catch regressions before users do.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully <strong>open-source</strong> and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier):</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><h2>The Only 6 RAG Evaluation Metrics That Can Exist</h2><p>Jason Liu properly articulated the framework I am about to walk you through <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/">[1]</a>. 
Since I wrote the <a href="https://www.amazon.com/LLM-Engineers-Handbook-engineering-production/dp/1836200072/">LLM Engineer&#8217;s Handbook</a> two years ago, I have watched many RAG evaluation tools emerge. They overcomplicate everything with proprietary metric suites.</p><p>Through all of that, I had already internalized that only three variables matter in any RAG system. Testing the combinations between them is the only thing you should actually do. Jason Liu gave a clean, formal articulation to what I had in mind.</p><p>He nailed the structure and deserves the recognition for that.</p><p>Every RAG system has three variables. We define <code>Q</code> as the user&#8217;s question, <code>C</code> as the retrieved context, and <code>A</code> as the generated answer. We use the notation <code>X|Y</code> to mean the quality of <code>X</code> given <code>Y</code>.</p><p>There are <strong>exactly six relationships</strong> between these variables, one for each ordered pair:</p><ol><li><p><code>C|Q</code> (Context Relevance) asks if the retrieved context addresses the question. This measures your retriever, because if it pulls irrelevant passages, the generator cannot fix the issue.</p></li><li><p><code>A|C</code> (Faithfulness) checks if the answer sticks to what is in the context. This measures your generator to see if the model hallucinated or stayed grounded in the documents.</p></li><li><p><code>A|Q</code> (Answer Relevance) verifies if the response actually addresses the question. This is the end-to-end user experience metric. Even if the context is good and the reply is faithful, it must help the person asking.</p></li><li><p><code>C|A</code> (Context Support) checks whether the retrieved text contains everything needed to support every claim in the answer. This tells you if the provided information was sufficient.</p></li><li><p><code>Q|C</code> (Question Answerability) evaluates if the question can even be resolved with this context. 
This determines whether the system should attempt to reply at all.</p></li><li><p><code>Q|A</code> (Self-Containment) asks if someone can infer the original question from reading the answer alone. This measures whether the output provides enough background to stand on its own.</p></li></ol><p>This framework is exhaustive. Three components produce exactly six conditional relationships. There are no hidden factors.</p><p>Therefore, when your RAG system fails, one of these six metrics is broken.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j-9F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j-9F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!j-9F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses.&quot;,&quot;title&quot;:&quot;The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses." title="The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses." 
srcset="https://substackcdn.com/image/fetch/$s_!j-9F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!j-9F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8214a72c-1511-4a4d-9806-7914a70aba3d_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><em>Image 2: The complete grid of six RAG relationships &#8212; each mapped to the component it diagnoses (Retriever, Generator, or End-to-End).</em></figcaption></figure></div><p>Not all six relationships matter equally in every context. We organize them into three tiers. Let us start with retrieval metrics as the prerequisite foundation.</p><h2>Tier 1: If Retrieval Is Broken, Nothing Else Matters</h2><p>RAG is first and foremost a retrieval problem. If the search mechanism does not retrieve the right documents, nothing downstream can save you. The generator will either hallucinate or produce irrelevant answers based on whatever junk it received.</p><p>Before evaluating any of the six RAG relationships, you need to know if your retriever even works. You can use classic information retrieval metrics that measure how well you find relevant documents before generation starts. 
They are fast to compute and do not require LLM judges.</p><p>Hence, these measurements give quick feedback for tuning your retriever.</p><p>You must establish ground-truth labels to compute these metrics. For each query, you must know which chunks are actually relevant. You can build this dataset using the reverse workflow presented in depth in <a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Article 3</a>.</p><p>As a quick recap, you start from your knowledge base of document chunks. Then, based on a set of closely related chunks, you generate realistic questions that can only be answered using that unique set of chunks.</p><p>Because each question derives from the source material, you know exactly which chunks should be retrieved. This gives you a perfectly aligned ground-truth triplet: (question, answer, context). Thus, it becomes straightforward to check whether your retriever actually surfaces the right information.</p><p>There are four main metrics. <code>Precision@K</code> measures the fraction of the top K retrieved chunks that are actually relevant. If your retriever returns 5 chunks but only 2 are useful, your precision is 40%. <code>Recall@K</code> asks: of all the relevant chunks that exist in your entire corpus, what fraction did your retriever actually find in the top K? If the database has 4 chunks that could answer the question but you only retrieved 2 of them, your recall is 50%.</p><p>In addition, Mean Average Precision (<code>MAP@K</code>) rewards retrievers that consistently rank relevant chunks early. For each query, it computes Average Precision: the precision at every position where a relevant item appears, averaged over those positions. 
Here is a step-by-step example where the truly relevant items are A and C:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aVBN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aVBN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 424w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 848w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1272w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aVBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png" width="1456" height="315" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;table&quot;,&quot;title&quot;:&quot;table&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="table" title="table" srcset="https://substackcdn.com/image/fetch/$s_!aVBN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 424w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 848w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1272w, https://substackcdn.com/image/fetch/$s_!aVBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d2e729b-b697-4ddb-b112-d86baa24b08e_1920x415.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Average Precision for this query = (1/1 + 2/3) / 2 &#8776; 0.83. We only average the precision values at positions where a relevant item appeared (ranks 1 and 3). 
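</p><p>To make these definitions concrete, here is a minimal Python sketch of <code>Precision@K</code>, <code>Recall@K</code>, and per-query Average Precision. It is an illustration under a few assumptions, not a reference implementation: chunks are identified by unique string IDs, the function names are hypothetical, and the chunk <code>B</code> at rank 2 is a placeholder I picked so that A and C land at ranks 1 and 3 as in the example. Average Precision follows the convention used in this article (averaging over the ranks where a relevant chunk appears); some libraries divide by the total number of relevant chunks instead, which gives a lower score when relevant chunks are missed.</p><pre><code class="language-python">
```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk IDs that show up in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / len(relevant)


def average_precision_at_k(retrieved, relevant, k):
    """Average of precision@rank over the ranks that hold a relevant chunk."""
    hits = 0
    precisions = []
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical ranking: relevant chunks A and C land at ranks 1 and 3.
retrieved = ["A", "B", "C"]
relevant = {"A", "C"}

print(round(precision_at_k(retrieved, relevant, k=3), 2))          # 0.67
print(round(recall_at_k(retrieved, relevant, k=3), 2))             # 1.0
print(round(average_precision_at_k(retrieved, relevant, k=3), 2))  # 0.83
```
</code></pre><p>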
<code>MAP@K</code> then takes this score and averages it across all your queries.</p><p>Finally, Mean Reciprocal Rank (<code>MRR@K</code>) focuses on the position of the first relevant match. If the first relevant chunk appears at position 3, the reciprocal rank is 1/3; if it appears at position 1, it is 1/1. Higher is better.</p><p>Use these for daily development. These indicators are great for tuning embeddings and chunk sizes, while also being perfect for A/B testing retrieval strategies. No LLM is needed, making the process cheap and fast.</p><p>These numbers tell you if the search phase works, as illustrated in Image 3. The six RAG relationships tell you if the whole system functions properly, meaning you need both.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EMDz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EMDz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EMDz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EMDz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Retrieval metrics applied to a financial assistant query.&quot;,&quot;title&quot;:&quot;Retrieval metrics applied to a financial assistant query.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Retrieval metrics applied to a financial assistant query." title="Retrieval metrics applied to a financial assistant query." 
srcset="https://substackcdn.com/image/fetch/$s_!EMDz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!EMDz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32e356fc-efd5-420e-a2d4-5025295e9996_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><em>Image 3: Retrieval metrics applied to a financial assistant query &#8212; checking whether the retriever surfaces the right chunks.</em></figcaption></figure></div><p>With the retrieval confirmed working, you can evaluate the generation step. Let us look at the three core RAG relationships that every system needs.</p><h2>Tier 2: The Three RAG Metrics You Always Need</h2><p>These three metrics directly assess how well your RAG system functions. Most evaluation frameworks prioritize these specific measurements. They map to the three most critical of the six relationships.</p><p>First, we have <strong>Context Relevance</strong> (<code>C|Q</code>). This checks if the retrieved text actually addresses the prompt&#8217;s information needs. 
Like the Tier 1 metrics, it measures your retriever, but it looks only at the relationship between the context and the question, so it needs no ground-truth labels.</p><p>Suppose a user asks about their dividend payouts from Q4. The check passes when the retrieved data contains the user&#8217;s dividend payment records from Q4. It fails when the system returns general information about how dividends work and their tax implications.</p><p>This represents the most common RAG failure mode. In our financial assistant, this often happened when the search tool pulled educational content instead of actual account data.</p><p>Second, we have <strong>Faithfulness</strong> (<code>A|C</code>). This asks if the reply restricts itself to claims that can be verified from the provided text. Hence, it measures whether your generator hallucinates.</p><p>In our use case, suppose the source contains a CRM record showing a client meeting scheduled for portfolio rebalancing. If the response states exactly that, it passes. If the model adds hallucinated agenda items such as tax-loss harvesting strategies, it fails.</p><p>Third, we have <strong>Answer Relevance</strong> (<code>A|Q</code>). This checks if the output directly addresses the user&#8217;s specific question. This serves as the end-to-end user experience metric.</p><p>Suppose a person asks how much their investments grew last month. A good reply provides the specific percentage change and absolute dollar amount. 
A bad scenario is when the text discusses general market performance without mentioning the actual account.</p><p>We measure all three metrics using LLM judges as designed in <a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">Article 4</a> and validated in <a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">Article 5</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gNmS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gNmS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gNmS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png" width="1200" 
height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three core RAG metrics illustrated with financial assistant examples.&quot;,&quot;title&quot;:&quot;The three core RAG metrics illustrated with financial assistant examples.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three core RAG metrics illustrated with financial assistant examples." title="The three core RAG metrics illustrated with financial assistant examples." srcset="https://substackcdn.com/image/fetch/$s_!gNmS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!gNmS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d9fabd-661e-47ec-a8fc-1afbdcead725_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption 
class="image-caption"><em>Image 4: The three core RAG metrics illustrated with financial assistant examples &#8212; each measures a critical relationship between Question, Context, and Answer.</em></figcaption></figure></div><p>These three metrics cover the most common failure modes. For specific domains and failure cases, we have to dig deeper into the remaining three metrics.</p><h2>Tier 3: When the Core Metrics Can&#8217;t Explain the Failure</h2><p>The last three metrics provide deeper diagnostic insights, usually required in sensitive domains or use cases.</p><p>First, we have <strong>Context Support</strong> (<code>C|A</code>). This checks if the retrieved context contains all the information needed to fully back every claim in the response. While this sounds similar to Faithfulness (<code>A|C</code>), the direction is different. Faithfulness asks: <em>&#8220;did the answer deviate from the context?&#8221;</em>, where you look at the answer and check if it introduced claims that aren&#8217;t in the context. Context Support asks: <em>&#8220;was the context sufficient to support the answer?&#8221;</em>, where you look at the context and check if it actually contains everything the answer needs.</p><p>Here is a concrete example. Suppose the answer says your total Q4 dividend income was $2,340 across 5 holdings, with the largest payout from MSFT at $890. Now look at the context: it only contains the total dividend amount of $2,340. The per-holding breakdown is nowhere in the retrieved documents. The context was insufficient: it had the total but not the details. The LLM produced a plausible breakdown, but the context could not support it. This is low Context Support.</p><p>Second, we have <strong>Question Answerability</strong> (<code>Q|C</code>). 
This asks if the user's question can even be resolved with the given information.</p><p>Suppose the user asks about crypto portfolio performance, but the retrieved documents only contain equity data. This makes the request unanswerable. The system should refuse rather than guess. This metric is important when you want to validate that your agent answers with &#8220;I don&#8217;t know&#8221; instead of confidently hallucinating an answer due to insufficient context.</p><p>In our financial assistant, this was important because some queries can only be resolved if the agent has permissions to access the right external tool first.</p><p>Third, we have <strong>Self-Containment</strong> (<code>Q|A</code>). This checks if someone can infer the original prompt from the reply alone.</p><p>A response stating your portfolio&#8217;s return is 12.4% stands alone. A reply stating just 12.4% does not. Prioritize this metric when outputs are forwarded via email, logged in CRM notes, or read without the original conversation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TV-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TV-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!TV-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TV-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Faithfulness vs Context Support &#8212; two types of hallucination detection.&quot;,&quot;title&quot;:&quot;Faithfulness vs Context Support &#8212; two types of hallucination detection.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Faithfulness vs Context Support &#8212; two types of hallucination detection." title="Faithfulness vs Context Support &#8212; two types of hallucination detection." 
srcset="https://substackcdn.com/image/fetch/$s_!TV-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6c62c8-1af1-4dce-b455-48ed7fea0797_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><em>Image 5: Faithfulness catches obvious hallucinations where the answer deviates from context. Context Support catches the subtler case where the context was insufficient, and the LLM silently filled the gaps.</em></figcaption></figure></div><p>You now know what to measure at each tier. Two questions remain. How often should you run each one? Which metrics deserve the most attention for your specific domain?</p><h2>Matching Frequency and Strictness to Your Domain</h2><p>Each tier maps to a different running frequency depending on how fast and cheap you can run the evaluations. It also depends on their overall impact on the system.</p><p><strong>Start with Tier 1</strong> on a daily basis. Implement fast retrieval metrics for everyday development and to tune your retrieval component. These are the cheapest to execute as they do not require LLM judges.</p><p>Furthermore, they provide quick feedback cycles. 
Use them for the improvement flywheel with synthetic data from day zero, focusing on these basic indicators before moving to more complex approaches.</p><p><strong>Move to Tier 2</strong> on a weekly basis. Implement the three primary RAG connections. These core metrics directly assess how well your system functions.</p><p>Use LLM-based grading for a more nuanced assessment of these interactions.</p><p><strong>Incorporate Tier 3</strong> on a monthly basis. Introduce advanced metrics when you need deeper insights. Run a full evaluation to identify prompts that the application should not be answering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P9iE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P9iE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!P9iE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The tiered evaluation cadence.&quot;,&quot;title&quot;:&quot;The tiered evaluation cadence.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The tiered evaluation cadence." title="The tiered evaluation cadence." 
srcset="https://substackcdn.com/image/fetch/$s_!P9iE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!P9iE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7db8998-edea-4843-96e6-f2767bb2d9a9_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><em>Image 6: The tiered evaluation cadence &#8212; cheapest and fastest at the top, deepest and most expensive at the bottom.</em></figcaption></figure></div><p>Here, we focused only on RAG-related measurements. However, this actually applies to any type of AI evals layer. You could implement Tier 1 checks in your CI/CD pipeline to execute on each commit.</p><p>You can trigger Tier 2 evaluations manually before merging your code from your feature branch. Finally, manually run Tier 3 metrics before major releases and strategic decisions.</p><p>There is another dimension to consider when choosing metrics for your use case, which is the good old domain.</p><p><strong>Different domains require emphasis on distinct indicators</strong>. What matters most depends on the severity of the use case.</p><p><strong>High-severity domains</strong> include finance, medical, and legal applications. 
In these fields, Faithfulness (<code>A|C</code>) and Context Support (<code>C|A</code>) are non-negotiable because every claim must be traceable. Answerability (<code>Q|C</code>) is also critical, meaning the application must refuse rather than guess.</p><p>Thus, you want precision over recall, which is the exact profile we use for our financial assistant.</p><p><strong>Medium-severity domains</strong> include customer support and technical documents. Answer Relevance (<code>A|Q</code>) leads here, as the output must be helpful and correct. Answerability (<code>Q|C</code>) helps you know when to hand off to a human, and you generally want recall over precision in retrieval.</p><p><strong>Low-severity domains</strong> include research, writing, and content generation, where synthesis and creative reframing are expected. Context Relevance (<code>C|Q</code>) and Answer Relevance (<code>A|Q</code>) are primary, while Faithfulness (<code>A|C</code>) thresholds remain lower. The generator is supposed to add value beyond the raw text.</p><p>Therefore, you want high recall in the search phase to cast a wide net across sources.</p><p>You know what to measure, when, and what to prioritize. None of this works without the right data and infrastructure. Let us explore how to build the evaluation harness.</p><h2>Building the RAG Evaluation Harness</h2><p>RAG evaluation requires inputs, outputs, and the retrieved context. You need the full triplet.</p><p>The most common blind spot involves treating RAG testing like any other LLM assessment. Teams measure the final reply&#8217;s quality, but never capture what background data the generator actually worked with. Without that information, half the metrics in this article are impossible to compute.</p><p>Next, you should ground your RAG dataset in real human judgment. In our financial assistant, we had a domain specialist on the team who manually QA&#8217;d the application from the start. 
They ran queries, checked whether the right data was retrieved, and verified that the replies made sense.</p><p>We translated all of that manual work into our AI evals collection using the error analysis workflow from <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>.</p><p>Also, building RAG datasets introduces a unique difficulty. Each test case needs the right documents, chunks, and embeddings available in the database. Otherwise, the search tool has nothing to work with.</p><p>Running the full ingestion pipeline for every evaluation run is slow and introduces variability.</p><p>We solved this by coupling each data point with a Postgres SQL export containing the relevant documents, chunks, embeddings, and metadata. We loaded this file directly into the storage system for each test, effectively creating a context cache. This made the process fast and reproducible.</p><p>We inject the records, run the query, evaluate the trace, reset the environment, and move to the next item. 
Image 7 illustrates these data preparation paths.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1KvQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1KvQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two paths for building your RAG evaluation dataset.&quot;,&quot;title&quot;:&quot;Two paths for building your RAG evaluation dataset.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two paths for building your RAG evaluation dataset." title="Two paths for building your RAG evaluation dataset." srcset="https://substackcdn.com/image/fetch/$s_!1KvQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!1KvQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f9c048e-214e-481e-b46b-4ca2c6d9a397_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><em>Image 7: Two paths for building your RAG evaluation dataset &#8212; manual expert QA (Article 2) and 
synthetic reverse workflow (Article 3) &#8212; both requiring proper context preparation.</em></figcaption></figure></div><p>If you do not have enough production data or expert QA samples, you can create synthetic RAG evaluation sets. Use the reverse workflow from <a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Article 3</a> by starting from your knowledge base. Use an LLM to extract key facts from specific passages.</p><p>Then, formulate realistic user questions that can only be answered using that exact text block.</p><p>Because the prompt derives directly from the source material, the input, expected retrieval context, and expected reply are perfectly aligned by construction. This gives you a complete ground-truth triplet. Furthermore, this technique is especially powerful for bootstrapping coverage across your entire document corpus.</p><p>Include unanswerable queries in your collection. Do not formulate only prompts that the application should resolve correctly. Also create scenarios where the context deliberately lacks the information needed, forcing the agent to refuse or say it does not know.</p><p>Without these negative examples, your testing suite is one-sided. Your evals will reward always attempting a reply. Adding unanswerable cases directly exercises the Answerability metric from Tier 3.</p><p>Next, if your RAG architecture integrates with external services, the retrieval path is not just a vector database search. Your agent needs to decide which tool to call first. Should it query Postgres, search the CRM, or check the user&#8217;s email?</p><p>The best retrieval metrics will not help if your model invokes the wrong data source entirely.</p><p>In our financial assistant, this was critical. A query about a client meeting should hit the CRM, not the transaction database. 
Therefore, we added code-based checks for tool selection alongside our RAG metrics.</p><p>Another important trick is to run separate graders per RAG dimension. Do not ask one LLM to evaluate context relevance, faithfulness, and answer relevance in a single prompt. Isolated checks with dimension-specific rubrics produce more consistent results than a unified evaluation.</p><p>Ultimately, you need to log specific data for every trace using tools such as <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>. Record retrieved chunks to see what the generator had access to. If faithfulness fails, check whether the reply used information that was not provided. Track metadata such as document IDs and scores, because when context relevance fails, you need to know which items ranked highest. This represents the same observability infrastructure from <a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Article 1</a>.</p><h2>Next Steps</h2><p>RAG evaluation is not complex. It is just three variables and six relationships. When your RAG system fails, one of these specific links is broken.</p><p>Fix that exact issue and ignore the complexity theater.</p><p>Start with Tier 1 retrieval checks as daily prerequisites. Add Tier 2 primary indicators weekly. Extend to Tier 3 when specific failure modes demand it.</p><p>Ultimately, match your evaluation priorities to your domain&#8217;s risk profile.</p><p>Next time you see a vendor dashboard with dozens of RAG metrics, map each one back to the six relationships. If an indicator does not clearly measure one of the core links, it is noise. Drop it and focus on what actually diagnoses failures.</p><p>Next up is the <a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">final piece in the series</a>. We will explore real-world lessons from months of running evals on a production AI companion. 
We will discuss what worked, what failed, and what the team would do differently.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. <strong>Here&#8217;s what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a> </p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a>  </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a>  </p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><em><strong>RAG Evaluation: The Only 6 Metrics You Need</strong> &#8592; You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/rag-evaluation-6-metrics-framework/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/rag-evaluation-6-metrics-framework?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 300+ students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31&amp;utm_source=decodingai&amp;utm_medium=partner&amp;utm_campaign=agent_engineering">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Liu, J. (2025, May 19). There Are Only 6 RAG Evals. jxnl.co. <a href="https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/">https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/</a></p></li><li><p>Grace, M., Hadfield, J., Olivares, R., &amp; De Jonghe, J. (2026, January 09). Demystifying Evals for AI Agents. Anthropic Engineering Blog. 
<a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Why Most RAG Tutorials Fail You]]></title><description><![CDATA[How a senior architect learned RAG from scratch, the production way]]></description><link>https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide</link><guid isPermaLink="false">https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide</guid><dc:creator><![CDATA[Priya]]></dc:creator><pubDate>Thu, 12 Mar 2026 12:02:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_xRY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Paul:</strong> Today, the stage belongs to <a href="https://substack.com/@pmarwa">Priya</a>, a Senior Software Architect who&#8217;s spent years shipping production-scale systems at Publicis Sapient and Tesco.</p><p>She&#8217;s deconstructing RAG with a production-first mindset, skipping the theoretical demos to focus on building for architectural reliability.</p><p>This one is packed. Let&#8217;s get into it &#128064; &#8595;</p><div><hr></div><h2>The &#8220;Deer in the Headlights&#8221; Moment</h2><p>I&#8217;ve navigated many shifts since the early days of the web, from monoliths to cloud-native microservices and SOAP to REST. But the AI wave felt different. I found myself in a &#8220;deer in the headlights&#8221; moment, completely unsure of what to learn or even where to start. 
Should I dive into neural network math, focus on model training, or master context engineering (AI moves quickly)?</p><p>Eventually, the path became clear when I realized my real value lay in applying AI to complex business problems. In an enterprise context, that led me straight to RAG. It isn&#8217;t just about the model; it&#8217;s also about the robust system you build around it. It felt like a return to architecture, a concrete problem to solve where using AI could make a profound difference. However, as I started building, I hit a second roadblock...</p><h2>Why Most RAG Tutorials Didn&#8217;t Help Me Learn RAG</h2><p>Most RAG tutorials optimize for one outcome: getting an answer out of a model as quickly as possible. That&#8217;s fine for demos. It&#8217;s a poor way to learn how RAG systems behave in production.</p><p>I&#8217;m not new to building production software. I&#8217;ve spent decades shipping and maintaining systems where debuggability, operability, and failure modes matter. What&#8217;s new to me here is RAG, not the discipline of building systems that survive contact with reality. While learning RAG, I wanted to internalize the constraints I&#8217;d eventually face anyway: inspectability, idempotent ingestion, debuggable retrieval, and controllable generation. That meant resisting framework-managed chains and owning the control flow early, even if it slowed me down.</p><p>This post documents how I&#8217;m teaching myself RAG by building a production-grade system in deliberate phases, using frameworks as utilities rather than architecture.</p><p>That approach was heavily influenced, and indeed inspired, by Paul Iusztin&#8217;s <em><a href="https://www.decodingai.com/p/my-ai-production-tech-stack">From 100+ AI Tools to 4: My Production Stack</a></em>, especially this idea:</p><p><em>AI frameworks are good utilities. 
They should not dictate the architecture or control flow of your system.</em></p><p>That became my guiding principle.</p><h2>The Architecture</h2><p>Before diving into the details, here is the end-to-end architecture of the RAG system. 
This diagram serves as a reference model, and we&#8217;ll walk through each layer and the production considerations that shaped these choices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_xRY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_xRY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 424w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 848w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1272w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_xRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png" width="1200" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_xRY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 424w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 848w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1272w, https://substackcdn.com/image/fetch/$s_!_xRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d099bbe-2f2c-4c6e-bd90-33f5ea6b75e2_1200x657.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Phase 1. Ingestion: Own the Data</h3><p><strong>What I built:</strong> a pipeline that discovers files &#8594; loads documents &#8594; normalizes text &#8594; chunks &#8594; embeds &#8594; stores everything in Postgres.</p><p>From experience building production systems, ingestion pipelines are where complexity quietly accumulates if they lack idempotence, i.e., the ability to safely re-run without ending up in an inconsistent state, such as duplicate data, partial updates, or stale artifacts. The same applies to traceability, i.e., the ability to trace exactly what happened, to which data, and when. I assumed the same risks would apply here.</p><p>What I didn&#8217;t account for was how the nature of debugging would differ so vastly from what I was used to in the past. It wasn&#8217;t just about emitting log and error information at the right places anymore. 
A bad chunk doesn&#8217;t throw an exception; it quietly surfaces as a hallucinated answer three steps later.</p><h4>Single database, many uses</h4><p>Instead of introducing a separate vector database, I used <strong>Postgres + pgvector</strong>. Chunks, embeddings, and metadata live together. That decision buys me a lot:</p><ul><li><p>I can inspect ingestion results with plain SQL</p></li><li><p>I can join vectors with relational metadata</p></li><li><p>I can reproduce retrieval behavior outside the application</p></li></ul><p>That inspectability matters when you&#8217;re still learning, and having less infrastructure to maintain pays off long after.</p><h4>Frameworks as utilities, not architecture</h4><p>I use LangChain&#8217;s document loaders (<em>TextLoader, PyMuPDFLoader</em>) for format handling. But the control flow is explicit and mine:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">for file_info in discover_files(folder_path):

    raw_docs = load_document(file_info.file_path)

    clean_text = normalize_text(raw_docs)

    chunks = chunk_text(clean_text, chunk_size=512)

    embeddings = await embed_chunks(chunks)

    await save_to_postgres(file_info, chunks, embeddings)</code></pre></div><p>Each step is isolated. Each step can be logged, rerun, or replaced independently. When something breaks, I debug <em>my</em> code, not a framework-managed chain. For instance, during my initial tests, I used PyPDFLoader for the document loading phase. When I inspected the chunking, I realized the chunks had incorrect spaces due to kerning (e.g., &#8220;P r e - C h u n k&#8221;). This was easy to address just by swapping PyPDFLoader for PyMuPDFLoader, which handled the complex layouts better.</p><h4>Idempotence and safe re-runs</h4><p>I mentioned earlier that pipelines break down when they lack idempotence. Here&#8217;s how I addressed it.</p><p>Every file&#8217;s contents are hashed. If the content hash matches what&#8217;s already stored, the file is skipped: no wasted compute, no risk. If the content has changed, its old chunks and embeddings are completely removed before the new ones are written. The database never ends up with a mix of old and new states for the same source.</p><p>During development, this makes experimentation safe. For instance, I can tweak chunk sizes, swap embedding models, or change preprocessing logic, then re-run the entire pipeline and trust the result. Without this, every experiment would mean manually cleaning up the database first, or worse, not realizing stale data was still there, silently affecting retrieval quality. More importantly, though, in production, it makes the pipeline resilient to failure. If ingestion crashes halfway through, I can simply restart it. Files already processed are skipped, and the run picks up where it left off. No manual cleanup, no risk of corruption.</p><h3>Phase 2. 
Retrieval: Make Failure Visible</h3><p>Retrieval is where the quality of your results is determined, which makes debugging discipline more important than clever algorithms.</p><p><strong>What I built:</strong> query preprocessing &#8594; embedding &#8594; similarity search &#8594; optional reranking.</p><p>Most LangChain tutorials show you how to build a RAG pipeline as a &#8220;chain,&#8221; i.e., a single call where the framework retrieves context, sends it to the LLM, and returns the answer. I chose not to do that. Consistent with the architecture philosophy above, retrieval is an explicit phase, and every step in the retrieval pipeline is an explicit function call I control and invoke directly:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">async def retrieve(query: str, top_k: int = 5, rerank: bool = False):

    processed_query = preprocess_query(query)

    query_embedding = embed_query(processed_query)

    results = await search_similar_chunks(query_embedding, top_k)

    if rerank:

        results = rerank_results(query, results, top_k)

    return RetrievalResponse(query=query, results=results)</code></pre></div><p>Keeping retrieval explicit makes failures legible. When an answer is wrong, I can tell whether the issue came from:</p><ul><li><p>query preprocessing</p></li><li><p>embedding quality</p></li><li><p>recall</p></li><li><p>ranking</p></li></ul><p>Because vectors live in Postgres, I can reproduce retrieval behavior with SQL alone.</p><p>That inspectability is invaluable when learning.</p><h4>Retrieval &#8594; Generation boundary</h4><p>This is the boundary where many RAG systems start to blur failure modes. But retrieval and generation are fundamentally different problems.</p><p>Retrieval, including reranking, decides <strong>what context is allowed to reach the model</strong>. It is a search problem. It fails by missing relevant information (poor recall) or burying it in noise (poor precision).</p><p>Generation decides <strong>what the model does with the provided context</strong>. It is a reasoning problem. It fails by misinterpreting the context, hallucinating facts, or ignoring instructions.</p><p>Keeping this boundary explicit helps you immediately diagnose which problem you actually have. If the answer is wrong but the context contains the truth, you fix the prompt. 
If the context is missing the truth, you fix the search.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kNfZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kNfZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 424w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 848w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1272w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png" width="1032" height="193" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:193,&quot;width&quot;:1032,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kNfZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 424w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 848w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1272w, https://substackcdn.com/image/fetch/$s_!kNfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65e695d8-f6a4-4909-b9ad-87314bc50dd5_1032x193.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Phase 3. Generation: Treat the LLM as an Unreliable Dependency</h3><p><strong>What I built:</strong> context formatting &#8594; LLM invocation with retries &#8594; response assembly.</p><p>LLMs fail in ways traditional dependencies don&#8217;t. They are non-deterministic, occasionally unavailable, and can return plausible but wrong outputs. 
I treated the model as an unreliable dependency from day one, something to isolate, observe, and swap, not something to trust implicitly.</p><h4>Swappable LLMs via a factory</h4><p>A simple factory pattern makes experimentation cheap:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def get_llm():

    if provider == &quot;openai&quot;:

        return OpenAIChat(...)

    if provider == &quot;gemini&quot;:

        return GeminiChat(...)</code></pre></div><p>Switching providers requires only configuration changes. Call sites don&#8217;t care. This is exactly where frameworks like LangChain shine: as an abstraction layer. They handle the messy API differences between providers so that OpenAIChat and GeminiChat can expose the same interface to your application. Using them here makes swapping models trivial, without letting them dictate your control flow.</p><h4>Explicit orchestration over chains</h4><p>Generation is intentionally step-by-step:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">async def generate_answer(request):

    retrieval_response = await retrieve(query=request.query, ...)

    context_text = format_docs(retrieval_response)

    messages = get_rag_prompt().format_messages(

        context=context_text,

        question=request.query,

    )

    llm = get_llm()

    ai_message = await _invoke_llm_with_retry(llm, messages)

    return GenerateResponse(answer=ai_message.content, ...)</code></pre></div><p>I avoided using LangChain&#8217;s expression language (LCEL) or runnable abstractions to build this flow. While powerful, they can hide what&#8217;s happening. Explicit orchestration is easier to debug, instrument, and reason about, especially while learning. This resonated with me even more since I&#8217;m used to a hands-on approach where I can write code and truly understand how the logic flows.</p><h4>Retries are operational, not semantic</h4><p>LLM calls fail for mundane reasons: transient network issues, provider-side throttling, or brief outages. I treat those as operational failures, not model behavior, and handle them explicitly.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from tenacity import retry, stop_after_attempt, wait_exponential

@retry(

    stop=stop_after_attempt(3),

    wait=wait_exponential(multiplier=1, min=1, max=10),

)

async def _invoke_llm_with_retry(llm, messages):

    return await llm.ainvoke(messages)</code></pre></div><p>Retries don&#8217;t make the model <em>correct</em>; they make the system resilient.</p><h3>Phase 4. Serving: Thin Adapters, Shared Core</h3><p><strong>What I built:</strong> two interfaces over the same RAG core: a REST API and an MCP server.</p><p>In many RAG implementations, the retrieval logic is tightly coupled to the web framework (e.g., defined inside a FastAPI route). This makes it hard to test the logic in isolation or reuse it in different contexts (like a CLI or an evaluation script).</p><p>Instead, I treated my RAG system as a standalone library. The core function &#8216;<em>generate_answer</em>&#8217; takes a pure Pydantic object and returns one. It knows nothing about HTTP, headers, or JSON.</p><p>This architecture allowed me to implement serving with a <strong>thin adapter pattern</strong>.</p><h4>Adapter 1: REST API (FastAPI)</h4><p>The REST adapter serves traditional software systems that need deterministic access to the retrieval layer. This includes web applications, backend services, internal tooling, evaluation pipelines, and automation jobs. These are environments where the caller decides exactly when and how the capability should be invoked.</p><p>The web layer itself does no <em>extra</em> work. It merely deserializes JSON, calls the core, and serializes the result.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@router.post(&quot;&quot;, response_model=GenerateResponse)

async def query(request: GenerateRequest) -&gt; GenerateResponse:

    return await generate_answer(request)</code></pre></div><h4>Adapter 2: MCP Server (Capability Interface for Tool-Using LLMs)</h4><p>Exposing the same core through the Model Context Protocol (MCP) transforms my RAG pipeline from an application-bound feature into a standardized capability.</p><p><strong>MCP standardizes how capabilities are exposed to tool-using LLMs</strong>, regardless of whether the caller is a chat assistant, a coding copilot, or an autonomous agent.</p><p>I&#8217;m used to abstraction via careful refactoring, and it didn&#8217;t take long to understand that MCP was just another way of achieving this in the context of AI.</p><p>MCP-compatible clients such as Claude Desktop, Cowork, or Cursor can connect to the server and invoke the <em>query_rag</em> tool directly. This allows the underlying LLM to ground its responses in private data without requiring custom integrations, plugins, or connector logic.</p><p>Direct tool access is useful, but the MCP interface becomes far more valuable as agents adopt <a href="https://agentskills.io/home">skills</a> to carry out knowledge work and other multi-step tasks. For example, a &#8220;Market Research Skill&#8221; might combine web search, financial data lookup, and document retrieval. Exposed as an MCP tool, my RAG system becomes a standardized building block that these skills can include in their workflows without custom code.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@mcp.tool()

async def query_rag(query: str, top_k: int = 5, rerank: bool = True) -&gt; dict:

    request = GenerateRequest(query=query, top_k=top_k, rerank=rerank)

    response = await generate_answer(request)

    return response.model_dump()</code></pre></div><p>Both interfaces share the same core logic, thus avoiding duplication. Serving is an adapter problem, not a RAG problem.</p><h4>Data lineage &amp; traceability</h4><p>Traceability isn&#8217;t new. Long before LLMs, production systems relied on lineage and identifiers to make failures debuggable. LLM non-determinism makes that discipline more important, not less.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KxBN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KxBN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 424w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 848w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1272w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KxBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png" 
width="367" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:367,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KxBN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 424w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 848w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1272w, https://substackcdn.com/image/fetch/$s_!KxBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441edf26-9f7f-469f-889d-c0b16ff2451b_367x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Debugging RAG systems almost always means reasoning backward, from an answer, to retrieved chunks, to embeddings, and finally to source files.</p><p>In practice, this meant persisting identifiers at every step. Retrieved results carry chunk IDs forward. Generation logs include the IDs of the chunks used as context. When an answer looks wrong, I can trace it deterministically back to its source.</p><p>Without lineage, every bad answer looks like a model problem. With it, failures become diagnosable and fixable.</p><h4>Vendor-neutral observability</h4><p>This isn&#8217;t RAG specific. It&#8217;s the same observability discipline I&#8217;ve applied in other production systems. 
I deliberately kept it vendor-neutral, following a pattern I&#8217;ve used before to keep core logic decoupled from tooling.</p><p>Beyond tracing execution paths, tools like <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> let me reason about operational realities: latency per phase, token usage, and cost per request. Being able to see which model was invoked, how many tokens were consumed, and where time was spent turns performance and cost from assumptions into measurable signals.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def track(name: str = None, phase: Phase = None):

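    def decorator(func):
        # Tag the trace with its pipeline phase, if one was given.
        tags = [f&quot;phase:{phase.value}&quot;] if phase else []

        # Delegate tracing to Opik; callers only depend on this wrapper.
        @opik.track(name=name, tags=tags)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        return wrapper
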
    return decorator</code></pre></div><p>If I ever switch observability tools, business code doesn&#8217;t change.</p><h2>What I&#8217;m Exploring Next</h2><p>Next steps include:</p><ol><li><p>Adding durable workflow orchestration (DBOS or Prefect)</p></li><li><p>Implementing systematic evaluation for retrieval quality and faithfulness</p></li><li><p>Exploring more advanced retrieval patterns</p></li></ol><p>Each will be added deliberately, one constraint at a time.</p><h2>Closing Thoughts</h2><p>Moving from keyword search to semantic and multimodal understanding is a massive leap in how we solve problems. While this technology introduces an ambiguity that contrasts with the deterministic systems I&#8217;ve built before, the incredible advantages and sheer problem-solving power it offers make the challenge truly exciting.</p><p>Building RAG this way slowed me down, deliberately.</p><p>What I have now is a system I can inspect, rerun, and reason about when something goes wrong. For me, that&#8217;s a better foundation than a faster demo.</p><p>I&#8217;m still learning RAG. But I&#8217;m learning it with the same instincts that shaped the rest of my career: make systems observable, design for failure, and own the control flow before adding abstraction.</p><p><strong>Code:</strong> <a href="https://github.com/CalvHobbes/rag-101">https://github.com/CalvHobbes/rag-101</a></p><p><strong>Inspired by:</strong> <em><a href="https://www.decodingai.com/p/my-ai-production-tech-stack">From 100+ AI Tools to 4: My Production Stack</a></em> by <a href="https://substack.com/@pauliusztin">Paul Iusztin</a></p><p>See you next time.</p><p><a href="https://substack.com/@pmarwa">Priya</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/production-rag-from-scratch-senior-architect-guide?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Our LLM Judge Passed Everything. 
It Was Wrong.]]></title><description><![CDATA[Align your evaluator with human judgment, or don't trust it at all.]]></description><link>https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge</link><guid isPermaLink="false">https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 10 Mar 2026 12:01:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1am-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a><strong> </strong></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a> </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals </a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><strong>How to Evaluate the Evaluator</strong>  &#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h2>How to Evaluate the Evaluator</h2><p>Your evaluators are running. They produce Pass or Fail verdicts on your agent&#8217;s outputs. But one open question remains: how do you know if those verdicts are correct?</p><p>While building Brown, a writer agent I developed with the Towards AI team for our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>, I set up an LLM judge to verify generated articles. I wanted to check the expected structure, idea flow, and content against a golden dataset. 
I ran it on a batch of traces, and the scores seemed reasonable. Then I manually compared the traces against the judge&#8217;s verdicts, only to realize it was fixating on the wrong things.</p><p>It scored 0 when an article used bullet points instead of H3 headers, which was perfectly acceptable for that section. It scored 0 when the agent used a different transition phrase than the few-shot examples, penalizing creativity when we wanted flexibility. Worse, it scored 1 when paragraphs did not flow smoothly into each other, completely overlooking a real quality issue we cared about.</p><p>We had to iterate on the judge until it reflected what we actually valued. Anthropic reports a similar pattern, seeing eval scores jump from 42% to 95% after fixing grading bugs and ambiguous task specifications <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>. The agent was fine all along, but the evaluator was broken. That experience crystallized something for me: <strong>eval metrics you cannot trust are worse than no metrics at all.</strong></p><p>Unvalidated evals create false confidence. You see green dashboards, assume quality is fine, and stop looking. You push broken outputs because the numbers said they were good, and you hear about problems from frustrated users instead of your test suite. Worst of all, you cannot tell which verdicts are wrong: even a 10-20% share of incorrect signals hides silently and contaminates every decision built on those scores.</p><p>Your evaluator is another AI model that makes binary predictions, so it needs a test set, metrics, and mapped failure modes like any other model.</p><p>LLM judges are also non-deterministic: they can hallucinate, carry biases, and drift over time. Alignment with human evaluators varies widely by task, with some teams achieving high agreement after careful iteration, while others struggle to break 70% on subjective criteria. 
The gap between your judge and reality could mean hundreds of bad signals across a thousand evaluations, which you will not know without validation <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1am-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The evaluator validation workflow&quot;,&quot;title&quot;:&quot;The evaluator validation workflow&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The evaluator validation workflow" title="The evaluator validation workflow" srcset="https://substackcdn.com/image/fetch/$s_!1am-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!1am-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!1am-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F556c0406-06a3-4c94-8bd7-8755d756cf38_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 1: The evaluator validation workflow: Comparing judge verdicts against expert labels with classification metrics.</em></figcaption></figure></div><p>Here is what you will learn to solve this problem:</p><ul><li><p>Partitioning your labeled data to prevent data leakage.</p></li><li><p>Quantifying agreement using standard classification metrics.</p></li><li><p>Systematically closing the gap between your judge and domain experts.</p></li><li><p>Dealing with the randomness of LLMs.</p></li></ul><p>To start this process, we first need to structure our dataset correctly.</p><p><em>But before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Automated 
Agent Optimization Using Your Data (Sponsored)</a></h2><p>This AI Evals &amp; Observability series is brought to you by <strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong>, the LLMOps open-source platform used by Uber, Netflix, Etsy, and more.</p><p>We use Opik daily across our courses and AI products. Not just for observability, but now to <strong>automatically optimize our agents&#8217; prompts</strong> using the same datasets and metrics we already have in the platform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ecvh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ecvh!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 424w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 848w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1272w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif" width="800" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;optimization_studio_walkthrough.mp4 [video-to-gif output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="optimization_studio_walkthrough.mp4 [video-to-gif output image]" title="optimization_studio_walkthrough.mp4 [video-to-gif output image]" srcset="https://substackcdn.com/image/fetch/$s_!Ecvh!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 424w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 848w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1272w, https://substackcdn.com/image/fetch/$s_!Ecvh!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c8d01a-ea3a-4074-af28-c6a13d28f1d7_800x450.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You are learning how to build diverse synthetic datasets to evaluate your AI app. But once you have those datasets and metrics, why stop at measuring quality?<strong> <a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik&#8217;s agent optimizer</a></strong> closes the loop. It uses <strong>your</strong> <strong>eval dataset to automatically improve your prompts</strong>. 
Here is why we love it:</p><ul><li><p><strong>Same datasets, zero extra setup</strong> &#8212; Opik&#8217;s optimizer reuses the exact datasets, metrics, and tracing you already have. <a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Quick start guide</a>.</p></li><li><p><strong>Six optimization algorithms</strong> &#8212; Choose from strategies like HRPO (our favorite), which performs root-cause analysis on failures and proposes targeted fixes, or evolutionary optimization to explore diverse prompt structures. <a href="https://www.comet.com/docs/opik/agent_optimization/algorithms/overview?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">See all algorithms.</a></p></li><li><p><strong>No-code Optimization Studio</strong> &#8212; For quick iterations, run optimization directly from the <a href="https://www.comet.com/docs/opik/agent_optimization/optimization_studio?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Optimization Studio UI</a>. Start from your prompt, pick your dataset, choose an algorithm, and watch Opik test prompt variations against your metrics in real time.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully open source and integrates with OpenAI, Anthropic, Gemini, and 100+ providers. 
<em><strong>Start optimizing your agents for free:</strong></em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Automated agent optimization guide&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Automated agent optimization guide</span></a></p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><h2>Structuring Your Data for Validation</h2><p>You already have your ground truth. As explained in <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a> and <a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Article 3</a> of the series, your domain expert labeled each trace as Pass or Fail with a critique. Those labels are the reference standard your automated judge must match. If the task is highly subjective, consider having multiple people label the same examples to discover the agreement ceiling, but for most teams, a single expert is sufficient.</p><p>Now you need to partition that labeled data correctly. Why? Because building and validating on the same examples is like grading your own exam. To get an unbiased estimate, you must measure error only on data the judge has never seen, so split your dataset into three sets: train, dev, and test <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>.</p><p>The train set takes 60% of the data and holds the examples your evaluator learns from. They go into the few-shot prompt, inform the rubric, and set the standard for what Pass and Fail look like. 
The dev set takes 20% of the data, acting as your iteration sandbox. Run the judge here, check where it disagrees with the expert, adjust the prompt, and repeat to refine the system. Finally, the test set takes the remaining 20% and must be kept locked until you are done iterating. You use it only at the end, once the LLM judge is aligned with the expert on the dev set. This gives you an unbiased final score on data that the evaluator has never seen.</p><p>The 60/20/20 split is a good starting point. As your labeled data grows, cap the number of few-shot examples so they do not bloat the judge&#8217;s context window, and move the extra examples into the dev and test splits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c8_b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c8_b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!c8_b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c8_b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Data partitioning for evaluator development&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Data partitioning for evaluator development" title="Data partitioning for evaluator development" srcset="https://substackcdn.com/image/fetch/$s_!c8_b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!c8_b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!c8_b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F321f8ad5-8329-4ed0-bbc9-df6d1db1820f_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: How to partition your labeled data into train, dev, and test sets for evaluator development.</em></figcaption></figure></div><p>In practice, 100 labeled examples mean 60 powering the prompt, 20 for tuning, and 20 for the final honest check. Aim for at least 100 labeled examples to get stable metrics. 
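</p><p>The 60/20/20 partition can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the series: the <code>split_labeled_traces</code> helper, the <code>id</code> and <code>label</code> field names, and the fixed seed are all assumptions, and splitting each label separately is an added choice so that scarce Fail examples land in every split.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import random
from collections import defaultdict


def split_labeled_traces(traces, seed=42):
    """Split expert-labeled traces 60/20/20 into train/dev/test, per label."""
    rng = random.Random(seed)  # fixed seed keeps the locked test set reproducible

    # Group traces by expert verdict so each split keeps the Pass/Fail ratio.
    by_label = defaultdict(list)
    for trace in traces:
        by_label[trace["label"]].append(trace)

    splits = {"train": [], "dev": [], "test": []}
    for group in by_label.values():
        rng.shuffle(group)
        n_train = len(group) * 60 // 100
        n_dev = len(group) * 20 // 100
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]  # remainder, about 20%
    return splits


# Hypothetical data: 100 traces, one in five labeled Fail by the expert.
traces = [{"id": i, "label": "Fail" if i % 5 == 0 else "Pass"} for i in range(100)]
splits = split_labeled_traces(traces)
print({name: len(items) for name, items in splits.items()})</code></pre></div><p>Fixing the seed keeps the test set stable across runs, which matters because it must stay untouched until you finish iterating on the dev set.</p><p>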
Below 50 examples, your numbers become too noisy to act on. Watch out for class imbalance, too. If 90% of your traces are Pass and only 10% are Fail, rebalance the classes, either by generating synthetic examples for the negative (Fail) class or by downsampling the positive (Pass) class.</p><p>With the data properly structured, let us quantify how well your judge actually agrees with the expert.</p><h2>Measuring Alignment With Human Judgment</h2><p>Your judge outputs Pass or Fail for each trace, which means you are building a binary classifier. You are using LLMs instead of other models, but ultimately, it&#8217;s still just a classifier.</p><p>Thus, you need to quantify the performance of the LLM judge against the golden dataset you split in the previous section. Standard classification metrics give you this visibility.</p><p>The <strong>confusion matrix</strong> shows four possible outcomes. True Positive (TP) means both judge and expert say Pass, agreeing the output is good. True Negative (TN) means both say Fail, agreeing the output is bad. False Positive (FP) means the judge says Pass, but the expert says Fail, letting a bad output through. 
False Negative (FN) means the judge says Fail, but the expert says Pass, meaning the judge was overly harsh.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_JaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_JaU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_JaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The confusion matrix for evaluator validation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The confusion matrix for evaluator validation" title="The confusion matrix for evaluator validation" srcset="https://substackcdn.com/image/fetch/$s_!_JaU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!_JaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2a0f6-b059-4a17-9cc3-2ba9b59c1611_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: The four outcomes when comparing judge verdicts against expert labels.</em></figcaption></figure></div><p>Combining TP, TN, FP, and FN yields <strong>three fundamental metrics</strong>:</p><ol><li><p><strong>Accuracy</strong> is the overall agreement rate, calculated as <code>(TP + TN) / total</code>. If the judge matches the expert on 170 out of 200 traces, that is 85% accuracy. This is useful when Pass and Fail are roughly balanced, but it is highly misleading when they are not.</p></li><li><p><strong>Precision</strong> measures how trustworthy the Pass verdicts are, representing the fraction of judge-approved traces that the expert also labeled Pass. You calculate it as <code>TP / (TP + FP)</code>. 
If the judge approves 50 articles and the expert disagrees on 8, precision is <code>42 / 50 = 84%</code>, meaning that when the judge says the output is good, you can generally believe it.</p></li><li><p><strong>Recall</strong> measures how many of the traces the expert labeled Pass the judge also passes. You calculate it as <code>TP / (TP + FN)</code>. If 60 articles are genuinely good but the judge approves only 48, recall is <code>48 / 60 = 80%</code>, meaning the judge recognizes most quality output but still rejects some of it.</p></li></ol><p>Finally, the <strong>F1 score</strong> aggregates the two into a single balanced number: the harmonic mean of precision and recall, calculated as <code>2 &#215; (Precision &#215; Recall) / (Precision + Recall)</code>. Use it when false positives and false negatives matter equally. The right F1 target depends on what you are measuring. With Brown, we accepted around 60% for subjective metrics like style, but demanded over 90% for objective ones like article structure. As a general rule, aim for an F1 above 70%.</p><p>These metrics seem simple enough. But there is a common trap most teams fall into when their datasets are not balanced.</p><h2>When High Scores Hide Real Failures</h2><p>Two examples make the trap concrete.</p><p>First, let&#8217;s assume Brown generates 80 articles, of which 70 are correct and 10 are broken. Your judge labels every single one as Pass. Accuracy sits at <code>70 / 80 = 87.5%</code>, which looks reasonable, but the judge never caught a single failure, making it completely useless.</p><p>Now let us look at a second example in more depth. Out of 80 generated articles, 60 are genuinely well-structured, while 20 have real problems like missing sections or disconnected paragraphs. The judge correctly approves 55 of the good ones and wrongly rejects 5. Of the 20 broken articles, it catches only 4 and lets 16 slip through. 
That gives us TP=55, FN=5, FP=16, TN=4.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QBBm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QBBm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QBBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Accuracy vs precision and recall breakdown&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Accuracy vs precision and recall breakdown" title="Accuracy vs precision and recall breakdown" srcset="https://substackcdn.com/image/fetch/$s_!QBBm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!QBBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eb6277e-3fb0-439b-b559-8d3e7538e33f_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: Decent overall accuracy can disguise a judge who barely detects real failures.</em></figcaption></figure></div><p>Overall accuracy reads <code>(55 + 4) / 80 = 73.75%</code>, which looks reasonable. But Fail-class recall is just <code>TN / (FP + TN) = 4 / (16 + 4) = 20%</code>, meaning the judge misses 80% of structural failures. The lesson here is to always check precision and recall on the minority class. 
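</p><p>The arithmetic above is easy to script. Here is a minimal Python sketch, not taken from Brown&#8217;s codebase, that recomputes the worked example. The <code>judge_metrics</code> and <code>safe_div</code> helpers are hypothetical names introduced only for illustration, and Pass is assumed to be the positive class:</p><pre><code>def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def judge_metrics(tp, fn, fp, tn):
    """Agreement metrics from confusion-matrix counts (Pass is the positive class)."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        # Recall on the Fail class: share of expert-labeled Fails the judge caught.
        "fail_recall": safe_div(tn, tn + fp),
    }


# Counts from the second example: TP=55, FN=5, FP=16, TN=4.
metrics = judge_metrics(tp=55, fn=5, fp=16, tn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")</code></pre><p>Under these assumptions, the script should reproduce the 73.75% accuracy and the 20% Fail-class recall from the example. It also shows why a single aggregate can mislead: the Pass-class F1 comes out at roughly 84%, even though the judge misses most failures.</p><p>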
If those numbers are low, enrich your few-shot prompts with more failure examples, focusing particularly on the subtle cases where individual paragraphs look fine but do not connect fluidly <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>.</p><p>Now that you know what to measure and what to watch out for, let us walk through the process of systematically improving your judge.</p><h2>Closing the Gap Between Judge and Expert</h2><p>This is the core workflow for making your judge reliable. Start with 10-20 few-shot examples from the train set to build your initial judge, and run it against the dev set while leaving the test set untouched. Compute precision, recall, and F1, then identify every disagreement where the judge and expert diverge. Expand your few-shot examples by incorporating those disagreements into the prompt when they reveal real patterns, re-run, and re-measure until the dev set alignment hits your target threshold.</p><p>Remember that your few-shot examples translate to input tokens, which translate to extra costs. Thus, ideally, you want to keep your few-shot examples as minimal, yet diverse, as possible, while maximizing performance on your dev and test splits.</p><p>Lock the test set for the final check. 
Only run the judge on the test set after you stop iterating on the dev set, giving you an uncontaminated measurement of real performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xx5O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xx5O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xx5O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The judge refinement cycle&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The judge refinement cycle" title="The judge refinement cycle" srcset="https://substackcdn.com/image/fetch/$s_!xx5O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xx5O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23934a2d-8368-4af9-a544-ee01ce75600c_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: The judge refinement cycle: build, measure, diagnose disagreements, adjust, and repeat until alignment is sufficient.</em></figcaption></figure></div><p>Expect at least 3 rounds of iteration. If you are still far below target after 10 iterations, the task may require human judgment that no prompt can replicate. Start by hand, as manual prompt refinement teaches you where your judge&#8217;s reasoning diverges from the expert&#8217;s. Carefully studying each disagreement is the most informative signal you have, and once your labeled dataset is large and high-quality enough, you can explore automated prompt optimization tools.</p><p>Read the LLM Judge critiques instead of just looking at metrics, as critiques tell you whether the judge was wrong or the expert missed something. 
As highlighted by Anthropic, you shouldn&#8217;t take eval scores at face value until someone digs into the details and reads the critiques of the judge <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>Once your judge passes validation, put it to work for regression testing, optimization, and production monitoring, as explained in <a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Article 1</a>.</p><p><strong>What if the agreement stays low?</strong> If your agreement is still low after 10 rounds, here is what to look out for. Your few-shot examples might be too narrow. As you keep sampling more production traces with your observability platform (e.g., <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>), revisit error analysis, as explained in <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>, to find the specific patterns where the judge fails and add those to the few-shot examples. With Brown, our initial examples were too uniform, and adding subtle structural failures immediately improved alignment.</p><p>The rubric might lack specificity. Asking if the article is well-written invites interpretation, while asking if it contains well-defined paragraphs, transitions, and metaphors leaves less room for ambiguity. Sharpen the criteria.</p><p>Finally, if the task itself is too subjective, consider accepting a lower F1 score. With Brown, style adherence was inherently subjective, so we accepted a lower F1 there while holding structure to &gt;90%. The idea is to adapt your acceptance threshold to the nature of each business metric you are tracking.</p><p>Even with strong agreement, there is one more challenge. Both your judge and your agent introduce randomness into every run. 
Let&#8217;s see how we can fix that.</p><h2>Dealing With Non-Determinism</h2><p>Randomness comes from two directions: the judge produces different scores on the same input, and the agent itself takes different paths on each run. You need to address both to build a stable evaluation pipeline.</p><p>The easiest and most powerful fix is to scale the dataset, as larger datasets smooth out noise. Aim for enough examples in each class that a few misclassifications do not swing your metrics wildly. A good starting point is a minimum of 50 samples per class.</p><p>Another easy win (though not necessarily a cheap one) is to pick the strongest available model, such as the latest versions of Claude Opus or Gemini Pro, because the judge should be at least as capable as the system it evaluates <a href="https://hamel.dev/blog/posts/llm-judge/">[2]</a>. Require reasoning before the verdict by structuring the prompt with Chain of Thought (CoT) so the judge walks through each criterion before delivering Pass or Fail. This step-by-step analysis produces more consistent scores and better alignment with human judgment <a href="https://arize.com/llm-as-a-judge/">[1]</a>.</p><p>Let the judge abstain by giving it an &#8220;Unknown&#8221; option when it lacks enough information to decide, because forcing a binary Pass/Fail on ambiguous cases produces false positives you cannot tell apart from genuine Passes <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>To further stabilize the judge, compute a significance threshold by running the evaluation 3-5 times and measuring the variance between the runs. With Brown, this was essential because writing is subjective, and running the evaluator 5 times told us the real noise threshold. A 3% metric shift across runs was noise, but 10% meant something actually changed. 
Without this, you are chasing random fluctuations.</p><p>On the agent side, treat it as a black box and evaluate the destination, not the route, as agents can reach the same outcome through different strategies. Brown might outline first and then write, or draft everything and then restructure, and both routes can produce a strong article. Score the final output against your quality criteria, not the intermediate steps <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>For the agent, measure reliability across multiple runs using <code>pass@k</code> and <code>pass^k</code>. <code>pass@k</code> tracks whether at least one out of k attempts succeeds, while <code>pass^k</code> tracks whether all k attempts succeed. These two metrics tell opposite stories as k grows: <code>pass@k</code> climbs toward 100% while <code>pass^k</code> drops sharply, revealing how consistent your agent really is. For instance, if each attempt succeeds independently 70% of the time, then at k = 3, <code>pass@k</code> is <code>1 - 0.3^3 = 97.3%</code> while <code>pass^k</code> is only <code>0.7^3 = 34.3%</code> <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TdzU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TdzU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!TdzU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TdzU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;pass@k vs pass^k divergence&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="pass@k vs pass^k divergence" title="pass@k vs pass^k divergence" srcset="https://substackcdn.com/image/fetch/$s_!TdzU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!TdzU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TdzU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3048100-30a5-4ce5-a695-1dad61e26b6b_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 6: pass@k and pass^k tell opposite stories about reliability as the number of trials 
grows.</em></figcaption></figure></div><p>You now have the complete toolkit for evaluating your evaluator.</p><h2>Demo</h2><p>To fully grasp the end-to-end workflow for building AI Evals, I recommend rewatching our demo using&nbsp;<a href="https://aligneval.com/">AlignEval</a>, an open-source tool created by Eugene Yan. It provides a streamlined interface for the exact workflow this article teaches: look at your data, label it, evaluate outputs, and optimize your evaluators:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;90b981d0-fb1c-4c33-82f2-4f1fb476cd02&quot;,&quot;duration&quot;:null}"></div><p>The tool is open source and available at <a href="https://aligneval.com/">aligneval.com</a>, with the source code on GitHub (<a href="https://github.com/eugeneyan/align-app">eugeneyan/align-app</a>). You can try it for free with your own data or use the prompt below to quickly generate a CSV similar to the one from the demo:</p><pre><code><code>I want you to generate a CSV file with the following characteristics:
"""
* The CSV file must include the following columns:
   * id: Unique identifier for each row
   * input: Context used to generate output
   * output: Generated text to be evaluated
   * label: Ground truth (values optional but counts towards XP)
   * explanation: A one-sentence explanation on why we labeled the row as 0 (PASS) or 1 (FAIL)
* &#128680; The label column only accepts binary labels, either 0 or 1.
   * 0: Output PASSES your evaluation
   * 1: Output FAILS your evaluation
"""
that contains 100 rows

The goal of the CSV file is to implement a dataset to build an LLM Judge evaluator. 

We want to create some mock, synthetic data to conceptually show how labeling, evaluating and optimizing the LLM judge would look like, based on this tool: https://aligneval.com/

Let's say that we collected data from a vertical assistant agent specialized in answering work emails and Slack messages. Thus, create 100 scenarios based on these dimensions:
* feature: email/slack
* scenario: executive, manager, colleague, spam email, phishing email, friend (as an exception)
* label: success/failure of properly answering the message

Where the input is a single email or Slack message or an email or Slack thread, but the output will ALWAYS be just the generated reply, whether it's email or Slack.

Make the labels a 50%/50% split between passes and fails.

Also, note that NO REPLY is an expected behavior for SPAM and phishing emails. Also, for non-essential emails or toxic or slack messages.</code></code></pre><p>We used Claude Opus 4.6 within the Claude app to generate it.</p><h2>Next Steps</h2><p>An evaluator only earns trust when it matches expert judgment. The workflow is straightforward: measure where your judge disagrees with the expert, fix those gaps, and confirm on data the judge has never seen. Only when the judge aligns with the expert on the test set can you rely on your eval metrics.</p><p>The error analysis workflow and iterative labeling were only the tip of the iceberg. Now you see the full picture of how to build, validate, and maintain evaluators.</p><p>Next up is a specialized article focused on evaluating Retrieval-Augmented Generation (RAG) systems.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. 
<strong>Here&#8217;s what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a> </p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a>  </p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a>  </p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><strong>How to Evaluate the Evaluator</strong>  &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for 
free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Arize AI. (n.d.). LLM as a Judge: Primer and Pre-Built Evaluators. Arize. <a href="https://arize.com/llm-as-a-judge/">https://arize.com/llm-as-a-judge/</a></p></li><li><p>Husain, H. (n.d.). Using LLM-as-a-Judge for Evaluation. hamel.dev. <a href="https://hamel.dev/blog/posts/llm-judge/">https://hamel.dev/blog/posts/llm-judge/</a></p></li><li><p>Anthropic. (n.d.). Demystifying Evals for AI Agents. Anthropic Engineering Blog. 
<a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Scaling to 120+ AI Agents Without Losing Control]]></title><description><![CDATA[How two-tier orchestration keeps multi-agent systems debuggable]]></description><link>https://www.decodingai.com/p/scaling-120-ai-agents-two-tier-orchestration</link><guid isPermaLink="false">https://www.decodingai.com/p/scaling-120-ai-agents-two-tier-orchestration</guid><dc:creator><![CDATA[Lucian Lature]]></dc:creator><pubDate>Thu, 05 Mar 2026 12:03:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tePE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Paul:</strong> Today, the stage belongs to <a href="https://substack.com/@lucianlature">Lucian Lature</a>, Solutions Architect and Technical Leader with 15+ years of experience spent building and scaling cloud platforms and Node.js products.</p><p>He&#8217;s skipping the textbook definitions today to focus on the architectural trade-offs and real-world logic behind his most recent builds.</p><p>Enough chitchat. Let&#8217;s get into it &#128064; &#8595;</p><div><hr></div><h2>When Single-Agent Systems Fall Apart</h2><p>You know the moment. You built a perfectly capable AI agent that writes code, answers questions, and searches through your docs. It works great. Then you ask it to review code for security issues and synthesize three different research papers. It returns something that&#8217;s half right and half wrong, delivered with full confidence.</p><p>I used to think this was a model problem. 
Better prompts, bigger context window, maybe switch to the latest Sonnet release. Wrong. The problem is architectural, and no amount of prompt engineering fixes it.</p><p>A single agent with 40+ tools, a 2,000-word prompt over five different domains, and retrieval tuned for one job at a time collapses. Context windows get bloated. Tool selection becomes a mess. Quality tanks.</p><p>This happened to me with Screech, a personal agent I built for my side projects. It started simply, basically a smarter search over my notes. Then I kept adding: code generation, documentation, code reviews, security audits, and research synthesis. The single-agent approach worked beautifully until it very suddenly didn&#8217;t.</p><p>The stack is not exotic. It&#8217;s VoltAgent for runtime and workflows, SurrealDB as the &#8220;one DB to store everything&#8221; experiment, and Claude as the default model tier.</p><p>And yes, the agent is named after Screech from the Saved by the Bell TV series. Also, my childhood nickname.</p><p>I didn&#8217;t invent this in a vacuum. <a href="https://github.com/getzep/graphiti">Graphiti</a> shaped how I think about knowledge that changes over time. VoltAgent gave me workflow primitives I didn&#8217;t have to build. Paul Iusztin&#8217;s <a href="https://www.decodingai.com/p/stop-converting-documents-to-text">AI Agents Foundations</a> convinced me to stop forcing PDFs through OCR and treat them as images. <a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a> showed me that auditable agent decisions are a performance win, not only a governance check.</p><p>So, here&#8217;s the architecture, decisions, and stuff I&#8217;d do differently. 
For legal reasons, that&#8217;s &#8220;informational only,&#8221; not &#8220;you should do this.&#8221;</p><p><em>Before we continue, a quick word from the Decoding AI team.</em> &#8595;</p><div><hr></div><h2><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Go Deeper: Your Path to Agentic AI for Production</a></h2><p>The <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>, built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>, walks you through building exactly this kind of multi-agent architecture.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!59a6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!59a6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!59a6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!59a6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a5bb56-1fed-426d-8284-cb8bf74b8599_1200x1200.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"> <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">What you will build during the course</a>: Nova, the deep research agent, and Brown, the writing workflow, connected into a multi-agent system.</figcaption></figure></div><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this&#8221;</em> and <em>&#8220;an excellent bridge from experimental LLM projects to real-world AI engineering.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Start learning today&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Start learning today</span></a></p><div><hr></div><p>&#8595; <em>Now, back to the article.</em></p><h2>If You&#8217;re Not Building Agents</h2><p>You can stop here and still get the gist. The problem: one AI agent that should do everything (search your notes, write code, review for security, summarize research) does all of it poorly. One large prompt and many tools. It confuses tasks, wastes tokens, and returns confident nonsense when goals conflict, e.g., security paranoia versus &#8220;ship it&#8221; code gen.</p><p>The solution: one conductor agent that handles simple work itself and a pool of specialists it calls when the task needs depth. The conductor stays cheap and fast for most requests. Specialists run only when needed. You need routing (who handles what), hybrid retrieval (not only vector search), and one store for documents, relationships, and chat (here, SurrealDB). The rest of this article is for people who want to see the wiring.</p><h2>Multi-Agent: When It&#8217;s Worth the Complexity</h2><p>The maintenance overhead is real. So let me be clear about when this makes sense.</p><p>I&#8217;d only do it when there are 3+ domains that actively conflict (dev, research, security is a classic triangle), when I care about cost per request (not &#8220;cost later&#8221;, cost now), and when I need failures to be contained so that one specialist can be dumb without contaminating the whole system.</p><p>Stay single-agent when tasks are similar, the tool count is under about 15, you do not need different model tiers, and simplicity beats per-task quality.</p><p>Single-agent favors simplicity. Multi-agent favors quality per task and adds orchestration. 
Pick your poison.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QFQw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QFQw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 424w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 848w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1272w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QFQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png" width="728" height="258.7042253521127" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:328,&quot;width&quot;:923,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:62221,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QFQw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 424w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 848w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1272w, https://substackcdn.com/image/fetch/$s_!QFQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9e9644-e94e-4aeb-a794-060b11fc073e_923x328.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>The Three-Layer Architecture</h2><p>I think of Screech as an orchestra. One conductor who knows the entire score but doesn&#8217;t play every instrument. Backed by specialists who are genuinely brilliant at their specific parts.</p><p><strong>Layer 1: Orchestration.</strong> It does the boring-but-hard parts: understanding intent, pulling context, and deciding whether this is &#8220;handle it now&#8221; or &#8220;call a specialist&#8221;. Three meta-tools carry most of the orchestration weight: discover subagents, invoke one subagent, and fan out to multiple subagents. A task router (Claude Haiku) classifies complexity before any expensive model runs. The runtime, memory management, workflow engine (with suspend/resume), and MCP server integration come from <a href="https://github.com/VoltAgent/voltagent">VoltAgent</a>. 
I didn&#8217;t build any of that infrastructure; I plugged into it.</p><p><strong>Layer 2: Specialists.</strong> 128 subagents in 10 categories, including core development, language specialists (TypeScript, Python, Rust, plus 19 more), testing &amp; quality, and meta-orchestration. More on why that number and why these categories in a bit.</p><p><strong>Layer 3: Knowledge.</strong> Hybrid retrieval combining vector search + knowledge graph traversal + keyword matching, all backed by SurrealDB. Plus a temporal layer (Graphiti-style) so the system knows when it learned something, not only what.</p><p>Here&#8217;s the key decision: Screech is a full agent with its own tools and retrieval, not a dumb router. It handles 60&#8211;70% of requests directly. Subagents only kick in when you need deep specialization. That keeps latency and cost sane for the common case.</p><p><strong>If you&#8217;ve used Claude Code, this pattern will feel familiar, but there is a key difference:</strong> Claude Code is one agent plus injected context (<code>CLAUDE.md</code>, conventions, slash commands). When you give it a task, the same agent handles everything, and it just gets extra context injected from your skill files. It&#8217;s the &#8220;enhanced single-agent&#8221; end of the spectrum: one brain, augmented with domain knowledge. Screech pushes further along that spectrum. Instead of injecting domain knowledge into one agent&#8217;s prompt, each specialist <em>is its own agent</em> with a dedicated system prompt, model tier, and tool set. The orchestrator doesn&#8217;t just get &#8220;React knowledge&#8221; injected; it delegates to a <code>react-specialist</code> agent that was born and bred to think in components, hooks, and JSX. 
The difference matters when domains actively conflict: a security auditor&#8217;s &#8220;assume everything is dangerous&#8221; mindset would poison a code generator&#8217;s &#8220;keep it simple&#8221; prompt if they shared the same context. Separate agents, separate prompts, no cross-contamination. Think of it as: Claude Code = one chef who reads different recipe books depending on the dish. Screech = a head chef who delegates to a pastry specialist, a sushi chef, and a grill master: each with their own kitchen and knives.</p><p>The diagram below shows how these layers connect. Here&#8217;s the flow from top to bottom:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H09h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H09h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 424w, https://substackcdn.com/image/fetch/$s_!H09h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 848w, https://substackcdn.com/image/fetch/$s_!H09h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1272w, https://substackcdn.com/image/fetch/$s_!H09h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!H09h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif" width="728" height="637.6066666666667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1051,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:2239121,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H09h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 424w, https://substackcdn.com/image/fetch/$s_!H09h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 848w, https://substackcdn.com/image/fetch/$s_!H09h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!H09h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d7cc48-530c-4698-8b87-4ed5de2925cd_1200x1051.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>User Interfaces</strong> layer exposes three entry points: a web UI (React), an MCP server (for IDE integration like Cursor), and a CLI terminal. All hit the same orchestration layer.</p><p><strong>Screech Web UI</strong> is a React-based interface for the Screech personal knowledge agent. 
It connects to the Screech backend and provides five main views: <strong>Sources</strong> (ingest and manage documents, view chunk/entity/pattern/insight stats), <strong>Notes</strong>, <strong>Chat</strong> (conversation with the agent), <strong>Search</strong> (query over your knowledge), and <strong>Knowledge Graph</strong> (browse entities and relationships). It also shows connection status and supports running synthesis. Its main use is to browse, chat, search, and explore your knowledge graph in one place.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cpGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cpGB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 424w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 848w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1272w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cpGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png" width="728" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:824,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1411086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cpGB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 424w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 848w, https://substackcdn.com/image/fetch/$s_!cpGB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cpGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b12d8e8-0201-40fa-8bb2-edb8a236c83a_3984x2256.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The <strong>Orchestration Layer</strong> is built around a master agent, the Screech agent (Claude Sonnet 4), which sits at the center and handles 60&#8211;70% of requests directly. 
Three supporting components surround it: the <strong>Task Router</strong> (Haiku, $0.0025/classification), the <strong>Event Bus</strong> (in-process pub/sub), and <strong>Persistent Memory</strong> (conversation history, user context).</p><p><strong>The 128 Subagents</strong> are arranged by category: Core Dev (11), Language Specialists (22), DevOps (15), Testing &amp; Quality (13), Domain-Specific (27), Business (12), Research (6), Dev Experience (12), and Meta-Orchestration (10). The orchestrator delegates to these when deep specialization is needed.</p><p><strong>Hybrid Retrieval</strong> sits between the agents and the database: <code>0.6 vector + 0.2 graph + 0.2 keyword</code>, merging three signals before final relevance scoring.</p><p><strong>SurrealDB</strong> acts as the persistence layer, split into three logical stores in one database: the <strong>Vector Store</strong> (MTREE index, 3072-dim embeddings, cosine similarity), the <strong>Knowledge Graph</strong> (entities, relationships, BFS traversal), and the <strong>Temporal Graph</strong> (Graphiti-inspired episodes, facts, time-range queries).</p><p>At a code level, Screech is just an <code>Agent</code> instance with four things wired in: a <strong>model</strong>, a <strong>hybrid retriever</strong>, a <strong>tool set</strong>, and <strong>persistent memory</strong>. This &#8220;agent factory&#8221; is the single place where the orchestration decisions become concrete.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// The Screech agent factory
const agent = new Agent({
  name: "Screech",
  purpose: "Unified personal agent for side projects, knowledge synthesis, " +
           "development, documentation, and orchestration of specialist subagents.",
  model: anthropic("claude-sonnet-4-20250514"),
  retriever,  // Hybrid RAG (vector + graph + keyword)
  tools: screechTools,  // Deduplicated from 3 domains
  memory,  // LibSQL persistent memory
});</code></pre></div><h2>11 Tables, One Database: The SurrealDB Model</h2><p>One of the strongest arguments for SurrealDB: documents, embeddings, knowledge graph, temporal events, and conversation memory in 11 tables, one database. No Postgres + Neo4j + Redis dance.</p><h3>Documents and Chunks</h3><p>Ingesting a document creates a <code>document</code> record (metadata plus a content hash for dedup), which is then split into <code>chunk</code> records. Each chunk gets a 3072-dim embedding (OpenAI <code>text-embedding-3-large</code>). SurrealDB&#8217;s MTREE index handles the cosine-similarity search. MTREE is a metric-tree index for high-dimensional vectors; it plays the same role as pgvector&#8217;s HNSW and IVFFlat indexes, although those are different data structures. It lets SurrealDB find the nearest embeddings without brute-force scanning every row. Chunks are multimodal. They store <code>image_data</code> (base64) and <code>mime_type</code> alongside text. This comes straight from Paul Iusztin&#8217;s insight: stop forcing PDFs through OCR. Treat them as images.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Multimodal chunk structure
interface Chunk {
  document_id: string;  // Parent document link
  content: string;      // Text or image description
  embedding: number[];  // 3072-dim vector (MTREE indexed)
  mime_type?: string;   // "text/plain", "image/png", "application/pdf"
  image_data?: string;  // Base64 for vision-processed pages
  page_number?: number; // PDF page tracking
}</code></pre></div><h3>Entities and the Graph (Your Ontology)</h3><p>Here&#8217;s where Screech diverges from typical RAG: I extract a structured knowledge graph. Claude identifies entities and relationships from each document. SurrealDB&#8217;s <code>RELATION</code> type makes this straightforward: entities live in an <code>entity</code> table, and edges live in a <code>relates_to</code> table defined with <code>TYPE RELATION IN entity OUT entity</code>. No separate graph DB.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- SurrealDB graph relationships (native support)
DEFINE TABLE relates_to SCHEMAFULL TYPE RELATION IN entity OUT entity;

RELATE entity:nextjs-&gt;relates_to-&gt;entity:react CONTENT {
  relation_type: "EXTENDS",
  confidence: 0.9,
  description: "Next.js extends React with SSR and routing"
};</code></pre></div><p>I picked 11 entity types on purpose. This is the system&#8217;s <strong>ontology</strong>, the vocabulary it uses to classify everything it learns: <code>concept</code>, <code>person</code>, <code>organization</code>, <code>tool</code>, <code>technology</code>, <code>pattern</code>, <code>best_practice</code>, <code>principle</code>, <code>process</code>, <code>document</code>, <code>topic</code>. Each type has its own extraction prompt (e.g., <code>person</code> focuses on roles and affiliations, <code>technology</code> on use cases and ecosystem). Relationship types include <code>IMPLEMENTS</code>, <code>USES</code>, <code>DEPENDS_ON</code>, <code>PART_OF</code>, <code>EXTENDS</code>, <code>SIMILAR_TO</code>. The ontology is deliberately small: enough granularity for useful graph queries without turning into a taxonomy nightmare. Bigger ontologies mean more edge cases and more &#8220;is this a tool or a technology?&#8221; ambiguity. Eleven types cover 95%+ of what a personal knowledge agent encounters.</p><h3>Episodes and Facts: Temporal Layer (It&#8217;s a Log)</h3><p>This is the <a href="https://github.com/getzep/graphiti">Graphiti</a>-inspired layer that most RAG systems completely skip. Every ingestion creates an episode. Think of episodes as an <strong>append-only log of everything the system has ever learned</strong>. Each episode is time-stamped and immutable. Episodes and entities are linked via <code>source_episode_ids</code>. Ingest a PDF, and you get a new episode. Process a paper, and you get a new episode. Episodes are never updated, overwritten, or deleted when new ones arrive; they stay in the timeline with their original timestamps. You can ask &#8220;what did I know about X six months ago?&#8221; and get an accurate answer.</p><p>Facts are triples (<code>subject</code>, <code>predicate</code>, <code>object</code>) with a <code>source_episode_id</code>. They capture the structured knowledge extracted alongside each episode. 
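</p><p>A fact triple can be pictured as a small record. The snippet is a minimal sketch rather than Screech&#8217;s actual schema: the <code>learned_at</code> field, the <code>pickLatest</code> helper, and the sample dates are assumptions made purely for illustration. It shows one way to pick the most recently learned fact for a given subject and predicate.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Hypothetical sketch, not Screech's actual schema.
interface Fact {
  subject: string;
  predicate: string;
  object: string;
  source_episode_id: string;
  learned_at: string; // assumed field: ISO-8601 time the system learned the fact
}

// Pick the most recently learned fact for one (subject, predicate) pair.
function pickLatest(facts: Fact[], subject: string, predicate: string): Fact | undefined {
  const newestFirst = facts
    .filter(function (f) { return f.subject === subject; })
    .filter(function (f) { return f.predicate === predicate; })
    .sort(function (a, b) { return Date.parse(b.learned_at) - Date.parse(a.learned_at); });
  return newestFirst[0]; // undefined when nothing matches
}

// Illustrative data only; the dates are made up for the example.
const facts: Fact[] = [
  { subject: "Bun", predicate: "is", object: "experimental", source_episode_id: "episode:1", learned_at: "2024-06-01T00:00:00Z" },
  { subject: "Bun", predicate: "is", object: "production-ready", source_episode_id: "episode:2", learned_at: "2025-01-15T00:00:00Z" },
];

const current = pickLatest(facts, "Bun", "is");
console.log(current ? current.object : "no match");</code></pre></div><p>Because <code>filter</code> returns a new array, sorting it leaves the original list untouched, which fits the append-only spirit of the log.</p><p>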
When two facts conflict, e.g., &#8220;Bun is experimental&#8221; (learned in June) and &#8220;Bun is production-ready&#8221; (learned the following January), the agent can prefer the more recent one.</p><p>Why does this matter? Knowledge changes. Without temporal tracking, both facts coexist in your knowledge base with equal weight, and the agent might confidently cite the stale one. Graphiti calls this &#8220;bi-temporal awareness&#8221;: tracking both when a fact was true in the world <em>and</em> when the system learned it.</p><p>Behind the scenes, a temporal query runs three steps: entity search (matching query terms), episode retrieval within a time range (chat episodes are filtered out, knowledge episodes are kept), and relevance filtering that links the results back to entities. The result is a time-ordered context. The returned <code>context</code> field is prompt-ready: entity descriptions and relationship sentences in plain language.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Time-aware queries
const recent = await queryTemporalGraph("recent learnings", {
  includeTemporal: true,
  timeRange: { start: oneMonthAgo, end: now },
});
// Returns: episodes + linked entities + relationships, time-ordered</code></pre></div><h3>Patterns and Insights: The &#8220;So What?&#8221; Chain</h3><p>Beyond storage, Screech runs a synthesis pipeline that detects patterns across your knowledge base and generates actionable insights. The <code>pattern</code> table stores detected patterns (workflows, successes, failures, optimizations) with embeddings for searchability. The <code>insight</code> table stores generated insights, each linked back to source patterns with relevance scores.</p><p>Documents become chunks, which get embedded. Chunks become entities through extraction. Entities become patterns once the synthesis pipeline starts noticing recurring signals. Patterns become insights (actionable takeaways with provenance). Each stage feeds the next. Every layer is searchable. A vague question goes to vector search over chunks. A relationship question goes to graph traversal over entities. &#8220;What should I do?&#8221; goes to insights.</p><h3>Conversation Memory</h3><p>The <code>user</code>, <code>thread</code>, and <code>message</code> tables handle conversation memory: Zep-style user summaries, conversation threading, and message history. This context persists across sessions but stays separate from the knowledge base.</p><p>The diagram below shows how data flows through the system. Think of it as five layers stacked on top of each other, each feeding the next:</p><ol><li><p><strong>Document layer</strong> (top): <code>document</code> and <code>chunk</code>. Raw material comes in here. A PDF becomes a <code>document</code> record; its content gets split into <code>chunk</code> records, each with a 3072-dim embedding. This is the foundation everything else builds on.</p></li><li><p><strong>Temporal layer</strong>: <code>episode</code> and <code>community</code>. Every ingestion event creates an <code>episode</code> timestamped to when the system learned it. Episodes link back to chunks (what was ingested) and forward to entities (what was extracted). 
This is the Graphiti-inspired time dimension&#8212;the system knows <em>when</em> it learned something, not just <em>what</em>.</p></li><li><p><strong>Knowledge graph layer</strong>: <code>entity</code>, <code>relates_to</code>, and <code>fact</code>. Entities extracted from chunks (concepts, technologies, people) live here, connected by typed <code>relates_to</code> edges. The diamond shape in the diagram represents the relationship table, which is a SurrealDB <code>RELATION</code> type that sits <em>between</em> entity nodes. <code>fact</code> triples (subject, predicate, object) capture the structured knowledge extracted alongside entities.</p></li><li><p><strong>Synthesis layer</strong>: <code>pattern</code> and <code>insight</code>. Patterns detected across your knowledge base (recurring workflows, success/failure signals, optimization opportunities) and actionable insights generated from those patterns. Each links back to the entities and episodes that sourced it.</p></li><li><p><strong>Conversation layer</strong> (bottom): <code>user</code>, <code>thread</code>, <code>message</code>. Conversation memory, separate from knowledge. Threads reference the user; messages reference threads. The agent can query conversation history independently of the knowledge base.</p></li></ol><p>The arrows in the diagram show the key relationships: chunks link to their parent document. Episodes link to chunks and entities. Entities connect via <code>relates_to</code>. Patterns and insights link back to entities and episodes for provenance. 
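</p><p>The hybrid retrieval weights quoted earlier (<code>0.6 vector + 0.2 graph + 0.2 keyword</code>) can be read as a weighted sum over per-signal scores. The sketch here is hypothetical, not the Screech retriever itself: it assumes each signal has already been normalized to the 0 to 1 range, and the names <code>SignalScores</code>, <code>hybridScore</code>, and <code>rank</code> are made up for the example.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Hypothetical sketch of the 0.6 / 0.2 / 0.2 merge; not the actual Screech retriever.
interface SignalScores {
  id: string;
  vector: number;  // cosine similarity, assumed normalized to [0, 1]
  graph: number;   // graph proximity score, assumed normalized to [0, 1]
  keyword: number; // keyword match score, assumed normalized to [0, 1]
}

// Weights taken from the article: 0.6 vector + 0.2 graph + 0.2 keyword.
const WEIGHTS = { vector: 0.6, graph: 0.2, keyword: 0.2 };

// Merge the three signals into one relevance score.
function hybridScore(s: SignalScores): number {
  return WEIGHTS.vector * s.vector + WEIGHTS.graph * s.graph + WEIGHTS.keyword * s.keyword;
}

// Rank candidates by merged score, highest first, without mutating the input.
function rank(candidates: SignalScores[]): SignalScores[] {
  return candidates.slice().sort(function (a, b) {
    return hybridScore(b) - hybridScore(a);
  });
}

const ranked = rank([
  { id: "chunk:a", vector: 0.9, graph: 0.1, keyword: 0.0 },
  { id: "chunk:b", vector: 0.5, graph: 1.0, keyword: 1.0 },
]);
console.log(ranked.map(function (c) { return c.id; }));</code></pre></div><p>In this toy input, <code>chunk:b</code> outranks <code>chunk:a</code> (about 0.70 versus 0.56) even though its vector similarity is lower, because the graph and keyword signals both back it.</p><p>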
Each layer is independently searchable via the hybrid retrieval pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N9Lq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N9Lq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 424w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 848w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1272w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif" width="1200" height="2188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2188,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2513810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N9Lq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 424w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 848w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1272w, https://substackcdn.com/image/fetch/$s_!N9Lq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24fd69fe-7267-414d-8e85-51a20564cc3c_1200x2188.gif 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h3>When to Use This vs. Alternatives</h3><p>Unified SurrealDB works when you want graph + vector + relational without three databases, your dataset is moderate (thousands to tens of thousands of docs), and you value dev velocity over ecosystem maturity.</p><p>For production SLAs, PostgreSQL plus pgvector is the safer bet. If your graph is only 2&#8211;3 hops (like Screech&#8217;s BFS), Postgres handles it with recursive CTEs or JOINs, even at millions of rows. Neo4j earns its place when you need deep traversals or heavy graph queries. Graphiti uses Neo4j for that. For my 2-hop, few-thousand-entity case, Postgres + pgvector could do it all. I chose SurrealDB to prototype faster with one schema and one connection. Right call for me. One schema file. One connection. One query language. 
I would not blindly recommend it for a team with compliance needs.</p><h2>Subagent System: Factory, Registry, Profiles</h2><p>Every subagent is a factory <code>(memory?) =&gt; Agent</code>. This keeps instantiation lazy (no subagent is created until needed) and lets agents share memory (agents in the same workflow see the same conversation history).</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Subagent factory pattern
export type SubagentFactory = (memory?: Memory) =&gt; Agent;

// Registry metadata for discovery
export interface SubagentDefinition {
  name: string;
  description: string;
  category: SubagentCategory;  // 10 categories
  tags: string[];
  modelTier?: ModelTier;       // fast | standard | reasoning
  toolProfile?: ToolProfile;   // core | dev | security | full
  capabilities?: SubagentCapabilities;
  factory: SubagentFactory;
}</code></pre></div><p>Three decisions that actually matter:</p><p><strong>Model tiers control cost.</strong> Not every agent needs Sonnet. Simple formatting &#8594; Haiku (~90% cheaper). Security audits &#8594; reasoning tier. Default is standard (Sonnet 4). Router can override.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const MODEL_MAP = {
  fast: anthropic("claude-3-5-haiku-20241022"),     // ~10x cheaper
  standard: anthropic("claude-sonnet-4-20250514"),   // Balanced
  reasoning: anthropic("claude-sonnet-4-20250514"),  // Same model, deeper prompts
};</code></pre></div><p><strong>Tool profiles prevent token waste.</strong> A research analyst doesn&#8217;t need git tools. A code reviewer doesn&#8217;t need security scanning tools. There are four profiles: <code>core</code> (7 tools): knowledge/RAG + file ops + workflow discovery; <code>dev</code> (15 tools): core + git + code analysis + testing; <code>security</code> (17 tools): dev + security scanning + dependency audit; and <code>full</code> (18 tools): everything (the backwards-compatible default).</p><p>Each subagent can add domain tools on top of its profile.</p><p><strong>Capability declarations.</strong> With 128 agents, &#8220;find agents tagged typescript&#8221; returns a dozen. The orchestrator needs to know what each agent is good at. Each subagent declares what it can do, its expected input and output, and its latency tier. Semantic matching sends &#8220;TypeScript conditional types&#8221; to the agent whose <code>canDo</code> includes &#8220;conditional types&#8221; and &#8220;type system design&#8221;, not to any agent that merely has TypeScript in its tags. Two agents can share the same <code>languages</code> field and still differ in <code>canDo</code>, e.g., <code>typescript-pro</code> vs. <code>react-specialist</code>.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">capabilities: {
  canDo: ["type system design", "generics", "conditional types"],
  languages: ["typescript", "javascript"],
  inputSchema: "code snippet or type problem description",
  outputSchema: "typed solution with explanation",
  latencyTier: "medium",
}</code></pre></div><p><strong>Why 10 categories?</strong> Flat list worked until about 40 agents. Then discovery got noisy. Categories are a coarse filter: the orchestrator picks a category, then finds the right specialist within it. The split follows prompt conflicts: security paranoia vs. code-gen creativity, test-engineer adversarial vs. technical-writer explanatory. Separate categories, separate prompts.</p><p>At the same time, <strong>agents in the same category share a tool profile but differ in expertise.</strong> All language specialists get the <code>dev</code> tool profile (git, testing, code analysis). All testing-quality agents share <code>dev</code> tools, too, but their prompts are tuned for finding problems, not writing code. Security agents get <code>security</code> tools. Research agents only need <code>core</code> tools (knowledge/RAG). The table below summarizes counts and examples. The diagram repeats it visually.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T_gi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T_gi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 424w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 848w, 
https://substackcdn.com/image/fetch/$s_!T_gi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1272w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T_gi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png" width="728" height="420.6047516198704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:535,&quot;width&quot;:926,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:162791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T_gi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 424w, 
https://substackcdn.com/image/fetch/$s_!T_gi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 848w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1272w, https://substackcdn.com/image/fetch/$s_!T_gi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e57724a-028a-4ffa-ac91-d76a42ea4aef_926x535.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N9Lo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N9Lo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:158842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N9Lo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 424w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 848w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1272w, https://substackcdn.com/image/fetch/$s_!N9Lo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de746d-833b-4df7-b2bd-c8441306eaef_1200x675.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>How VoltAgent Fits Into the Picture</h3><p>VoltAgent is the runtime layer that makes the orchestration practical: it provides workflow primitives, tool execution, memory management, and suspend/resume so the orchestrator and subagents can run as a coordinated system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5a7I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5a7I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 424w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 848w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5a7I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png" width="728" height="406" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:469712,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5a7I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 424w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 848w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1272w, https://substackcdn.com/image/fetch/$s_!5a7I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddd739d1-c162-46f8-af88-6d9f30dfda5a_3942x2198.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>The $0.0025 Routing Layer</h2><p>This is what makes the economics work.</p><p><strong>Step 1: Classification.</strong> Before anything expensive runs, I make a tiny classification call (Haiku) to label the request with four things: complexity, domain, suggested tier, and up to three candidate specialists. The system prompt gives Haiku the full category list, and the rules are explicit: &#8220;security&#8221; or &#8220;vulnerability&#8221; always goes to reasoning, a simple question goes to fast, and code generation goes to at least standard.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">import { z } from "zod"; // the schema below relies on the zod package

// Structured output that the Haiku classifier must return
const classificationSchema = z.object({
  complexity: z.enum(["simple", "moderate", "complex"]),
  domain: z.enum(["lookup", "formatting", "code-generation",
                  "code-review", "architecture", "security",
                  "debugging", "research", "orchestration", "other"]),
  reasoning: z.string(),
  suggestedTier: z.enum(["fast", "standard", "reasoning"]),
  suggestedSubagents: z.array(z.string()).max(3),
});</code></pre></div><p><strong>Step 2: Domain overrides.</strong> <code>resolveRoutedTier()</code> takes the complexity-based tier and domain overrides and picks the <em><strong>higher</strong></em> of the two. So a &#8220;simple&#8221; security question still goes to reasoning. Security that looks simple often is not. The override is a safety net for Haiku&#8217;s optimism.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const DOMAIN_TIER_OVERRIDES = {
  security: "reasoning",     // No shortcuts
  architecture: "reasoning", // No shortcuts
  debugging: "reasoning",    // No shortcuts
  "code-review": "standard", // At least standard
};</code></pre></div><p><strong>Step 3: Final routing.</strong> The resolved tier, the suggested subagents, and the rationale go to the orchestrator, which picks the model and the specialists.</p><p><strong>Fallback.</strong> If classification fails (network error, timeout), we default to <code>{ complexity: "moderate", tier: "standard" }</code>. Fail to the middle: not the cheapest tier (it might undershoot), and not the most expensive (wasted spend on every failure). It is the safest choice with zero information.</p><p><strong>Cost:</strong> ~$0.0025 per classification (~$0.25/M input tokens on Haiku). Route 1,000 tasks, spend $2.50. If even 30% land on Haiku instead of Sonnet, you save on the order of $8-10 per 1,000 tasks. The router pays for itself quickly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tePE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tePE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 424w, https://substackcdn.com/image/fetch/$s_!tePE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 848w, https://substackcdn.com/image/fetch/$s_!tePE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!tePE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tePE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif" width="1200" height="1485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1485,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2084570,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tePE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 424w, https://substackcdn.com/image/fetch/$s_!tePE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 848w, 
https://substackcdn.com/image/fetch/$s_!tePE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1272w, https://substackcdn.com/image/fetch/$s_!tePE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a4513bc-dd44-4c65-9544-871d17b87c7f_1200x1485.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Hybrid Retrieval: Three Signals Beat One</h2><p>I started with pure vector search. It worked until it didn&#8217;t. 
Three failure modes:</p><ol><li><p><strong>Structural queries fell flat.</strong> &#8220;What tools does the API designer use?&#8221; The answer is in the relationship structure. Vector search gave me chunks that mentioned the API designer, not the ones describing its tool config. I needed graph traversal.</p></li><li><p><strong>Exact-match queries got paraphrased away.</strong> &#8220;What is the error for SQLITE_BUSY?&#8221; Embeddings map that into the &#8220;database locking&#8221; neighborhood and miss the chunk with the actual error code. I needed keyword search.</p></li><li><p><strong>Long-document questions needed reasoning, not similarity.</strong> &#8220;What are the conclusions?&#8221; The conclusion section is often not the chunk most similar to the word &#8220;conclusions&#8221;. The intro restating the thesis can score higher. I needed the model to reason over document structure (e.g., a table of contents), not only similarity.</p></li></ol><p>Instead of trying to make one approach handle everything, I split retrieval into three paths:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Ru_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Ru_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 424w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 848w, 
https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1272w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png" width="923" height="280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:280,&quot;width&quot;:923,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Ru_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 424w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 848w, 
https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1272w, https://substackcdn.com/image/fetch/$s_!6Ru_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aff88ce-f6c3-4a8b-8f5b-a4ec3b10b9a7_923x280.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The agent picks which tool to call, and each tool maps to one path:</p><ol><li><p><code>document_search</code> / <code>get_context</code> &#8594; hybrid retrieval (vector + graph + keyword, then rerank).</p></li><li><p><code>search_within_document</code> &#8594; the same pipeline, scoped to one document.</p></li><li><p><code>answer_from_document_deep</code> &#8594; build a section tree from chunks, let the LLM pick sections, and fetch only those chunks. No vectors on this path.</p></li></ol><h3>Why Three Signals Beat One (Paths 1 &amp; 2)</h3><p>The hybrid pipeline runs three queries <strong>in parallel</strong>, then merges the results with configurable weights:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const DEFAULT_WEIGHTS = {
  vectorWeight: 0.6,   // Semantic similarity (primary)
  graphWeight: 0.2,    // Relationship traversal
  keywordWeight: 0.2,  // Exact matching
};</code></pre></div><p>Why these weights? <strong>Vector search</strong> gets 0.6 because most knowledge base questions are conceptual, such as &#8220;explain X&#8221; or &#8220;how does Y work?&#8221;, and embeddings handle these well. I use OpenAI&#8217;s <code>text-embedding-3-large</code> (3072 dims) with SurrealDB&#8217;s native MTREE index and cosine similarity. The threshold is intentionally low (0.35), but you can tweak that value according to your use case. Similarity scores compress into a narrow range, so 0.35 is more selective than it sounds.</p><p><strong>Graph search</strong> gets 0.2 because structural queries are the minority but high-value. It works in two stages. First, it finds entities that match the query by name or description, basically a simple text search over the <code>entity</code> table (concepts, technologies, people, organizations). Then it <em>expands</em> outward from those matches using breadth-first search (BFS): for each matched entity, it queries the <code>relates_to</code> edges in SurrealDB to discover neighbors, scoring them at 70% of the parent&#8217;s relevance. A configurable traversal depth (default: 2 hops) controls how far the expansion goes. It should be deep enough to find meaningful connections, but shallow enough to avoid pulling in the entire graph.</p><p>So when vector search finds a chunk mentioning &#8220;React&#8221;, graph search starts at the &#8220;React&#8221; entity node, walks its edges, and pulls in &#8220;hooks&#8221;, &#8220;server components&#8221;, or &#8220;Next.js&#8221;, without needing those terms in the original query. Need the path between two concepts? A separate BFS finds the shortest connection: <code>React &#8594; EXTENDS &#8594; JavaScript &#8594; USES &#8594; V8</code>, each hop following typed relationships (<code>IMPLEMENTS</code>, <code>USES</code>, <code>DEPENDS_ON</code>, <code>EXTENDS</code>, <code>SIMILAR_TO</code>). 
This is the signal that vector search fundamentally <em>cannot</em> provide, because relationships are structural, not semantic.</p><p><strong>Keyword search</strong> gets 0.2 because sometimes you just need to find the exact string. Ask a pure vector system &#8220;what version of React does project X use?&#8221; and it&#8217;ll confidently return chunks about React 17, React 18.2, and React 19, because to an embedding model they&#8217;re all basically &#8220;React with a number.&#8221; Helpful if you&#8217;re writing an essay. Useless if you need the actual version pinned in your <code>package.json</code>. Keyword search is the boring friend who actually reads the label: full-text matching with term coverage scoring. No AI magic, just string comparison. And for error codes, version numbers, and config keys, that&#8217;s exactly what you want.</p><p>After merging, results go through <strong>reranking</strong>, and this is where the quality jump happens. The weighted merge gets you close, but reranking catches cases where a high-scoring vector result is semantically related but does not actually <em>answer the question</em>.</p><p>The reranker supports three methods, selectable per query:</p><ul><li><p><strong>Embedding reranking</strong> (fast, cheap): recalculates cosine similarity between the query embedding and each result&#8217;s embedding, then blends it 50/50 with the original merge score. This catches results that scored well on the graph or keyword signal but are semantically distant from the actual query. It is fast because it&#8217;s just math, with no LLM call needed.</p></li><li><p><strong>LLM reranking</strong> (slower, more accurate): sends the query + top 20 candidate passages to Claude Sonnet 4, which scores each on a 0&#8211;1 relevance scale. The LLM understands <em>intent</em>: it knows that &#8220;how do I fix CORS errors?&#8221; is asking for a solution, not a definition. 
It sits behind an LRU cache (128 entries, 5-minute TTL) to avoid redundant calls for similar queries.</p></li><li><p><strong>Hybrid reranking</strong> (two-pass): embedding reranking first to narrow the candidate set, then LLM reranking on the survivors. Best quality, highest latency.</p></li></ul><p>On top of any reranking method, there&#8217;s an optional <strong>diversity-aware mode</strong> using MMR (Maximal Marginal Relevance). It iteratively selects results that maximize relevance while penalizing similarity to already-selected results, which prevents returning five chunks from the same paragraph. There is also a <strong>source-type preference</strong> layer that weights chunks, entities, patterns, and insights differently depending on the query type.</p><h3>Why Reasoning Beats Similarity for Long Documents (Path 3)</h3><p>This is the insight that took me the longest to internalize. For a 300-page PDF, when someone asks &#8220;what are the conclusions?&#8221;, the <em>location</em> of the answer is a function of document <em>structure</em>, not content similarity. A chunk from the introduction that restates the thesis will often score higher on cosine similarity to &#8220;conclusions&#8221; than the actual conclusion section. More embedding dimensions won&#8217;t fix this. Better chunking strategies help, but don&#8217;t solve it.</p><p>Path 3 skips vector search entirely. The pipeline has three steps:</p><p><strong>Step 1: Build the section tree.</strong> Take the document&#8217;s chunks (already stored from ingestion) and group them into sections. If chunks have page numbers (PDFs), group by page. Otherwise, group into fixed-size windows (default: 4 chunks per section). Each section node gets an ID, a title (&#8220;Page 12&#8221; or &#8220;Section 5&#8221;), and a short summary (first ~220 characters of the first chunk). 
The result is a flat list of <code>TreeNode</code> objects, essentially a reconstructed table of contents.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">interface TreeNode {
  node_id: string;        // "s0", "s1", "s2"...
  title: string;          // "Page 12" or "Section 5"
  summary: string;        // First ~220 chars of first chunk
  startChunkIndex: number;
  endChunkIndex: number;
  pageRange?: string;     // "pp. 12&#8211;14"
}</code></pre></div><p><strong>Step 2: LLM selects relevant sections.</strong> The tree outline (node IDs + titles + summaries) is sent to Claude in a single prompt. The key instruction: <em>use reasoning, not keyword matching</em>. The prompt explicitly tells the LLM to think structurally, e.g., &#8220;conclusions are usually in the final section&#8221; or &#8220;&#8216;see Appendix G&#8217; means look for an appendix section.&#8221; The LLM returns a JSON object with its reasoning and a list of selected node IDs.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// What the LLM sees (abbreviated)
// - s0: Page 1 &#8212; "Chapter 1: Introduction. This paper presents..."
// - s1: Page 2 &#8212; "Related work in retrieval-augmented generation..."
// - ...
// - s14: Page 28 &#8212; "7. Conclusions and Future Work. We have shown..."

// What the LLM returns
{
  "thinking": "Conclusions are in the final sections. s14 title mentions Conclusions.",
  "node_list": ["s14"]
}</code></pre></div><p><strong>Step 3: Fetch and return.</strong> Map the selected node IDs back to chunk index ranges, fetch those chunks, and concatenate their content. That&#8217;s your retrieval context. No embedding comparison anywhere in the pipeline.</p><p>The fallback is important: if the LLM returns invalid JSON or no valid node IDs, the system defaults to the first 2&#8211;3 sections. Better to return <em>something</em> than nothing, and introductory sections are a reasonable default for most questions.</p><p>The design is directly inspired by <a href="https://github.com/VectifyAI/PageIndex">PageIndex</a>&#8217;s thesis: similarity &#8800; relevance, and reasoning over document structure often beats embedding search for professional long-form content. It won&#8217;t help for vague conceptual questions; that&#8217;s what path 1 is for. But for &#8220;where in this document does X live?&#8221; or &#8220;what does chapter 7 say about Y?&#8221;, it&#8217;s dramatically better because the LLM can reason about document organization the way a human reader would: by scanning the table of contents first.</p><p><strong>Document-scoped (path 2)</strong> simply narrows the same hybrid pipeline to one document via a <code>documentIds</code> filter. 
Same three signals, same reranker&#8212;just scoped.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j0uI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j0uI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 424w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 848w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1272w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j0uI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif" width="1200" height="1556" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1556,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3325409,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j0uI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 424w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 848w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1272w, https://substackcdn.com/image/fetch/$s_!j0uI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a60174-b01a-4df4-b7b1-8f432f3fc64b_1200x1556.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The temporal knowledge graph is directly inspired by <a href="https://github.com/getzep/graphiti">Graphiti</a>, Zep&#8217;s framework for building real-time knowledge graphs. Their core insight: knowledge isn&#8217;t static. Tracking <em>when</em> facts were learned matters as much as the facts themselves. Perfect for a personal agent that continuously ingests new content.</p><p>Every ingested piece creates an &#8220;episode&#8221;: a timestamped event that links to the extracted entities and fact triples (subject-predicate-object). That enables time-aware queries such as &#8220;What technologies have I been reading about this month?&#8221; or &#8220;How has my understanding of RAG changed?&#8221;</p><h2>Four Production Patterns That Actually Saved Me</h2><p>These emerged from running this thing in the wild. 
I&#8217;d recommend all four to anyone building multi-agent systems.</p><h3>1. LLM Resilience with Tiered Timeouts</h3><p>Every LLM call goes through <code>withLLMResilience()</code>, a wrapper that adds per-attempt <code>AbortController</code> timeouts, exponential backoff with jitter, and retries only for rate limits, 5xx responses, and network errors. It never retries other 4xx errors (your fault, not theirs). Timeouts differ per use case: classification gets 60s (if it takes that long, fail), synthesis gets 300s (different budget). I learned this the hard way. One stuck call should not hold everything up.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">export const LLM_TIMEOUTS = {
  fast: { timeoutMs: 60_000, maxRetries: 3 },      // Classification, reranking
  standard: { timeoutMs: 120_000, maxRetries: 2 }, // Agent generation
  long: { timeoutMs: 300_000, maxRetries: 1 },     // Synthesis, deep analysis
};

// Usage
const result = await withLLMResilience(
  (signal) =&gt; anthropic.messages.create({ ... }, { signal }),
  { ...LLM_TIMEOUTS.fast, label: "task-classification" }
);</code></pre></div><p>Key insight: different operations need different timeout budgets. Classification call taking 60 seconds? Failed. Synthesis operation taking 60 seconds? Just warming up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CuIu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CuIu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 424w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 848w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CuIu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png" width="1456" height="847" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:803494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CuIu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 424w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 848w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!CuIu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe20d2352-c18b-4906-b88d-962b4d9fca33_3974x2312.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>2. Findings Cache for Review Chains</h3><p>When you run multiple reviewers on the same code (code-reviewer &#8594; security-auditor &#8594; design-analyst), each reviewer produces findings that downstream reviewers need. Without sharing, every reviewer re-parses the same files, re-discovers the same structure, and wastes tokens on duplicate analysis.</p><p>The <code>FindingsCache</code> is a singleton in-memory cache keyed by chain ID. 
It stores two things: <strong>structural analysis</strong> (file structure, dependencies, symbols, complexity metrics) produced by the first reviewer, and <strong>accumulated findings</strong> from every reviewer in the chain&#8212;each typed with category, severity, source, and location.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Each finding is typed and traceable
interface ReviewFinding {
  source: string;     // Which reviewer produced it
  category: "structural" | "quality" | "security" | "design" | "performance";
  severity: "info" | "low" | "medium" | "high" | "critical";
  summary: string;    // Human-readable
  data?: Record&lt;string, unknown&gt;;  // Structured data per reviewer type
  location?: string;  // File/line reference
}</code></pre></div><p>The first reviewer in the chain caches the expensive structural work. Later reviewers call <code>getChainContextForReviewer()</code>, which returns the previous findings plus the structural cache as a prompt-ready string. Every finding stays typed (source, category, severity, location). Entries expire after 10 minutes, and the cache is capped at 50 chains. This cuts chain latency by about 40%: the expensive part is parsing and context building, not the LLM. Pattern credit: <a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a>. Traceable decisions are also a performance win.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// First reviewer caches the structure
findingsCache.setStructuralCache(chainId, {
  fileStructure: "src/api/users.ts - 245 lines, 3 exports",
  dependencies: ["express", "zod", "prisma"],
  symbols: ["createUser", "validateInput", "UserSchema"],
  metrics: { cyclomaticComplexity: 12, loc: 245 },
});

// Later reviewers get pre-built context
const context = findingsCache.getChainContextForReviewer(chainId, "security-auditor");
// Returns previous findings + cached structure, prompt-ready</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!25rx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!25rx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 424w, https://substackcdn.com/image/fetch/$s_!25rx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 848w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1272w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!25rx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif" width="1200" height="2126" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2126,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4663699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!25rx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 424w, https://substackcdn.com/image/fetch/$s_!25rx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 848w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1272w, https://substackcdn.com/image/fetch/$s_!25rx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c0f132c-2047-4a97-b319-cb587b5fbfb8_1200x2126.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>What the Workflow Looks Like in Practice</h4><p>The screenshot below is a real run of a multi-step workflow, showing the chain of specialist calls and the event-style logging that makes the system debuggable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EHL2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!EHL2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 424w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 848w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1272w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EHL2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png" width="1456" height="875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:754816,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EHL2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 424w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 848w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1272w, https://substackcdn.com/image/fetch/$s_!EHL2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bae90ab-8811-45df-8638-b47b69ddf6be_3981x2392.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>3. In-Process Event Bus</h3><p>Agents need to share findings without tight coupling. The solution is a singleton in-memory pub/sub with typed events, well-known topics (<code>VULNERABILITY_FOUND</code>, <code>CODE_REVIEW_COMPLETE</code>, etc.), a source agent ID plus a correlation ID for tracing across a review chain, and a typed payload. Security auditor finds a vulnerability? It publishes to <code>vulnerability_found</code>. The code reviewer subscribes and incorporates the finding.</p><p>The implementation is a single <code>AgentEventBus</code> class with no external dependencies.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Typed event structure
interface AgentEvent&lt;T = unknown&gt; {
  id: string;           // Auto-generated: evt_&lt;timestamp&gt;_&lt;counter&gt;
  topic: string;        // Well-known topic (e.g., "vulnerability_found")
  source: string;       // Publishing agent name
  data: T;              // Typed payload
  timestamp: string;    // ISO timestamp
  correlationId?: string; // Chain/session tracing
}</code></pre></div><p>The key design choice: <strong>fire-and-forget delivery</strong>. When an agent publishes, subscribers are notified via <code>Promise.allSettled()</code>. Slow or failing subscribers never block the publisher. Handler errors are caught and logged, never thrown.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">// Publisher (security-auditor)
await eventBus.publish("vulnerability_found", "security-auditor", {
  severity: "critical",
  type: "sql-injection",
  location: "src/api/users.ts:42",
});

// Subscriber (code-reviewer) registered at startup
eventBus.subscribe("vulnerability_found", async (event) =&gt; {
  // Incorporate into review findings
});</code></pre></div><p>Late-joining subscribers can replay event history (last 50 per topic). Events auto-expire via a 5-minute TTL, and cleanup runs every 100 events, which keeps memory bounded without a background timer. Source filtering lets subscribers receive events only from specific agents.</p><h3>4. Live Evaluation with Sampling</h3><p>Production traffic gets evaluated by moderation and relevance scorers at configurable sampling rates. Moderation runs on 20% of requests (cheap). Relevance scoring runs on 10% (LLM judge, expensive). Both run asynchronously and never slow down the user-facing response.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sZJp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sZJp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 424w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 848w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1272w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!sZJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png" width="1456" height="799" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:508794,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188280936?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sZJp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 424w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 848w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1272w, https://substackcdn.com/image/fetch/$s_!sZJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea80761-446d-4398-9255-ca5a84eacc1c_3984x2186.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>What I&#8217;d Change: Real Talk</h2><p><strong>The in-memory event bus doesn&#8217;t survive restarts.</strong> Fine for a personal side-project agent. Terrible for a production system serving a team. Durable workflow engines like <a href="https://www.prefect.io/">Prefect</a>, <a href="https://temporal.io/">Temporal</a>, or <a href="https://www.dbos.dev/">DBOS</a> solve this really well with less infrastructure overhead than rolling your own durability with Redis Streams or NATS. Current design optimizes for simplicity, not resilience. 
I made that trade knowingly.</p><p><strong>128 subagents is ridiculous.</strong> Pareto wins: ~20 agents do 80%+ of the work. The long tail exists because adding a subagent costs basically nothing (factory + registry entry). I should prune the ones that never get used. Future me will regret not doing that earlier.</p><p><strong>SurrealDB as a unified store is elegant but young.</strong> Graph + vector + relational in one database? Architecturally clean. But documentation gaps cost me time. For strict SLAs, I would use Postgres plus pgvector. For 2- to 3-hop graph queries, Postgres is enough. Reach for Neo4j when you need deep graph traversal. I chose SurrealDB to move fast with one DB. Wouldn&#8217;t push it on a team with compliance requirements.</p><p><strong>Haiku routing adds ~500ms latency.</strong> Noticeable in interactive chat. Negligible in background workflows. For latency-critical paths, consider static routing rules (if the tool is <code>security_scan</code>, always use the reasoning tier) and only invoke the dynamic router for ambiguous tasks.</p><p><strong>Workflow suspend/resume is powerful but adds state complexity.</strong> The 70+ workflows support human-in-the-loop via suspend/resume: a workflow pauses, waits for human input, and continues. Great for approval flows (expense reports, code reviews). Terrible for state management. Every suspended workflow is a piece of state that can go stale. I&#8217;ve had workflows suspended for weeks because I forgot about them. Take that as you will.</p><h3>The Elephant in the Room: Multi-Agent Is Hard for Everyone</h3><p>This applies to Screech and everything like it: Claude Code with Skills, custom orchestrators, the lot.</p><p><strong>Overconfidence</strong>. One wrong assumption in step 2 of a 10-step workflow and you get a confidently wrong result. Isolated specialists with focused prompts help, but don&#8217;t remove it. 
I still see invented APIs and wrong architectural assumptions.</p><p><strong>More agents do not mean better output.</strong> Coordination overhead, conflicting findings, more for the human to reconcile. The findings cache and event bus make communication explicit and traceable, but someone still has to review. Synthesis is an LLM summarizing other LLMs. The chain can be long.</p><p><strong>Oversight tax</strong>. You spend more time reviewing and redirecting than writing. PR review times go up in high-adoption teams (e.g., by 91%). Comprehension debt: the more you delegate, the less you understand your codebase. Review becomes rubber-stamping. Screech does not fix that.</p><p><strong>Token bloat</strong>. Tool schemas, prompts, skills. You can blow past 50k tokens before the agent does useful work. I keep tool profiles tight (7&#8211;18 tools per agent). Complex runs still burn tokens.</p><p><strong>Credentials</strong>. For a side project, it&#8217;s manageable. For production with real APIs and DBs, auth and secrets become a project. A lot of agent efforts reportedly fail to scale there. Not the AI. The plumbing. (My therapist has asked me not to elaborate.)</p><p>I&#8217;m building Screech knowing these limits. The design mitigates some of them. It does not remove them. Multi-agent amplifies both capability and failure modes. Build guardrails.</p><h2>Standing on Shoulders</h2><p><a href="https://github.com/VoltAgent/voltagent">VoltAgent</a>. TypeScript agent framework. Runtime, memory, workflows, MCP, observability. Saved me months.</p><p><a href="https://github.com/getzep/graphiti">Graphiti</a> (Zep). Temporal knowledge graph. Episodes, bi-temporal awareness. &#8220;Knowledge changes&#8221; shaped my RAG thinking.</p><p><a href="https://www.decodingai.com/p/stop-converting-documents-to-text">Decoding AI, AI Agents Foundations</a> (Paul Iusztin). Treat docs as images, not OCR. 
ReAct, tools, memory.</p><p><a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a> (Justin Narracott). Traceable, auditable decisions as a performance pattern. Findings cache and review chains owe a lot here.</p><p><a href="https://github.com/VectifyAI/PageIndex">PageIndex</a> (Vectify AI). Reasoning over document structure instead of pure similarity. Path 3 (tree-search) is inspired by this.</p><h2>Three Patterns Worth Stealing</h2><p>You don&#8217;t need 128 subagents or a temporal knowledge graph. Here are the three ideas that transfer to any multi-agent system:</p><ol><li><p><strong>Route cheap before routing expensive.</strong> A $0.0025 classification call that routes 30% of tasks to a model that is 90% cheaper? Pays for itself on the first batch. Even without subagents, using a small model to decide whether a task needs your large model is almost always worth it.</p></li><li><p><strong>Not every agent needs every tool.</strong> Tool profiles cut token usage, improve tool selection accuracy, and keep prompts focused. A research analyst with 7 tools outperforms the same analyst drowning in 18 tools they&#8217;ll never use.</p></li><li><p><strong>Hybrid retrieval beats any single method.</strong> Vector search handles 70% of queries. Graph traversal and keyword matching cover the other 30% (structural queries, exact-match lookups, relationship questions that embeddings silently botch).</p></li></ol><p>The multi-agent pattern isn&#8217;t inherently better. It&#8217;s a trade: quality per task versus orchestration complexity. Start with a single capable agent. When quality degrades across diverse tasks, reach for these patterns. The hard part isn&#8217;t the agents. It&#8217;s the routing, the retrieval, the resilience.</p><p><em>Screech runs on <a href="https://github.com/VoltAgent/voltagent">VoltAgent</a> (agent framework), SurrealDB (multi-model database), and Anthropic Claude (LLM). 
Architectural inspiration from <a href="https://github.com/getzep/graphiti">Graphiti</a>, <a href="https://github.com/JustinNarracott/agentic-playbooks">Agentic Playbooks</a>, <a href="https://github.com/VectifyAI/PageIndex">PageIndex</a>, and the <a href="https://www.decodingai.com/">Decoding AI</a> community. Built for personal side-project workloads. Adapt the patterns to your scale.</em></p><p>&#8216;Till next time</p><p><a href="https://substack.com/@lucianlature">Lucian Lature</a> | <a href="https://www.linkedin.com/in/lucianlature/">LinkedIn</a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. 
Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. 
</strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[How to Design Evaluators That Catch What Actually Breaks]]></title><description><![CDATA[The practical guide to code-based checks, LLM judges, and rubrics for real-world AI apps]]></description><link>https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures</link><guid isPermaLink="false">https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures</guid><dc:creator><![CDATA[Paolo Perrone]]></dc:creator><pubDate>Tue, 03 Mar 2026 12:02:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!a1uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><strong>How to Design Evaluators</strong> &#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h2>How to Design Evaluators</h2><p>You have a dataset. You&#8217;ve manually labeled examples. You&#8217;ve fixed the obvious bugs. Now you need evaluators that can run automatically and catch problems before users do.</p><p>But here&#8217;s what trips up most teams: they build evaluators that check for things nobody cares about, or they use off-the-shelf metrics that sound impressive but don&#8217;t match their actual use case.</p><p>Three months ago, I spent a weekend building what I thought was a comprehensive evaluation suite for an AI agent that drafted replies to customer support tickets. 
I had ROUGE scores, BLEU scores, semantic similarity metrics, the works. Everything from the NLP textbook.</p><p>Then I ran it on production traces. The evaluators gave perfect scores to replies that were factually wrong, missed the customer&#8217;s actual question, and used the wrong tone for frustrated users. Meanwhile, because overlap metrics like ROUGE and BLEU reward matching the reference&#8217;s wording, they penalized perfectly good replies for using different words than the reference answer.</p><p>That&#8217;s when I realized: generic metrics optimize for academic benchmarks, not business outcomes. (And no, I&#8217;m not saying academic metrics are useless. They&#8217;re just solving a different problem than &#8220;did this agent do what my users needed?&#8221;)</p><p>The solution is to design evaluators that match your specific success criteria. Not what worked for someone else&#8217;s summarization task. Not what scored well on SQuAD. What actually matters for your users in your use case.</p><p><strong>In this article, we will cover:</strong></p><ul><li><p>The evaluation harness: infrastructure that runs evals end-to-end</p></li><li><p>Dataset and metric types: direct scoring vs. pairwise vs. reference-based</p></li><li><p>Model evaluation vs. app evaluation (and why benchmarks lie)</p></li><li><p>Components of an evaluator: reference examples, metrics, rubrics</p></li><li><p>When to use code-based checks vs. 
LLM judges</p></li><li><p>Common mistakes (and how to avoid them)</p></li><li><p>Advanced metric designs for multi-turn conversations and agentic workflows</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a1uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Designing evaluators for AI applications: from code-based checks to LLM judges.&quot;,&quot;title&quot;:&quot;Designing evaluators for AI applications: from code-based checks to LLM judges.&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Designing evaluators for AI applications: from code-based checks to LLM judges." title="Designing evaluators for AI applications: from code-based checks to LLM judges." srcset="https://substackcdn.com/image/fetch/$s_!a1uV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!a1uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad91d7a-490d-4b4e-ac91-0f48e10bccc7_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 1: Designing evaluators for AI applications: from code-based checks to LLM judges.</em></figcaption></figure></div><p><em>Before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. 
</p><p>We use Opik daily across our courses and AI products. Not just for observability, but to design and run the exact evaluators this article teaches: custom LLM judges, code-based checks, and experiments. All from the same platform.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oSDm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p>This article shows you how to design evaluators. Opik gives you the harness to run them at scale. 
Here is how we use it:</p><ul><li><p><strong>Custom LLM judges with rubrics.</strong> Build the evaluators this article describes: define your criteria, add few-shot examples, and run them across hundreds of traces automatically.</p></li><li><p><strong>Run experiments, compare results.</strong> Test different prompts, models, or configurations side by side. Opik scores each variant with your evaluators and shows you which one wins.</p></li><li><p><strong>Plug evaluators into production.</strong> The same LLM judges you design for testing run on live traces too. Set up alerts when scores drop below your threshold so you catch regressions before users do.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully <strong>open-source</strong> and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier):</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em> <em>Now, let&#8217;s move back to the article.</em></p><h2>Understanding the Evaluation Harness</h2><p>You can&#8217;t manually run 500 test cases. You need automation.</p><p>The infrastructure that runs evals end-to-end is called an <strong>evaluation harness</strong> [1]. 
It loads your dataset, executes your agent on each test case, captures all the outputs and traces, runs your graders, and aggregates the scores into something you can actually use.</p><p>Think of it like pytest for AI apps. Except instead of checking if a function returns the right type, you&#8217;re checking if an LLM generated text that accomplishes a business goal.</p><p>Here&#8217;s what a harness does:</p><ol><li><p><strong>Loads tasks</strong> from your evaluation dataset</p></li><li><p><strong>Provides instructions and tools</strong> to the agent (system prompts, available functions, etc.)</p></li><li><p><strong>Runs tasks</strong> (often in parallel across multiple trials because LLM outputs vary)</p></li><li><p><strong>Records everything</strong>: inputs, outputs, tool calls, reasoning traces, intermediate states</p></li><li><p><strong>Runs graders</strong> on the results (your evaluators)</p></li><li><p><strong>Aggregates scores</strong> across trials and tasks</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tu0r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tu0r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!tu0r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tu0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics." title="The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics." 
srcset="https://substackcdn.com/image/fetch/$s_!tu0r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!tu0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce14a1c-8162-406e-a728-fd596c93af89_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: The evaluation harness pipeline: loading tasks, running agents, scoring results, and aggregating metrics.</em></figcaption></figure></div><p>Without a harness, you&#8217;re manually running your agent on test cases and eyeballing the output. With a harness, you run 500 test cases overnight and wake up to a report showing exactly which failure categories spiked [1].</p><p>The harness is separate from your evaluators. The evaluators decide what &#8220;good&#8221; means. The harness handles the boring work of running everything at scale and collecting results.</p><p>Popular harness options include <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> (what we use), Braintrust, LangSmith, and open-source frameworks like Promptfoo. But honestly, you can build a minimal harness in ~100 lines of Python if you need custom logic [1]. The hard part isn&#8217;t the infrastructure - it&#8217;s assembling the right context (system prompts, conversation history, retrieved docs, tools) for each task. The key is having one. Don&#8217;t manually run evals.</p><p>Now let&#8217;s talk about what those evaluators actually check.</p><h2>Dataset and Metric Types: Three Ways to Grade</h2><p>When designing an evaluator, you need to pick a grading strategy. 
There are three main approaches, each suited for different situations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2d8b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2d8b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2d8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation." title="Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation." srcset="https://substackcdn.com/image/fetch/$s_!2d8b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!2d8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F748f635b-6849-48a1-88fe-e67c64a46b50_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: Three grading strategies: direct scoring, pairwise comparison, and reference-based evaluation.</em></figcaption></figure></div><h3>1. Direct Scoring (Pointwise Evaluation)</h3><p>The evaluator looks at a single output and scores it in isolation. No comparison to anything else.</p><p><strong>Example:</strong></p><ul><li><p>Input: &#8220;Refund my order #12345&#8221;</p></li><li><p>Output: &#8220;I&#8217;ve processed your refund for order #12345. 
You&#8217;ll see the credit in 3-5 business days.&#8221;</p></li><li><p>Score: Pass (correctly identified the task, provided a timeline, professional tone)</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>You have clear, absolute quality criteria (was it helpful? was it safe? did it call the right tool?)</p></li><li><p>You want to track performance over time on the same dataset</p></li><li><p>Your baseline is &#8220;good enough,&#8221; not &#8220;better than X&#8221;</p></li></ul><p><strong>Metrics:</strong></p><ul><li><p>Binary pass/fail</p></li><li><p>0-1 scores (where 1 = perfect)</p></li><li><p>Classification labels (Helpful/Neutral/Harmful)</p></li></ul><h3>2. Pairwise Comparison</h3><p>The evaluator compares two outputs and picks which one is better.</p><p><strong>Example:</strong></p><ul><li><p>Input: &#8220;Refund my order #12345&#8221;</p></li><li><p>Output A: &#8220;Refund processed.&#8221;</p></li><li><p>Output B: &#8220;I&#8217;ve processed your refund for order #12345. You&#8217;ll see the credit in 3-5 business days.&#8221;</p></li><li><p>Winner: Output B (more informative, sets expectations)</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>Comparing two model versions (baseline vs. candidate)</p></li><li><p>A/B testing different prompts</p></li><li><p>Cases where you want to lean on the fact that LLMs are better at ranking than at absolute scoring</p></li></ul><p><strong>Watch out for biases [2]:</strong></p><ul><li><p><strong>Position bias</strong>: LLMs favor the first or last response shown</p></li><li><p><strong>Verbosity bias</strong>: LLMs prefer longer answers even when they&#8217;re not better</p></li><li><p><strong>Self-enhancement bias</strong>: LLMs favor their own outputs over those of other models</p></li></ul><p>You can reduce these effects by randomizing response order and running multiple trials.</p><h3>3. 
Reference-Based Evaluation</h3><p>The evaluator compares the output to a known &#8220;gold standard&#8221; answer.</p><p><strong>Example:</strong></p><ul><li><p>Input: &#8220;What&#8217;s the capital of France?&#8221;</p></li><li><p>Output: &#8220;Paris&#8221;</p></li><li><p>Reference: &#8220;Paris&#8221;</p></li><li><p>Score: Exact match (Pass)</p></li></ul><p><strong>Example 2 (Semantic equivalence):</strong></p><ul><li><p>Input: &#8220;Summarize the refund policy&#8221;</p></li><li><p>Output: &#8220;Customers can return items within 30 days for a full refund if unused.&#8221;</p></li><li><p>Reference: &#8220;Full refunds are available for unused products returned within 30 days of purchase.&#8221;</p></li><li><p>Score: Pass (different wording, same meaning)</p></li></ul><p><strong>When to use:</strong></p><ul><li><p>You have ground truth answers (FAQs, knowledge bases, structured tasks)</p></li><li><p>The task has a single correct answer or a small set of acceptable answers</p></li><li><p>You&#8217;re testing retrieval accuracy or factual correctness</p></li></ul><p><strong>How to measure:</strong></p><ul><li><p><strong>Exact match</strong>: For structured outputs (dates, product IDs, categorical values)</p></li><li><p><strong>Semantic similarity / LLM judges:</strong> For natural language, where multiple phrasings are valid (summaries, explanations, instructions)</p></li></ul><p><strong>Common metrics [3]:</strong></p><ul><li><p>Exact match</p></li><li><p>ROUGE (recall-oriented, good for summarization)</p></li><li><p>BLEU (precision-oriented, originally for translation)</p></li><li><p>BERTScore (semantic similarity using embeddings)</p></li><li><p>LLM judges (for nuanced semantic equivalence)</p></li></ul><p><strong>The trap:</strong> Exact match metrics penalize valid variations. If your reference says &#8220;The meeting is on Friday&#8221; and your agent says &#8220;The meeting is scheduled for this Friday,&#8221; exact match fails. 
This is where semantic similarity metrics (BERTScore) or LLM judges become powerful: they can recognize that different phrasings convey the same meaning.</p><h2>Model Evaluation vs. App Evaluation (Why Benchmarks Lie)</h2><p>Here&#8217;s a distinction that matters more than people realize:</p><p><strong>Model evaluation</strong> measures the LLM itself, in isolation, on generic tasks. This is what benchmarks like MMLU, HumanEval, and Chatbot Arena do.</p><p><strong>App evaluation</strong> measures your entire application (LLM + prompts + tools + retrieval + business logic) on your specific use case.</p><p>A high MMLU score doesn&#8217;t mean the model handles your refund policy correctly. Benchmarks test general capability. You need to test your specific use case.</p><h3>Model Evaluation (Benchmarks)</h3><p>Tests: &#8220;Can this LLM answer random trivia, write code snippets, or score high on standardized tests?&#8221;</p><p><strong>Useful for:</strong></p><ul><li><p>Comparing foundation models across the board</p></li><li><p>Understanding general capabilities</p></li><li><p>Academic research</p></li></ul><p><strong>Useless for:</strong></p><ul><li><p>Predicting whether it will handle your refund policy correctly</p></li><li><p>Knowing if it will escalate frustrated customers at the right time</p></li><li><p>Determining if it respects your company&#8217;s tone of voice</p></li></ul><h3>App Evaluation (What You Actually Need)</h3><p>Tests: &#8220;Does my customer support agent correctly process refunds, handle escalations, and follow our policies?&#8221;</p><p><strong>This is what matters</strong> because your users don&#8217;t care if GPT-5 scored 95% on MMLU. They care if it solved their problem.</p><p>Your evaluators must be grounded in your business use case, not generic academic benchmarks. 
This means:</p><ul><li><p>Testing against your actual policies, not Wikipedia facts</p></li><li><p>Using your real user queries, not synthetic textbook questions</p></li><li><p>Measuring outcomes that impact revenue, retention, or safety</p></li></ul><p>Benchmarks tell you which LLM is &#8220;generally smarter.&#8221; App evals tell you which version of your system works better for your users.</p><p>Don&#8217;t mistake one for the other.</p><h2>Components of an Evaluator</h2><p>Now that you know the types, let&#8217;s build one. Every evaluator has three components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b_Ph!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b_Ph!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three components of every evaluator: reference examples, metrics, and rubrics.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three components of every evaluator: reference examples, metrics, and rubrics." title="The three components of every evaluator: reference examples, metrics, and rubrics." 
srcset="https://substackcdn.com/image/fetch/$s_!b_Ph!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!b_Ph!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabad08fa-fbdb-45b7-9f1d-00ef87d12897_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: The three components of every evaluator: reference examples, metrics, and rubrics.</em></figcaption></figure></div><h3>1. Reference Examples (Few-Shot Prompts)</h3><p>These are the labeled examples from your dataset. They show the evaluator what &#8220;good&#8221; and &#8220;bad&#8221; look like for your specific task.</p><p>Remember from Article 2: the real power isn&#8217;t in the system prompt, it&#8217;s in these few-shot examples. They encode your domain expert&#8217;s judgment.</p><p><strong>Example:</strong></p><p><strong>Example 1 - PASS</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Input: &#8220;I need a refund for order #12345&#8221;
Output: &#8220;I&#8217;ve processed your refund. You&#8217;ll see the credit in 3-5 business days.&#8221;
Reason: Confirms action, sets timeline, professional tone.
</code></pre></div><p><strong>Example 2 - FAIL</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Input: &#8220;Can you waive the late fee on my account?&#8221;
Output: &#8220;I can help with that!&#8221;
Reason: Didn&#8217;t actually take action or explain next steps. Empty promise.
</code></pre></div><h3>2. Metrics</h3><p>The quantifiable measurement of quality. This can be:</p><ul><li><p><strong>Objective</strong>: Did it call the right tool? Is the JSON valid? Is the response under 200 words?</p></li><li><p><strong>Subjective</strong>: Was it helpful? Was the tone appropriate? Did it follow the conversation flow?</p></li></ul><p>For objective metrics, use code-based checks (fast, cheap, deterministic).</p><p>For subjective metrics, use LLM judges or human evaluation.</p><h3>3. Rubrics</h3><p>For subjective metrics, you need a rubric: explicit criteria that define what you&#8217;re measuring.</p><p><strong>Bad rubric:</strong><br><em>&#8220;Was the response helpful?&#8221;</em></p><p>(Too vague. Helpful how? To whom? Compared to what?)</p><p><strong>Good rubric:</strong><br><em>&#8220;Did the response: (1) correctly identify the user&#8217;s request, (2) provide a specific action or next step, (3) include a timeline or expectation, and (4) maintain professional tone?&#8221;</em></p><p>Rubrics force precision. They make subjective judgments repeatable. These criteria become part of your LLM judge&#8217;s system prompt.</p><h2>Code-Based Evaluators: Fast, Cheap, Objective</h2><p>Some checks are deterministic. Did the agent call <code>refund_order()</code>? Is the output valid JSON? Does it include a required disclaimer?</p><p>Use code for these. 
It&#8217;s faster, cheaper, and never gives you a different answer on the same input.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-byG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-byG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!-byG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-byG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content." title="Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content." srcset="https://substackcdn.com/image/fetch/$s_!-byG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!-byG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!-byG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31a1e6f1-d168-44f4-928c-1a7969ac85f3_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: Code-based evaluators check deterministic criteria: tool calls, format, required elements, and prohibited content.</em></figcaption></figure></div><p><strong>Use code-based evaluators for:</strong></p><ul><li><p><strong>Tool calls</strong>: Did it call <code>refund_order()</code> with the right parameters?</p></li><li><p><strong>Format checks</strong>: Is the output valid JSON? Is it under the character limit?</p></li><li><p><strong>Required elements</strong>: Does it include a disclaimer? 
Does it have a timestamp?</p></li><li><p><strong>Prohibited content</strong>: Does it contain banned phrases or leaked data?</p></li></ul><p><strong>Example (pseudocode):</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def evaluate_refund_agent(trace):
  # Check if right tool was called
  if "refund_order" not in trace.tool_calls:
    return {"pass": False, "reason": "Didn't call refund_order"}

  # Check if order_id parameter was provided
  params = trace.tool_calls["refund_order"].parameters
  if "order_id" not in params:
    return {"pass": False, "reason": "Missing order_id parameter"}

  # Check if response includes timeline
  if not any(word in trace.output.lower() for word in ["days", "week", "timeline"]):
    return {"pass": False, "reason": "No timeline provided to customer"}
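
  # Hypothetical extra check (not part of the original example): prohibited content.
  # The phrase list is an assumed placeholder; swap in your own banned phrases.
  banned_phrases = ["guaranteed approval", "internal use only"]
  if any(phrase in trace.output.lower() for phrase in banned_phrases):
    return {"pass": False, "reason": "Response contains a prohibited phrase"}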

  return {"pass": True, "reason": "All checks passed"}</code></pre></div><p>Code-based evaluators are:</p><ul><li><p><strong>Fast</strong>: Milliseconds per check</p></li><li><p><strong>Cheap</strong>: No API costs</p></li><li><p><strong>Reproducible</strong>: Same input always gives same result</p></li><li><p><strong>Easy to debug</strong>: When they fail, you know exactly what broke</p></li></ul><p>But they can&#8217;t handle nuance. They can&#8217;t judge tone, helpfulness, or conversational flow. For that, you need LLM judges.</p><p>These code-based evaluators work just like the classic unit tests you&#8217;re already familiar with. That&#8217;s why you should always try to implement code-based checks first before reaching for LLM judges. If you can check it with code, do that. Only use LLM judges when code can&#8217;t capture what you need to measure.</p><h2>LLM Judges: Flexible, Scalable, Nuanced</h2><p>An <strong>LLM judge</strong> is an LLM that grades another LLM&#8217;s output. 
You give it the task, the output, and the evaluation criteria, and it returns a score with reasoning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L6EX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L6EX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L6EX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;LLM judge flow: input context and criteria produce a score with reasoning.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LLM judge flow: input context and criteria produce a score with reasoning." title="LLM judge flow: input context and criteria produce a score with reasoning." srcset="https://substackcdn.com/image/fetch/$s_!L6EX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!L6EX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c0a5b9-99aa-427e-899f-a2e463267ec7_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 6: LLM judge flow: input context and criteria produce a score with reasoning.</em></figcaption></figure></div><p>LLM judges work in two modes: evaluating outputs against absolute criteria (is it helpful? professional? accurate?) or comparing outputs to reference answers when you have ground truth but need semantic understanding rather than exact string matching.</p><p><strong>Use LLM judges for:</strong></p><ul><li><p><strong>Tone</strong>: Was it empathetic? Professional? 
Not condescending?</p></li><li><p><strong>Helpfulness</strong>: Did it actually answer the question or deflect?</p></li><li><p><strong>Conversation flow</strong>: Did it maintain context across turns?</p></li><li><p><strong>Reasoning quality</strong>: Did the agent&#8217;s plan make sense?</p></li></ul><p><strong>How it works:</strong></p><ol><li><p>You provide:</p><ul><li><p>The input (user query)</p></li><li><p>The output (agent&#8217;s response)</p></li><li><p>The context (system prompt, retrieved docs, conversation history)</p></li><li><p>Evaluation criteria (what you&#8217;re checking for)</p></li><li><p>Few-shot examples (labeled passes and fails)</p></li></ul></li><li><p>The LLM judge outputs:</p><ul><li><p>A score (pass/fail or 0-1 scale)</p></li><li><p>A critique explaining why</p></li></ul></li></ol><p><strong>Example prompt (simplified):</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">You are evaluating customer support responses. For each trace, output Pass or Fail 
with reasoning.

Evaluation criteria:
1. Did the response correctly identify the customer&#8217;s request?
2. Did it provide a specific action or next step?
3. Did it include a timeline or expectation?
4. Did it maintain a professional tone?

Here are examples of how a domain expert judged similar cases:

[Few-shot examples from your labeled dataset]

Now evaluate this trace:
Input: [customer query]
Output: [agent response]
Context: [system prompt, policies]</code></pre></div><p>The judge generates: </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;markdown&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-markdown">FAIL

The response correctly identified the refund request (criterion 1: pass) and 
maintained professional tone (criterion 4: pass). However, it didn&#8217;t specify a next 
step beyond &#8220;we&#8217;ll look into this&#8221; (criterion 2: fail) and provided no timeline 
(criterion 3: fail). Customer is left waiting with no expectations set.</code></pre></div><h3>Strengths of LLM Judges</h3><ul><li><p><strong>Flexible</strong>: Handle open-ended tasks where code can&#8217;t</p></li><li><p><strong>Scalable</strong>: Grade thousands of traces automatically</p></li><li><p><strong>Explainable</strong>: Critiques show reasoning, helping debug failures</p></li></ul><h3>Weaknesses of LLM Judges</h3><ul><li><p><strong>Non-deterministic</strong>: Same input might get different scores across runs</p></li><li><p><strong>Expensive</strong>: Every evaluation is an API call</p></li><li><p><strong>Needs calibration</strong>: Must align with human judgment (we cover this in Article 5)</p></li></ul><h3>Making LLM Judges More Stable</h3><ol><li><p><strong>Use the most capable model</strong> (e.g., Claude Opus, GPT-4o) (4)</p></li><li><p><strong>Add chain-of-thought reasoning</strong> before scoring (&#8220;Let&#8217;s think step-by-step...&#8221;)</p></li><li><p><strong>Control for verbosity bias</strong> (normalize response lengths)</p></li><li><p><strong>Run multiple trials</strong> and average scores for critical evals</p></li><li><p><strong>Increase dataset size</strong> to at least 50-100 samples (reduces noise)</p></li></ol><h2>Common Mistakes (And How to Avoid Them)</h2><h3>Mistake 1: Not Providing Critiques</h3><p><strong>Wrong:</strong><br>Score: 1</p><p><strong>Right:</strong><br>Score: 1</p><p>Critique: <em>&#8220;Response correctly identified the refund request but didn&#8217;t provide a timeline. Customer left without expectations.&#8221;</em></p><p>Critiques are not optional. They&#8217;re how you debug failures and train better evaluators.</p><h3>Mistake 2: Overly Terse Critiques</h3><p><strong>Wrong:</strong><br>&#8220;Bad tone&#8221;</p><p><strong>Right:</strong><br><em>&#8220;Response used dismissive language (&#8216;just wait&#8217;) when customer expressed frustration about a delayed order. 
Should have acknowledged frustration and provided specific next steps.&#8221;</em></p><p>The critique should be detailed enough to serve as a few-shot example later.</p><h3>Mistake 3: Missing Context</h3><p>Don&#8217;t evaluate the output in isolation. Give the evaluator everything a human would see:</p><ul><li><p>The full conversation history (for multi-turn tasks)</p></li><li><p>Retrieved documents (for RAG)</p></li><li><p>System prompts (for understanding constraints)</p></li><li><p>Tool call results (for agentic workflows)</p></li></ul><p>If a human needs it to judge quality, the evaluator needs it too.</p><h3>Mistake 4: Not Providing Diverse Examples</h3><p>If all your few-shot examples are &#8220;customer angry, agent apologizes,&#8221; the judge won&#8217;t know how to handle &#8220;customer confused, needs technical explanation.&#8221;</p><p>Cover the failure modes you actually see in production.</p><h3>Mistake 5: Using Ready-Made Metrics Without Validation</h3><p>ROUGE, BLEU, BERTScore, etc. sound professional, but they might not correlate with your actual goal.</p><p>Before using any metric, validate it against human judgment on your specific task. If high ROUGE doesn&#8217;t mean &#8220;users are happy,&#8221; don&#8217;t optimize for ROUGE.</p><h3>Mistake 6: Using 1-5 Scales Instead of Binary Pass/Fail</h3><p><strong>Wrong:</strong><br>Score: 3.2 out of 5</p><p><strong>Right:</strong><br>Score: 0 (Fail)<br>Critique: <em>&#8220;Response didn&#8217;t provide a timeline or next steps.&#8221;</em></p><p>Why it matters: A score of 3.2 is ambiguous. Is that good enough to ship? Should you fix it? Binary forces clarity. Either it passes your quality bar or it doesn&#8217;t. 
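</p><p>One way to enforce this in code is to reject anything that is not a binary verdict with a critique. The following is a minimal sketch, assuming the judge writes Pass or Fail on the first line and the critique after it, as in the example judge output earlier. The names Verdict and parse_verdict are illustrative and do not come from any library.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """A binary judge result: no 1-5 scale, no float score."""
    passed: bool
    critique: str


def parse_verdict(raw):
    """Parse judge text whose first line is Pass or Fail, followed by a critique.

    Assumed output format (hypothetical, modeled on the example above):
    a label line, then free-text reasoning.
    """
    label, _, critique = raw.strip().partition("\n")
    label = label.strip().lower()
    if label not in ("pass", "fail"):
        # Numeric scores such as "3.2 out of 5" are rejected here.
        raise ValueError("expected Pass or Fail on the first line, got: " + repr(label))
    critique = critique.strip()
    if not critique:
        # A verdict without a critique cannot be debugged (Mistake 1).
        raise ValueError("judge returned a verdict without a critique")
    return Verdict(passed=(label == "pass"), critique=critique)


verdict = parse_verdict("FAIL\n\nNo next step and no timeline were given.")
print(verdict.passed, "-", verdict.critique)</code></pre></div><p>Failing loudly on malformed judge output keeps numeric scores and bare labels out of your eval results instead of letting them slip through silently.</p><p>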
Scoring on a float scale (0.0-1.0) has the same problem: it leaves room for interpretation instead of forcing a clear decision.</p><h2>When Should I Use Similarity Metrics (BERTScore, ROUGE, etc.)?</h2><p>Short answer: <strong>Only for specific, narrow tasks where similarity to a reference actually matters.</strong></p><h3>When They Work</h3><p><strong>Summarization:</strong> ROUGE measures n-gram overlap between the generated summary and a reference summary. If your task is &#8220;don&#8217;t miss key facts,&#8221; ROUGE helps.</p><p><strong>Translation:</strong> BLEU checks n-gram overlap with reference translations. Works when there&#8217;s a narrow acceptable output space.</p><p><strong>Retrieval accuracy:</strong> BERTScore compares semantic similarity between retrieved chunks and expected documents.</p><h3>When They Fail</h3><p><strong>Open-ended generation:</strong> Your AI agent says &#8220;I&#8217;ve refunded order #12345. You&#8217;ll see the credit in 3-5 days.&#8221; Reference says &#8220;Refund processed for order #12345, expect 3-5 business days.&#8221; Different words, same meaning. 
ROUGE, which counts overlapping words, scores the pair low.</p><p><strong>Tone and helpfulness:</strong> Similarity metrics don&#8217;t measure if the tone was appropriate or if it actually helped the user.</p><p><strong>Business outcomes:</strong> High similarity doesn&#8217;t mean the customer is satisfied, the sale closed, or the task completed.</p><h3>The Rule</h3><p>If your success criterion is &#8220;output should be semantically similar to the reference answer,&#8221; use similarity metrics.</p><p>If your success criterion is <em>&#8220;user achieved their goal,&#8221;</em> use app-level evaluators grounded in outcomes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ovy3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ovy3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Decision tree for choosing the right evaluator type.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Decision tree for choosing the right evaluator type." title="Decision tree for choosing the right evaluator type." 
srcset="https://substackcdn.com/image/fetch/$s_!Ovy3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!Ovy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd045ca-c3dc-46b7-a942-0ceec8fe523c_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 7: Decision tree for choosing the right evaluator type.</em></figcaption></figure></div><h2>Advanced Metric Designs</h2><p>Now let&#8217;s handle the hard cases: multi-turn conversations, complex workflows, and agentic systems.</p><h3>Evaluating Multi-Turn Conversation Traces</h3><p>A single-turn eval checks one input and one output. Multi-turn evals check entire conversations.</p><p><strong>Challenges:</strong></p><ul><li><p>Context must carry across turns</p></li><li><p>Errors compound (one bad response derails the rest)</p></li><li><p>You need to catch the <strong>first upstream failure</strong>, not downstream symptoms</p></li></ul><p><strong>Strategy:</strong></p><ol><li><p><strong>End-to-end task success</strong>: Did the agent accomplish the user&#8217;s goal by the end?</p></li><li><p><strong>Turn-by-turn checks</strong>: Evaluate each exchange individually</p><ul><li><p>Did turn 3 maintain context from turn 1?</p></li><li><p>Did turn 5 escalate when the user got frustrated?</p></li></ul></li><li><p><strong>Failure attribution</strong>: When something breaks, find the first turn where it went wrong</p></li></ol><p><strong>Example (customer support conversation):</strong></p><p><strong>Turn 1:</strong></p><p>User: <em>&#8220;I need to return order #12345&#8221;</em></p><p>Agent: <em>&#8220;Sure, I can help with that. What&#8217;s the reason for the return?&#8221;</em></p><p>Eval: Pass (acknowledged request, asked clarifying question)</p><p><strong>Turn 2:</strong></p><p>User: <em>&#8220;It arrived damaged&#8221;</em></p><p>Agent: <em>&#8220;I&#8217;ll process a refund. 
Expect 3-5 business days.&#8221;</em></p><p>Eval: Fail (skipped a required step: didn&#8217;t offer replacement or ask for photos of damage)</p><p><strong>Turn 3:</strong></p><p>User: <em>&#8220;Do I need to ship it back?&#8221;</em></p><p>Agent: <em>&#8220;No, keep it.&#8221;</em></p><p>Eval: Pass (not counted as a new failure, because the workflow already failed at Turn 2)</p><p>The <strong>first upstream failure</strong> is Turn 2. Everything after is a consequence.</p><p><strong>Important</strong>: When evaluating any turn, provide all previous turns as context. Evaluating Turn 2? Include Turn 1. Evaluating Turn 3? Include Turns 1 and 2. The evaluator needs the full conversation history to judge whether context was properly maintained.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F9xg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F9xg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F9xg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F9xg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-turn conversation evaluation with first upstream failure attribution.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-turn conversation evaluation with first upstream failure attribution." title="Multi-turn conversation evaluation with first upstream failure attribution." 
srcset="https://substackcdn.com/image/fetch/$s_!F9xg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!F9xg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec05adaa-5103-4052-aadb-2df1e3cc201e_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 8: Multi-turn conversation evaluation with first upstream failure attribution.</em></figcaption></figure></div><h3>Evaluating Complex Multi-Step Workflows</h3><p>Workflows have dependencies. Step 3 can&#8217;t succeed if Step 1 failed. Your evaluator needs to know this.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xiHl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xiHl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xiHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png" width="1200" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evaluating complex multi-step workflows with dependency-aware sequencing.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evaluating complex multi-step workflows with dependency-aware sequencing." title="Evaluating complex multi-step workflows with dependency-aware sequencing." 
srcset="https://substackcdn.com/image/fetch/$s_!xiHl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!xiHl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd8fc68-b010-4c85-bdef-89708a22c188_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 9: Evaluating complex multi-step workflows with dependency-aware sequencing.</em></figcaption></figure></div><p><strong>Example (flight booking agent):</strong></p><p>Required sequence:</p><ol><li><p>Search flights</p></li><li><p>Validate availability</p></li><li><p>Confirm payment</p></li><li><p>Book reservation</p></li></ol><p><strong>Bad eval:</strong> Check if all steps ran (yes/no)</p><p><strong>Good eval:</strong> Check if steps ran in the right order, with correct dependencies</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2pwF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2pwF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 424w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 848w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2pwF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2pwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png" width="1456" height="778" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;code&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="code" title="code" srcset="https://substackcdn.com/image/fetch/$s_!2pwF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 424w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 848w, https://substackcdn.com/image/fetch/$s_!2pwF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2pwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc968b3ad-0e3f-4b47-928b-4578eb3a155b_2969x1587.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Evaluating Agentic Workflows</h3><p>Agents don&#8217;t follow fixed scripts. They plan, reason, and adapt. 
This makes evaluation harder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TV0w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TV0w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TV0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics." title="Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics." srcset="https://substackcdn.com/image/fetch/$s_!TV0w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!TV0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92400711-7db8-4161-b306-6fad56d41a5a_1200x630.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption"><em>Image 10: Two-phase agentic workflow evaluation: end-to-end success followed by step-level diagnostics.</em></figcaption></figure></div><p><strong>Two-phase approach</strong> (from Hamel Husain) (5):</p><h3>Phase 1: End-to-End Task Success</h3><p>Treat the agent as a black box. Did it meet the user&#8217;s goal?</p><p><strong>Define precise success rules per task:</strong></p><ul><li><p>Exact answer match (for factual tasks)</p></li><li><p>Correct side-effect (database updated, email sent, file created)</p></li><li><p>User satisfaction (thumbs up, complaint rate, retry rate)</p></li></ul><p>Use human judges or well-aligned LLM judges. 
<strong>Focus on the first upstream failure</strong> in each trace during error analysis.</p><h3>Phase 2: Step-Level Diagnostics</h3><p>Once you know which workflows fail, diagnose why.</p><p>Assuming you&#8217;ve instrumented your system to log tool calls and responses, score:</p><ol><li><p><strong>Tool choice</strong>: Was the selected tool appropriate?</p></li><li><p><strong>Parameter extraction</strong>: Were inputs complete and well-formed?</p></li><li><p><strong>Error handling</strong>: Did it recover from empty results or API failures?</p></li><li><p><strong>Context retention</strong>: Did it preserve earlier constraints?</p></li><li><p><strong>Plan quality</strong>: Does the agent&#8217;s plan match the task requirements?</p></li></ol><p><strong>Transition matrix analysis</strong> (Bryan Bischof&#8217;s approach):</p><p>Track which state transitions cause failures.</p><p>Example (text-to-SQL agent):</p><ul><li><p>GenSQL &#8594; ExecSQL: 12 failures</p></li><li><p>DecideTool &#8594; PlanCal: 2 failures</p></li></ul><p>This data-driven view shows where to focus debugging.</p><p><strong>Session-level metrics:</strong></p><ul><li><p>Task completion rate</p></li><li><p>Step completion (did it finish the required steps?)</p></li><li><p>Trajectory quality (did it avoid loops?)</p></li><li><p>Self-aware failures (did it acknowledge limitations?)</p></li></ul><p><strong>Node-level metrics (per tool call):</strong></p><ul><li><p>Tool correctness (right tool with right parameters?)</p></li><li><p>Tool call accuracy (did the tool run without errors?)</p></li><li><p>Output correctness (did the tool return valid results?)</p></li></ul><p><strong>System efficiency metrics:</strong></p><ul><li><p>Latency (time to complete task)</p></li><li><p>Token usage (cost per task)</p></li><li><p>Tool calls per task (efficiency of plan)</p></li></ul><p>These metrics layer on top of each other (6). System efficiency ensures scalability. Session-level metrics validate goal achievement. 
Node-level metrics pinpoint root causes.</p><h2>Bringing It All Together</h2><p>Pick evaluators based on what you&#8217;re actually trying to measure, not what sounds impressive. Here&#8217;s how to decide which evaluator to use:</p><p><strong>Can you check it with code?</strong></p><p>Yes &#8594; Use code-based evaluators (tool calls, format checks, required elements)</p><p>No &#8594; Move to the next question</p><p><strong>Is there a single correct answer or a narrow acceptable range?</strong></p><p>Yes &#8594; Use reference-based evaluation (exact match, ROUGE, BLEU)</p><p>No &#8594; Move to the next question</p><p><strong>Are you comparing two versions?</strong></p><p>Yes &#8594; Use pairwise comparison</p><p>No &#8594; Use direct scoring</p><p><strong>Is the task subjective (tone, helpfulness, flow)?</strong></p><p>Yes &#8594; Use LLM judges with rubrics and few-shot examples</p><p>No &#8594; Rethink your criteria (you might have missed a code-based check)</p><p><strong>Is it a multi-turn or agentic workflow?</strong></p><p>Yes &#8594; Use the two-phase approach (end-to-end task success + step-level diagnostics)</p><p>No &#8594; Use single-turn direct scoring</p><p>And remember: <strong>your evaluators are only as good as your dataset and few-shot examples</strong>. The system prompt matters less than you think. The examples matter more than you think.</p><h2>Next Steps</h2><p>You now know how to design evaluators that match your use case. You know when to use code, when to use LLMs, and when to combine both.</p><p>But here&#8217;s the critical question we haven&#8217;t answered: <strong>How do you know if your evaluator is actually working?</strong></p><p>An evaluator that says everything is great when it&#8217;s not is worse than no evaluator at all. 
You need to validate that your automated judges align with human judgment before you trust them.</p><p>That&#8217;s what we&#8217;ll cover in <a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">Article 5: How to Evaluate the Effectiveness of the Evaluator</a>.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. <strong>Here&#8217;s the full series:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><strong>How to Design Evaluators</strong> &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://substack.com/@paoloap">Paolo Perrone</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures/comments"><span>Leave a comment</span></a></p><div><hr></div><h3>Most AI newsletters give you news. The AI Engineer gives you understanding.</h3><p>One concept per week, explained from first principles: when to fine-tune vs. prompt vs. RAG, which vector database fits your workload, and how companies like DoorDash ship AI at scale.</p><p><em>Written for senior engineers and tech leads who build with AI, not just read about it.</em></p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:6800638,&quot;name&quot;:&quot;The AI Engineer&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!sXyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F598ebb57-14dc-4faa-9dd1-08d4f2499564_512x512.png&quot;,&quot;base_url&quot;:&quot;https://theaiengineer.substack.com&quot;,&quot;hero_text&quot;:&quot;Where software engineers become dangerously good AI engineers.\n\n&quot;,&quot;author_name&quot;:&quot;Paolo Perrone&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#ffffff&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://theaiengineer.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" 
src="https://substackcdn.com/image/fetch/$s_!sXyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F598ebb57-14dc-4faa-9dd1-08d4f2499564_512x512.png" width="56" height="56" style="background-color: rgb(255, 255, 255);"><span class="embedded-publication-name">The AI Engineer</span><div class="embedded-publication-hero-text">Where software engineers become dangerously good AI engineers.

</div><div class="embedded-publication-author-name">By Paolo Perrone</div></a><form class="embedded-publication-subscribe" method="GET" action="https://theaiengineer.substack.com/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you've learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, 
https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for 
free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Anthropic. (n.d.). Demystifying evals for AI agents. <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li><li><p>Evidently AI. (n.d.). LLM-as-a-judge: a complete guide. <a href="https://www.evidentlyai.com/llm-guide/llm-as-a-judge">https://www.evidentlyai.com/llm-guide/llm-as-a-judge</a></p></li><li><p>Evidently AI. (n.d.). LLM evaluation metrics and methods. <a href="https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics">https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics</a></p></li><li><p>OpenAI. (n.d.). Evaluation best practices. <a href="https://developers.openai.com/api/docs/guides/evaluation-best-practices">https://developers.openai.com/api/docs/guides/evaluation-best-practices</a></p></li><li><p>Husain, H. 
(n.d.). How do I evaluate agentic workflows? <a href="https://hamelhusain.substack.com/p/how-do-i-evaluate-agentic-workflows">https://hamelhusain.substack.com/p/how-do-i-evaluate-agentic-workflows</a></p></li><li><p>Maxim. (n.d.). Evaluating agentic workflows: The essential metrics that matter. <a href="https://www.getmaxim.ai/articles/evaluating-agentic-workflows-the-essential-metrics-that-matter">https://www.getmaxim.ai/articles/evaluating-agentic-workflows-the-essential-metrics-that-matter</a></p></li><li><p>Confident AI. (n.d.). LLM evaluation metrics: Everything you need for LLM evaluation. <a href="https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation">https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Start Here: Your Map to Decoding AI]]></title><description><![CDATA[The AI Engineering Command Center]]></description><link>https://www.decodingai.com/p/ai-engineering-roadmaps-courses-and-books</link><guid isPermaLink="false">https://www.decodingai.com/p/ai-engineering-roadmaps-courses-and-books</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Sat, 28 Feb 2026 09:33:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a2dbd01-1caf-40f0-b912-52c22d96c533_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many of you have mentioned that as the magazine grows, finding the right architectural deep dive is becoming harder than the engineering itself. I want you building, not digging through archives.</p><p>This page is your <strong>Command Center</strong>. 
A clear map to the blueprints you need to move past "fancy demos" and ship production-grade AI.</p><p><em>Here&#8217;s how to find what you need</em> &#8595;</p><div><hr></div><div><hr></div><h1>&#128205;Step 1: The Decoding AI Roadmaps</h1><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oMuA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oMuA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 424w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 848w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1272w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oMuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png" width="1200" height="230" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:230,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117183,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/189115362?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oMuA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 424w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 848w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1272w, https://substackcdn.com/image/fetch/$s_!oMuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe94508b7-a266-4d57-ba49-dd0cc848fdb5_1200x230.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you&#8217;re here for the specific architectures and mental models, start by exploring my past work. 
To keep things tidy, I&#8217;ve moved the full master index to a dedicated page where you can navigate the magazine at your own pace.</p><p>You can filter the entire archive by:</p><ul><li><p><strong>Level:</strong> Beginner, Intermediate, Advanced.</p></li><li><p><strong>Collections:</strong> Foundations, Case Studies, Projects.</p></li><li><p><strong>Series:</strong> Per Topic End-to-End Blueprints.</p></li></ul><p><strong><a href="https://www.decodingai.com/p/ai-engineering-roadmaps">Explore the Roadmaps &#8594;</a></strong> </p><h1>&#128205;Step 2:  The Resource Library</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MOKD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MOKD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 424w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 848w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1272w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MOKD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png" width="1200" height="628" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:378302,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/189115362?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MOKD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 424w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 848w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1272w, https://substackcdn.com/image/fetch/$s_!MOKD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5626e5a-09d7-4f82-a98f-e266e2bbbd76_1200x628.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you&#8217;re looking to go deeper with more structured guides, here&#8217;s where to look next. 
While the weekly content is great for staying sharp, if you're ready to build a complete system from scratch without piecing together different articles, I&#8217;ve compiled the best of what I know into a few digital products.</p><p>Unlike the weekly posts, these include <strong>full codebases, video walkthroughs, and Q&amp;A support</strong> to help you go from a blank IDE to a deployed system.</p><ul><li><p><strong><a href="https://www.amazon.com/LLM-Engineers-Handbook-engineering-production/dp/1836200072/">The LLM Engineer&#8217;s Handbook:</a></strong> A framework for building LLM and RAG apps. </p></li><li><p><strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering Course:</a> </strong>The end-to-end blueprint for designing, testing, and deploying autonomous agents.</p></li><li><p><strong><a href="https://www.pauliusztin.ai/courses">Full Course Catalog:</a> </strong>Write real code. Ship AI that actually works.</p></li></ul><p><strong>Not sure what to pick? </strong>I also have a <strong><a href="https://email-course.towardsai.net/?ref=b3ab31">6-day free email course</a></strong> on the critical design mistakes that silently break agentic systems. 
It boils down 2+ years of production experience into a simple mental model for building reliable agents that actually scale.</p><h1>&#128233; Every Tuesday</h1><p>You&#8217;ll get one actionable project, case study, or concept deep dive focused on the reality of shipping AI.</p><ul><li><p><strong>Real-world:</strong> No bedtime stories, just hands-on content.</p></li><li><p><strong>Time-efficient:</strong> One free actionable tip in less than 8 minutes.</p></li><li><p><strong>Future-proof:</strong> Skills that will thrive in a future dominated by AI coding tools.</p></li></ul><p><strong><a href="https://www.decodingai.com/">Check the latest insights &#8594; </a></strong></p><div><hr></div><h1>&#128172; Keep in Touch</h1><p>I&#8217;m building and sharing what works, and what doesn&#8217;t, every week. If you want to see the <em>"work in progress"</em> or the journey behind these systems, I also post here:</p><p><a href="http://linkedin.com/company/decodingai-magazine">LinkedIn</a> <strong>|</strong> <a href="https://x.com/pauliusztin_">X</a> <strong>|</strong> <a href="https://github.com/decodingai-magazine">GitHub</a> <strong>| </strong><a href="https://www.pauliusztin.ai/">pauliusztin.ai</a></p><p>Happy learning, <br><a href="https://substack.com/@pauliusztin">Paul Iusztin</a></p><div><hr></div><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[I Spent 9 Months Building an Agentic AI Engineering Course]]></title><description><![CDATA[Google is already recommending it alongside Coursera, DeepLearning.AI and Oxford.]]></description><link>https://www.decodingai.com/p/agentic-ai-engineering-course</link><guid isPermaLink="false">https://www.decodingai.com/p/agentic-ai-engineering-course</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 26 Feb 2026 12:00:27 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!6Qcm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most AI agent courses teach you toy examples. Build a chatbot, call an API, done. But when you try to build something real, something that handles research, generates structured content, orchestrates multiple tools, and actually works in production, you realize those tutorials left out everything that matters. Agentic AI is an engineering discipline, not a prompting exercise.</p><p>That gap is exactly why I spent the last 9 months building an <strong><a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a></strong> with Towards AI. And here is what makes it different: we didn&#8217;t just teach how to build agents. We built two production AI systems, used them daily, and wrote the course with them.</p><p>Google and Gemini are already recommending it alongside courses from Coursera and DeepLearning.AI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RBtZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RBtZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!RBtZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg" width="598" height="608.8974943052392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:878,&quot;resizeWidth&quot;:598,&quot;bytes&quot;:172075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab6c56e9-a4a4-4a38-9396-c157482464fd_878x894.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!RBtZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RBtZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0ea13d6-eb52-4ae6-a7bc-29622bd71cfc_878x894.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>How This Course Was Built</h2><p>Back in January 2025, Louis-Fran&#231;ois Bouchard (Co-Founder at Towards AI) reached out to me about creating a course on Agentic AI Engineering. I deeply respected Louis&#8217;s work in the AI space. So I said yes.</p><p>By April 2025, we had a team of five and one non-negotiable rule: we would only teach something we actually use ourselves. No toy examples. No throwaway demos.</p><p>We settled on an ambitious idea: a deep research agent and a writing workflow specialized in generating high-quality lessons and articles with text, code, images, diagrams, and references. We called them Nova and Brown.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sbHi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sbHi!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 848w, 
https://substackcdn.com/image/fetch/$s_!sbHi!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1272w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sbHi!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sbHi!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 424w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 848w, https://substackcdn.com/image/fetch/$s_!sbHi!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!sbHi!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43a2a61-f109-49ab-995a-44fa8919e03d_1200x1200.gif 1456w" sizes="100vw"></picture></div></a></figure></div><p>The twist: we used Nova and Brown to write the course itself. Every lesson went through the same AI system we were teaching students to build. If something broke, we fixed it. Not for a demo, but because we needed it to work. 
That pressure forced us to build something production-ready, not just classroom-ready.</p><p>Nova and Brown are two MCP servers that can be orchestrated within a multi-agent system through Cursor, Claude Code, or any custom orchestrator. We created an AI system that writes about itself.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><h2>What You Get</h2><p>34 lessons that take you from foundations to deploying your own agent through articles, videos, and hands-on Notebooks. You will learn tool calling, ReAct loops, context engineering, structured generation, memory systems, RAG, planning and reasoning architectures, human-in-the-loop feedback, and CI/CD deployment:</p><ul><li><p><strong>Self-paced with monthly live kick-off sessions</strong> so you can go at your own speed without losing momentum.</p></li><li><p><strong>4 parts:</strong> Foundations (multiple smaller projects), two end-to-end complex projects, LLMOps (evaluation, observability, auth, deployment), and a final capstone project you implement yourself.</p></li><li><p><strong>Real code, not notebook-only demos.</strong> The teaching happens through Notebooks, but the code is structured as two Python modules (Nova and Brown). You import from the modules into Notebooks for a structured learning experience.</p></li><li><p><strong>Fundamentals over frameworks.</strong> We wrote as much as possible from scratch because tools change constantly. The course focuses on design principles and patterns you can replicate in any tool. 
Key tools used: LangGraph, LangChain, Gemini, FastMCP, Cursor/Claude Code, Opik, Perplexity, and GCP.</p></li><li><p><strong>Discord community</strong> with Q&amp;A support and a completion certificate.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg" width="721" height="405.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:721,&quot;bytes&quot;:190048,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Qcm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Qcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30d33e91-68c2-4013-8fe5-556ef835f3ac_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Who Is This For?</h3><p>Engineers who want to go deep on AI agents, not skim the surface. If you are a software engineer, ML engineer, or data scientist who has played with LLMs but never built a multi-step agent that actually works in production, this is for you.</p><p>You should be comfortable with Python, have basic familiarity with LLMs, Docker, and cloud. 
And above all: a builder mindset.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Early-bird pricing: <strong>$449 for lifetime access</strong> &#8212; limited to the first 100 seats!</em></p><p><strong>&#128161;</strong><em><strong> Not sure yet?</strong> We <a href="https://github.com/towardsai/agentic-ai-engineering-course/tree/main">open-sourced the code on GitHub</a> and made the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">first 6 lessons free</a>.</em></p><h2>What Students Are Saying</h2><p>We sold 150 pre-release slots to build the course with a real audience. The result: 25 five-star reviews. Not from our own biased impression, but from students who went through the material. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RrcJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RrcJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 424w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 848w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1272w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png" width="1314" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79774,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RrcJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 424w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 848w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1272w, https://substackcdn.com/image/fetch/$s_!RrcJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1005b1c-02f9-4079-95aa-c2b04dd9f991_1314x342.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>As one reviewer put it: &#8220;goes far beyond theory, providing deep, practical experience&#8221; with real-world constraints rather than flashy demos.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AVPK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AVPK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 424w, 
https://substackcdn.com/image/fetch/$s_!AVPK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 848w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AVPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png" width="721" height="738.3574074074074" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1106,&quot;width&quot;:1080,&quot;resizeWidth&quot;:721,&quot;bytes&quot;:761690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97c248ba-e497-4c1e-b0ca-f260707c043d_1080x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!AVPK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 424w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 848w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!AVPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F704217e2-2c10-4dab-8090-619fcd9d0600_1080x1106.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Sean Myers, Principal Analyst at Columbia, already earned the first completion certificate:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6MwH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6MwH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 424w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 848w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1272w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6MwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png" width="725" height="512.3798076923077" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:725,&quot;bytes&quot;:1297320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/188718296?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6MwH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 424w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 848w, https://substackcdn.com/image/fetch/$s_!6MwH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6MwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb39dba71-2646-4932-9698-e0b3473db50f_2658x1878.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here you can learn more:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" 
href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p></p><h2>Paid Subscribers</h2><p>For paid subscribers, we are offering <strong>20% off.</strong> For the discount code, DM me on Substack or comment on this post.</p><p>We will soon create a paid subscribers&#8217; perks page with more offers. But for now, let&#8217;s keep it simple.</p><p>Looking forward to your feedback on the course and seeing you next Tuesday!</p><p>Paul</p>]]></content:encoded></item><item><title><![CDATA[Generate Synthetic Datasets for AI Evals]]></title><description><![CDATA[5 strategies from cold start to 450 diverse inputs in minutes]]></description><link>https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals</link><guid isPermaLink="false">https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 24 Feb 2026 12:02:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EZyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a><strong> </strong></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a> </p></li><li><p><strong>Generate Synthetic Datasets for AI Evals</strong> &#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h1>Generate Synthetic Datasets for AI Evals</h1><p>In the <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">previous article</a>, you learned how to iteratively build an evals dataset using the error analysis framework. You started from production traces, labeled them, fixed errors, and grew your dataset over time. <strong>But what if you lack production traces?</strong></p><p>What if your production data only covers a fraction of the features, personas, and edge cases your app supports? 
Synthetic data fills the gaps that production data alone cannot cover.</p><p>When I was building Nova, the deep research agent for my <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>, I hit this exact wall. I built an evaluation layer with binary metrics across dozens of dimensions. The metrics were solid, and the LLM judge was calibrated.</p><p>But then I needed test data. The agent had no real users, no generated articles, and no traces. I started by manually writing test inputs. After a painful week, I had maybe 15 examples.</p><p>They all reflected my own biases. I was testing the same happy path over and over. Entire categories of failure modes went completely untested.</p><p>The most time-consuming bottleneck wasn&#8217;t building the judge. It was generating enough diverse, realistic test inputs. That experience taught me that structured synthetic data generation unlocks your entire evals pipeline.</p><p>Most teams fire off a single generic prompt to create test inputs. The result is a homogeneous, shallow dataset where most examples look identical. The LLM converges on its most generic patterns, a failure known as mode collapse.</p><p>That leaves you stuck: you need test data, but you lack sufficient or diverse production traffic, and naively generated synthetic data is repetitive and misses your business use cases.</p><p>A structured approach gives you control over the distribution of your test inputs. 
You achieve this by thinking in terms of dimensions, anchoring them in your business use case, and applying targeted strategies.</p><p><strong>In this article, you will learn:</strong></p><ul><li><p>When to rely on synthetic data generation.</p></li><li><p>Why you should only generate user inputs.</p></li><li><p>How to use dimensions to avoid mode collapse.</p></li><li><p>Strategies to expand existing production data.</p></li><li><p>Approaches for agents, RAG, and deterministic tasks.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FVJv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FVJv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 424w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 848w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png" width="1200" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel" title="Synthetic data and production traces both feed into the evals dataset, which drives the error analysis flywheel" srcset="https://substackcdn.com/image/fetch/$s_!FVJv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 424w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 848w, https://substackcdn.com/image/fetch/$s_!FVJv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45b991a-4ed5-4d9e-8116-d6c2d8759698_1200x676.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 1: Synthetic data is a complementary input to production traces &#8212; both feed the same evals dataset that drives the error analysis flywheel.</em></figcaption></figure></div><p><em>Before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a 
href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Automated Agent Optimization Using Your Evals Data (Sponsored)</a></h2><p>This AI Evals &amp; Observability series is brought to you by <strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong>, the LLMOps open-source platform used by Uber, Netflix, Etsy, and more.</p><p>We use Opik daily across our courses and AI products. Not just for observability, but now to <strong>automatically optimize our agents&#8217; prompts</strong> using the same datasets and metrics we already have in the platform.</p><div id="youtube2-FY4uqXmq3fs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;FY4uqXmq3fs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/FY4uqXmq3fs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>You are learning how to build diverse synthetic datasets to evaluate your AI app. But once you have those datasets and metrics, why stop at measuring quality?<strong> <a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik&#8217;s agent optimizer</a></strong> closes the loop. It uses your <strong>eval dataset to automatically improve your prompts</strong>. Here is why we love it:</p><ul><li><p><strong>Same datasets, zero extra setup</strong> &#8212; Opik&#8217;s optimizer reuses the exact datasets, metrics, and tracing you already have. 
<a href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Quick start guide</a>.</p></li><li><p><strong>Six optimization algorithms</strong> &#8212; Choose from strategies like HRPO (our favorite), which performs root-cause analysis on failures and proposes targeted fixes, or evolutionary optimization to explore diverse prompt structures. <a href="https://www.comet.com/docs/opik/agent_optimization/algorithms/overview?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">See all algorithms.</a></p></li><li><p><strong>No-code Optimization Studio</strong> &#8212; For quick iterations, run optimization directly from the <a href="https://www.comet.com/docs/opik/agent_optimization/optimization_studio?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Optimization Studio UI</a>. Start from your prompt, pick your dataset, choose an algorithm, and watch Opik test prompt variations against your metrics in real time.</p></li></ul><p><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> is fully open source and integrates with OpenAI, Anthropic, Gemini, and 100+ providers. 
<em>Start optimizing your agents:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Optimize your AI agents&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/docs/opik/agent_optimization/quickstart?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Optimize your AI agents</span></a></p><div><hr></div><p>&#8595; <em>Now, let&#8217;s move back to the article.</em></p><h2>When Do We Need Synthetic Data?</h2><p>Before you have any production data, you face <em>the cold start problem</em>. You might be building a new feature or preparing for a launch. You need to simulate months of traffic in hours to ensure a reliable initial release. You cannot wait for real users to find your bugs. Synthetic data lets you test your application before day one.</p><p>Sometimes your app is live, but you lack enough production data. You might have 50 traces instead of 5,000. The error analysis framework needs enough examples to surface recurring patterns. Synthetic data supplements your real traces.</p><p>Other times, your data lacks diversity. You might have plenty of production traces, but they cluster around a few common use cases. Most users ask the same types of questions. You end up with almost no examples of edge cases, adversarial inputs, or minority user personas. Synthetic data lets you deliberately target the underrepresented regions of your input space.</p><p>Now that you know when synthetic data is necessary, let&#8217;s understand the core principle behind how it works. Most people get this fundamental concept wrong.</p><h2>Understanding the Core Principles</h2><p><em>The single most important principle is that <strong>you generate only the user inputs</strong></em>. 
These queries, messages, or requests should be as diverse as possible. They must cover all your business use cases, edge cases, and user profiles.</p><p>You do not generate the intermediate steps or the final outputs. The whole point of your evals dataset is to capture how your actual system behaves. Synthetic outputs would test a fiction, not your real app.</p><p>Instead, you let your real app produce the traces. First, you generate diverse synthetic inputs. Then, you feed all generated inputs into your AI app as if they were real user requests.</p><p>You track every trace using your observability platform, such as <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>. This captures all intermediate tool calls, model reasoning, and final outputs. You then pull these full traces to create your synthetic evals dataset.</p><p>Finally, you apply the error analysis framework we learned in <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Article 2</a>. You label the data with pass/fail judgments, fix errors, build evaluators, and iterate. Your synthetic dataset contains real system behavior triggered by synthetic inputs.</p><p>This makes it a valid proxy for production. 
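</p><p>The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>Trace</code>, <code>collect_traces</code>, and <code>toy_app</code> are hypothetical names, and the toy app returns its intermediate steps directly, whereas in a real setup you would pull them from your observability platform.</p><pre>
```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    user_input: str              # synthetic: the only part you generate
    steps: List[str]             # real: intermediate steps recorded from the app
    output: str                  # real: final answer produced by the app
    label: Optional[str] = None  # filled in later, during error analysis


def collect_traces(
    inputs: List[str],
    app: Callable[[str], Tuple[List[str], str]],
) -> List[Trace]:
    """Run every synthetic input through the app and keep the full trace."""
    traces = []
    for user_input in inputs:
        steps, output = app(user_input)
        traces.append(Trace(user_input, steps, output))
    return traces


def toy_app(user_input: str) -> Tuple[List[str], str]:
    # Hypothetical stand-in for your real AI app so the sketch runs end to end.
    steps = ["search_inbox(" + repr(user_input) + ")"]
    return steps, "Draft reply for: " + user_input


dataset = collect_traces(["Where is my refund?", "Summarize this thread"], toy_app)
```
</pre><p>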
Since the app handles processing and tool calls, <strong>the entire challenge reduces to a single thing</strong>: </p><blockquote><p><em>You must create diverse, realistic, business-grounded inputs.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-5_v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-5_v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 424w, https://substackcdn.com/image/fetch/$s_!-5_v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 848w, https://substackcdn.com/image/fetch/$s_!-5_v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!-5_v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-5_v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png" width="1200" height="676" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The synthetic data pipeline from input generation through your AI app and observability to the evals dataset&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The synthetic data pipeline from input generation through your AI app and observability to the evals dataset" title="The synthetic data pipeline from input generation through your AI app and observability to the evals dataset" srcset="https://substackcdn.com/image/fetch/$s_!-5_v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 424w, https://substackcdn.com/image/fetch/$s_!-5_v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 848w, https://substackcdn.com/image/fetch/$s_!-5_v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 1272w, https://substackcdn.com/image/fetch/$s_!-5_v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a406f97-34ad-4c5b-a523-b96cee4b129b_1200x676.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Image 2: You only generate the synthetic inputs &#8212; your real AI app produces the traces and outputs, which flow through your observability platform into a synthetic evals dataset.</em></figcaption></figure></div><p>Thus, the entire problem reduces to generating input data. Let&#8217;s explore concrete strategies for doing so, starting with the most fundamental approach.</p><h2>Seeing Your Inputs as Dimensions</h2><p>When generating input data, using a generic prompt is a mistake. You have zero control over the distribution of edge cases the LLM generates. 
The LLM will converge on the most generic, repetitive patterns.</p><p>Instead, think about a few key dimensions that matter for your application. Model each combination of dimension values as a tuple that seeds your generation process. <em>A common dimension tuple includes the persona, feature, scenario, and input modality.</em></p><p>The persona defines the different user types who interact with your app. You get inputs from an impatient customer, a technical expert, or a confused first-time user.</p><p>The feature represents the different capabilities of your app that you want to test. Examples include answering emails, generating a meeting summary, or drafting an article.</p><p>The scenario defines the specific failure modes or edge cases you want to stress-test. This includes contradictory instructions, garbled input, or outdated information.</p><p>The input modality is the format through which the input arrives. This could be plain text, a forwarded email thread, a voice transcript with filler words, or a pasted spreadsheet snippet.</p><p>If you define 3 personas, 5 features, 10 scenarios, and 3 input modalities (for example, plain text, a forwarded email thread, and a voice transcript), and combine them all, you get up to 3 &#215; 5 &#215; 10 &#215; 3 = 450 unique data seeds. Each seed is a specific combination that drives a targeted, diverse input.</p><p>For each seed, you craft a generation prompt. This prompt includes the dimension values, context about your app, and the scenario&#8217;s built-in failure assumption.</p><p>Embedding the failure assumption directly in the scenario is highly effective. You tell the LLM exactly what failure to target and what correct behavior looks like <a href="https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms">[1]</a>. 
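</p><p>As a rough sketch of the seed expansion described above, the Python snippet below crosses a few dimension values and renders one generation prompt per seed. The dimension values, the template wording, and the names <code>build_seeds</code> and <code>render_prompt</code> are illustrative assumptions, not taken from any library or from a real product.</p><pre>
```python
from itertools import product

# Hypothetical dimension values; replace them with ones from your own app.
PERSONAS = ["impatient customer", "technical expert", "confused first-time user"]
FEATURES = ["answer email", "summarize meeting", "draft article"]
# Each scenario maps to the correct behavior you expect from the app.
SCENARIOS = {
    "contradictory instructions": "the assistant should ask which instruction wins",
    "outdated information": "the assistant should flag the stale detail",
}
MODALITIES = ["plain text", "forwarded email thread", "voice transcript"]

PROMPT_TEMPLATE = (
    "You write realistic user inputs for an email and messaging assistant.\n"
    "Persona: {persona}\n"
    "Feature under test: {feature}\n"
    "Input modality: {modality}\n"
    "Scenario: {scenario}\n"
    "Correct behavior to probe: {expected}.\n"
    "Write one user input that fits all of the above."
)


def build_seeds():
    """Return every (persona, feature, scenario, modality) combination."""
    return list(product(PERSONAS, FEATURES, SCENARIOS, MODALITIES))


def render_prompt(seed):
    """Turn one seed tuple into a generation prompt for the generator LLM."""
    persona, feature, scenario, modality = seed
    return PROMPT_TEMPLATE.format(
        persona=persona,
        feature=feature,
        modality=modality,
        scenario=scenario,
        expected=SCENARIOS[scenario],
    )


seeds = build_seeds()
prompts = [render_prompt(seed) for seed in seeds]
print(len(seeds))  # 3 * 3 * 2 * 3 = 54 with the toy values above
```
</pre><p>Each rendered prompt would then go to your generator LLM with the scenario&#8217;s failure assumption embedded in it. 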
This makes your synthetic inputs far more precise.</p><p>Here are six dimension seeds with their generation prompts for an email and messaging assistant:</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/t7uEh/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ae1c105-f478-444e-b5a5-5e8b13e49414_1220x3280.png&quot;,&quot;thumbnail_url_full&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89e334f0-a685-4696-897e-942cb7c1056d_1220x3404.png&quot;,&quot;height&quot;:1692,&quot;title&quot;:&quot;Examples&quot;,&quot;description&quot;:&quot;Six dimension seeds with their generation prompts for an email and messaging assistant.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/t7uEh/1/" width="730" height="1692" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Your dimensions will vary depending on your business use case. You might have entirely different dimensions, such as language or urgency level.</p><h4>How much data do you need?</h4><p>At a minimum, generate enough data so that you have at least one example for each combination of dimensions. Keep generating more data until you stop seeing new failure modes. A simple chatbot might need 200 examples, while a complex agent might need over a thousand.</p><h4>Does this actually work?</h4><p>You might wonder if synthetic data actually works. 
In my experience, well-guided LLMs are highly capable of generating excellent, diverse examples of user prompts. Synthetic data is often the fastest way to build a meaningful evals dataset early on.</p><p>Dimension-based generation works great when you start from scratch. But what if you already have some production data and want to expand it?</p><h2>When You Have Some Production Data</h2><p>When you have production data, identify your failed or most difficult interactions to use as seeds. If a user input caused your system to fail, generate many variations of that specific input.</p><p>You generate multiple variations of the same input with the same underlying semantics to stress-test your app&#8217;s consistency. You can vary the phrasing, the level of aggression, or the ambiguity. This checks that your system does not fail the same way twice. This method is known as metamorphic testing.</p><p>Suppose your agent fails when users send multi-part questions in a single message. You take this real failed trace, combine it with our multi-dimensional strategy, and ask an LLM to generate 20 variations. The resulting inputs target the same failure class with enough variation to test your fix reliably.</p><p>Another method is evolutionary complexity, also known as Evol-Instruct. Originally introduced by the WizardLM researchers to generate synthetic training data for LLMs <a href="https://arxiv.org/abs/2304.12244">[7]</a>, it transfers remarkably well to generating evaluation data.</p><p>The core problem is the same: producing diverse, progressively complex instructions from a small set of seeds. Evol-Instruct uses an evolutionary paradigm to transform simple seed inputs into more complex, realistic ones <a href="https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms">[1]</a>.</p><p><em>Evol-Instruct is based on 3 core steps:</em></p><ol><li><p><strong>In-depth evolving</strong> takes a simple instruction and increases its complexity. 
It adds constraints, deepens the subject matter, or increases reasoning requirements. A simple order status query evolves into a complex rerouting request.</p></li><li><p><strong>In-breadth evolving</strong> generates completely new, diverse instructions. This ensures the evaluation suite covers a broad range of topics. While in-depth evolving makes existing inputs harder, in-breadth evolving widens the dataset.</p></li><li><p><strong>Elimination evolving</strong> is a filtration step. A critic LLM evaluates evolved instructions and discards those that provide no information gain or are nonsensical. This keeps the quality high as complexity grows.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sO-t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sO-t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 424w, https://substackcdn.com/image/fetch/$s_!sO-t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 848w, https://substackcdn.com/image/fetch/$s_!sO-t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 1272w, https://substackcdn.com/image/fetch/$s_!sO-t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sO-t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png" width="1200" height="888" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Evol-Instruct evolutionary data expansion showing in-depth, in-breadth, and elimination evolving from a seed input&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Evol-Instruct evolutionary data expansion showing in-depth, in-breadth, and elimination evolving from a seed input" title="Evol-Instruct evolutionary data expansion showing in-depth, in-breadth, and elimination evolving from a seed input" srcset="https://substackcdn.com/image/fetch/$s_!sO-t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 424w, https://substackcdn.com/image/fetch/$s_!sO-t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 848w, https://substackcdn.com/image/fetch/$s_!sO-t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sO-t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5393e2f2-1404-4c34-b485-48eb2f3b29f9_1200x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 3: Evol-Instruct uses three evolutionary strategies &#8212; in-depth, in-breadth, and elimination evolving &#8212; to expand a seed input into a diverse, high-quality evolved dataset.</em></figcaption></figure></div><p>These methods work well for single-turn interactions. 
But what about AI agents that handle multi-step, multi-turn conversations?</p><h2>When Building Agents</h2><p>For complex AI agents that plan and execute multi-step workflows, evaluation moves beyond single-turn queries. You need to generate an entire conversation.</p><p>To evaluate these systems, you set up a dual-agent dynamic: a tester agent simulating the user, and your actual app agent. The tester agent dynamically generates synthetic inputs turn by turn, reacting to your app&#8217;s responses just like a real user would <a href="https://www.zendesk.com/au/blog/zip1-building-realistic-multi-turn-tests-for-ai-agents/">[3]</a>. For example, a tester agent playing a frustrated customer might escalate its tone if the first response is vague or pivot to a different request mid-conversation.</p><p>Implementing this tester agent uses the same ideas as single-turn generation. You still define dimensions such as personas and scenarios to impersonate, but you run the generation iteratively, once per conversation turn.</p><p>Research from November 2025 shows that agents achieve over 90% accuracy on single-step tasks, but correctness drops to 10-15% on full conversations <a href="https://www.zendesk.com/au/blog/zip1-building-realistic-multi-turn-tests-for-ai-agents/">[3]</a>. 
This makes multi-turn flow evaluation a necessity for production reliability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iClD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iClD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 424w, https://substackcdn.com/image/fetch/$s_!iClD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 848w, https://substackcdn.com/image/fetch/$s_!iClD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 1272w, https://substackcdn.com/image/fetch/$s_!iClD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iClD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png" width="1200" height="888" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multi-turn synthetic interaction between a tester agent and your AI app with session-level evaluation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multi-turn synthetic interaction between a tester agent and your AI app with session-level evaluation" title="Multi-turn synthetic interaction between a tester agent and your AI app with session-level evaluation" srcset="https://substackcdn.com/image/fetch/$s_!iClD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 424w, https://substackcdn.com/image/fetch/$s_!iClD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 848w, https://substackcdn.com/image/fetch/$s_!iClD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 1272w, https://substackcdn.com/image/fetch/$s_!iClD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d48fe0d-be1e-4826-b232-1ab98efe9a37_1200x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 4: A tester agent dynamically generates reactive inputs across multiple conversation turns, and the entire session is evaluated for goal completion, tone, and process correctness.</em></figcaption></figure></div><p>Multi-turn interactions are one specialized case for synthetic data. Another common scenario is information retrieval, where your app searches a knowledge base before responding.</p><h2>When Doing RAG</h2><p>An AI app that retrieves information from a knowledge base before generating a response is using Retrieval-Augmented Generation (RAG). In this scenario, you can use a reverse workflow to create ground-truth datasets. 
Instead of the standard retrieval flow, you start with the knowledge base and work backwards <a href="https://www.evidentlyai.com/llm-guide/llm-test-dataset-synthetic-data">[4]</a>.</p><p>First, take your documents, PDFs, or structured data. Then use an LLM to extract key facts, procedures, numbers, or policies from a specific document chunk that&#8217;s part of your knowledge base.</p><p>The LLM then generates a realistic user question that can only be answered using that specific chunk. Because the question is derived directly from the source material, you know exactly which document chunk should be retrieved. That also makes it easy to generate the right answer.</p><p>This guarantees perfect alignment between the input, the expected retrieval context, and the expected output. You get a complete ground-truth triplet containing the question, relevant context, and expected answer.</p><p>To add diversity, you can use the same strategy of defining multiple dimensions, then mix up question styles and complexity levels to avoid a shallow dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EZyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EZyL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 424w, https://substackcdn.com/image/fetch/$s_!EZyL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 848w, 
https://substackcdn.com/image/fetch/$s_!EZyL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 1272w, https://substackcdn.com/image/fetch/$s_!EZyL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EZyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png" width="1200" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Reverse-workflow RAG synthesis deriving questions and answers from knowledge base chunks to create ground-truth triplets&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reverse-workflow RAG synthesis deriving questions and answers from knowledge base chunks to create ground-truth triplets" title="Reverse-workflow RAG synthesis deriving questions and answers from knowledge base chunks to create ground-truth triplets" srcset="https://substackcdn.com/image/fetch/$s_!EZyL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 424w, 
https://substackcdn.com/image/fetch/$s_!EZyL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 848w, https://substackcdn.com/image/fetch/$s_!EZyL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 1272w, https://substackcdn.com/image/fetch/$s_!EZyL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51414f2b-d109-4a5d-8086-ab41f2bbf027_1200x752.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Image 5: The reverse workflow starts from your knowledge base, extracts key facts, generates questions and expected answers, and produces perfectly aligned ground-truth triplets.</em></figcaption></figure></div><p>This reverse synthesis works because you can derive questions and answers from documents. But what about purely deterministic tasks where the correct answer can be computed exactly?</p><h2>For Deterministic Testing</h2><p>For tasks with deterministic correct answers, you can use your system&#8217;s schema or rules to generate both the input and the ground truth. This applies to structured data extraction or calculations, such as text-to-SQL, JSON, code, or math.</p><p>You work from the answer backward to the question, similar to the reverse workflow for RAG. However, instead of text, you use schemas, databases, or rule sets <a href="https://deepeval.com/guides/guides-using-synthesizer">[6]</a>.</p><p>Suppose you want to generate <em>(text, SQL)</em> tuples. You have a database with tables for customers, orders, and products. You use your database schema to programmatically generate valid SQL queries of varying complexity. These SQL queries serve as your ground truth.</p><p>You then use an LLM to translate each SQL query back into a natural language question. A query selecting customers with pending orders over a specific amount becomes a plain English question.</p><p>Your evals dataset now has a natural language input and the correct SQL mapping. You can test whether your system generates the right query and returns the right data.</p><h2>Next Steps</h2><p>Building an evals dataset from production traces alone has limits. Synthetic data solves the cold start problem and fills coverage gaps.</p><p>Synthetic data generation is not about blindly asking an LLM to create test cases. 
It is about structuring your inputs as dimensions, anchoring them in your business use case, and applying the right strategy. The result is a diverse, controlled dataset that you can feed into your error analysis framework.</p><p>Now you know how to build your evals dataset from both production and synthetic data. In the next article, we will show you the right way to build your evaluator(s).</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-part series on AI Evals &amp; Observability</a></strong>. <strong>Here is what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><strong>Generate Synthetic Datasets for AI Evals</strong> &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? 
Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Confident AI. (n.d.). The Definitive Guide to Synthetic Data Generation Using LLMs. Confident AI. <a href="https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms">https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms</a></p></li><li><p>Husain, H. (n.d.). Using LLM-as-a-Judge For Evaluation: A Complete Guide. Hamel&#8217;s Blog. <a href="https://hamel.dev/blog/posts/llm-judge/#example-llm-prompts-for-generating-user-inputs">https://hamel.dev/blog/posts/llm-judge/#example-llm-prompts-for-generating-user-inputs</a></p></li><li><p>Zendesk. (n.d.). 
Building realistic multi-turn tests for AI agents. Zendesk. <a href="https://www.zendesk.com/au/blog/zip1-building-realistic-multi-turn-tests-for-ai-agents/">https://www.zendesk.com/au/blog/zip1-building-realistic-multi-turn-tests-for-ai-agents/</a></p></li><li><p>Evidently AI. (n.d.). How to create LLM test datasets with synthetic data. Evidently AI. <a href="https://www.evidentlyai.com/llm-guide/llm-test-dataset-synthetic-data">https://www.evidentlyai.com/llm-guide/llm-test-dataset-synthetic-data</a></p></li><li><p>Langfuse. (n.d.). Synthetic Dataset Generation for LLM Evaluation. Langfuse. <a href="https://langfuse.com/guides/cookbook/example_synthetic_datasets">https://langfuse.com/guides/cookbook/example_synthetic_datasets</a></p></li><li><p>DeepEval. (n.d.). Generate Synthetic Test Data for LLM Applications. DeepEval. <a href="https://deepeval.com/guides/guides-using-synthesizer">https://deepeval.com/guides/guides-using-synthesizer</a></p></li><li><p>Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., &amp; Jiang, D. (2023). WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv. 
<a href="https://arxiv.org/abs/2304.12244">https://arxiv.org/abs/2304.12244</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Thank You for Supporting Decoding AI]]></title><description><![CDATA[Roadmap updates and the new DAI chat.]]></description><link>https://www.decodingai.com/p/thank-you</link><guid isPermaLink="false">https://www.decodingai.com/p/thank-you</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Thu, 19 Feb 2026 12:01:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f5d7fb5b-38a5-4a21-8e65-327e5c29ae79_1100x220.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Between wrapping up a series and starting a new one, articles and courses, I haven&#8217;t had much time to pause and look at the community we&#8217;re building here.</p><p>Today, I&#8217;m taking a short break from the technical deep dives to write something a little different. </p>
      <p>
          <a href="https://www.decodingai.com/p/thank-you">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[No Evals Dataset? Here's How to Build One from Scratch]]></title><description><![CDATA[Build evaluators to signal problems that users actually care about. Step-by-step guide.]]></description><link>https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis</link><guid isPermaLink="false">https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 17 Feb 2026 12:02:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HoRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><strong>Build an AI Evals Dataset from Scratch</strong> &#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h1>Build an AI Evals Dataset from Scratch</h1><p>In the <a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">previous article</a>, you learned what AI evals are, plus where, when, and why to use them. You saw the three core scenarios (optimization, regression, production monitoring) and the tech stack. But knowing <em>where</em> to use evals is only half the battle. You still need the actual dataset and evaluators to run them. That is what the rest of the series is about. 
<strong>The &#8220;how&#8221;.</strong></p><p>For example, after shipping Brown (the writer agent from our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineer course</a> capstone project), I&#8217;ve been actively using it to write articles and lessons (even this one, haha), thus generating a lot of &#8220;production&#8221; traces. However, I had no structured way to evaluate them. I would look at a few outputs, tweak a prompt, and hope for the best. It wasn&#8217;t until I sat down, pulled 50 traces, and started writing notes on what went wrong with each one that I realized most of my failures fell into just 3&#8211;4 categories. That simple exercise of looking at the data changed everything. It told me exactly what to measure and what to fix first.</p><p><em>In reality, it is not that simple, but it is not far from it. That&#8217;s what I want to teach you in this article.</em></p><p>Most teams, however, skip straight to building fancy evaluation dashboards or crafting elaborate scoring criteria without ever looking at their data. As Hamel Husain puts it: <em>&#8220;Many teams make the mistake of crafting elaborate eval criteria without first looking at the data&#8221;</em> <a href="https://hamel.dev/blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html">[2]</a>. This leads to two common traps: creating irrelevant criteria that waste resources on low-probability defects, or setting unrealistic criteria that the technology isn&#8217;t ready for.</p><p>You know you need AI evals, but you likely don&#8217;t have a dataset, you don&#8217;t have an evaluator, and you don&#8217;t know where to start. Building everything from scratch feels overwhelming. 
That is why many people start using generic tools and metrics, which is another huge mistake.</p><p><em>The solution is the error analysis framework.</em> This is a step-by-step flywheel: start small, let the data guide you, and iteratively grow your dataset and evaluators. You do not need hundreds of examples or a perfect system on day one. You need 20&#8211;50 real traces and the discipline to look at them carefully, which you can easily start from day zero of your project.</p><p><strong>In this article, we will cover:</strong></p><ol><li><p>How to create and format your initial dataset from production or synthetic traces.</p></li><li><p>How to manually label your data.</p></li><li><p>How to fix errors and grow your dataset with regression tests.</p></li><li><p>The iterative process of building and aligning an LLM judge.</p></li><li><p>How to perform systematic error analysis to cluster and prioritize fixes.</p></li><li><p>When to move from generic evaluators to specialized ones.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HoRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HoRg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!HoRg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183015,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HoRg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 424w, 
https://substackcdn.com/image/fetch/$s_!HoRg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!HoRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50162cce-1890-424b-ab13-e3fa910dc94b_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source Observability for Your Multimodal AI Agents (Sponsored</a>)</h2><p>This <em><strong>AI Evals &amp; Observability series</strong></em> is brought to you by <em><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong></em>, the LLMOps open-source platform used by Uber, Netflix, Etsy, and more.</p><p>We&#8217;re proud to partner with a tool we actually use daily across our open-source courses and real-world AI products. Why? <em>Because it makes evaluating multimodal AI apps as easy as evaluating text ones.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DvE9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 424w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 848w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DvE9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DvE9!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png" width="1200" height="602.4725274725274" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:731,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DvE9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 424w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 848w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 
1272w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Monitoring traces that contain generated videos, such as when using OpenAI Sora. 
<a href="https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Learn more about monitoring multimodal traces with Opik</a> or about <a href="https://www.comet.com/docs/opik/integrations/openai#video-generation-sora?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">hooking Opik to OpenAI Sora</a>.</figcaption></figure></div><p><em>AI apps are no longer just text-in, text-out.</em> They process images, generate videos, parse PDFs, and more. Monitoring and evaluating all of that used to be painful. With Opik, it&#8217;s not. Here is why we love it:</p><ul><li><p><strong>Trace everything</strong> &#8212; Opik renders images, videos and PDFs directly inside your traces. No more guessing what your model actually saw or generated. We use this daily, and it changed how we debug multimodal pipelines.</p></li><li><p><strong>Zero-friction multimodal evals</strong> &#8212; Add image URLs or upload files directly in the UI, then run LLM-as-a-Judge evaluations on them. Opik auto-detects vision-capable models (GPT-4o, Claude 3+, Gemini) and warns you if the model doesn&#8217;t support vision.</p></li><li><p><strong>Video generation? Traced automatically</strong> &#8212; Wrap your OpenAI client in one line, and Opik tracks the full Sora workflow: creation, polling, download, and logs the generated video as an attachment. Full visibility, minimal setup. <a href="https://www.comet.com/docs/opik/integrations/openai#video-generation-sora?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Guide here</a>.</p></li></ul><p><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> is fully open-source and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier). 
<em>Learn more about evaluating multimodal traces:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Evaluate multimodal traces&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Evaluate multimodal traces</span></a></p><p>&#8595; <em>Now, let&#8217;s move back to the article.</em></p><div><hr></div><h2>Create the AI Evals Dataset</h2><p>Before you can evaluate anything, you need an evals dataset: a collection of examples that represent how your app should behave. It is the foundation on which everything else builds. Start small, as 20&#8211;50 examples are enough, and grow it over time. As Anthropic recommends: <em>&#8220;20-50 simple tasks drawn from real failures are a great start&#8221;</em> <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[3]</a>.</p><p>The highest-value source for your evals dataset is real production traces. These are actual user interactions with your app, and they reflect genuine usage patterns, edge cases, and failure modes that you could never fully anticipate upfront.</p><p>If you are pre-launch or have limited production data, start with the manual checks you already run during development. 
These are the behaviors you verify before each release, and common tasks end users try.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SGXS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SGXS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!SGXS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!SGXS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!SGXS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SGXS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png" width="1200" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149381,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SGXS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!SGXS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!SGXS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!SGXS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf6142a-a653-4ddb-995c-df7a1cfe870c_1200x978.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Step 1</figcaption></figure></div><p>You must log everything: user input, system prompt, model output, tool calls, retrieved documents, and metadata such as channel, timestamp, and user ID. Make it easy to browse, filter, and search these traces. You can easily log these traces using observability/LLMOps tools such as <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> (which we always use) <a href="https://youtube.com/watch?v=BsWxPI9UM4c&amp;si=Zn5CgOvM_uqtTrF6">[4]</a>.</p><p>Aim for 50&#8211;100 traces initially. In early development, each change has a noticeable impact, so small sample sizes work fine. 
More mature systems need larger datasets.</p><p>If you already have a large volume of production traces, you need to sample them properly.</p><p>More advanced sampling strategies include:</p><ul><li><p><strong>Outlier detection:</strong> Sort by response length, latency, or number of tool calls and review the extremes.</p></li><li><p><strong>User feedback signals:</strong> Prioritize traces with negative feedback or escalations.</p></li><li><p><strong>Metric-based sorting:</strong> Use generic metrics as exploration signals.</p></li><li><p><strong>Stratified sampling:</strong> Group by user type, feature, or query category and sample from each group.</p></li><li><p><strong>Embedding clustering:</strong> Generate embeddings of the queries, cluster them to reveal natural groupings, then oversample small clusters to surface edge cases.</p></li></ul><p>If you don&#8217;t have enough production data, you can automatically create test examples using an LLM to generate synthetic user inputs across different dimensions like features, scenarios, and user personas. We will explore synthetic data generation in detail in Article 3.</p><p>Every example in your dataset should have a consistent structure:</p><ol><li><p><strong>Input:</strong> The user query or request.</p></li><li><p><strong>Output:</strong> The final output of the agent.</p></li><li><p><strong>Context:</strong> Any additional information the system had access to, such as retrieved documents, conversation history, or system prompts.</p></li><li><p><strong>Trace Spans:</strong> The intermediate steps, such as tool calls (API calls, search operations), model calls, or any other relevant step (e.g., mapping the output to a Pydantic model).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7D-2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!7D-2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 424w, https://substackcdn.com/image/fetch/$s_!7D-2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 848w, https://substackcdn.com/image/fetch/$s_!7D-2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 1272w, https://substackcdn.com/image/fetch/$s_!7D-2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7D-2!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png" width="1200" height="923.0769230769231" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1120,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:488564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7D-2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 424w, https://substackcdn.com/image/fetch/$s_!7D-2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 848w, https://substackcdn.com/image/fetch/$s_!7D-2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 1272w, https://substackcdn.com/image/fetch/$s_!7D-2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1c30676-e9d7-42d7-a2d3-b63456bac388_2036x1566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of monitoring a multi-modal trace from Brown, the writing agent, using <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> (an LLMOps platform).</figcaption></figure></div><p>This structure ensures your evaluators have everything they need to judge whether the system performed correctly. It also makes your dataset portable across different evaluator types, from simple ones that just check whether a specific tool was called to more complex LLM Judges.</p><p>&#128270; Suppose you are building a vertical assistant agent that drafts replies to work emails and Slack messages. You have been using it in production for two weeks and have logged traces. 
Here is what one formatted dataset entry looks like:</p><ul><li><p><strong>Input:</strong> A Slack message in #product from your Product Manager: &#8220;@you Can you estimate how long it would take to add CSV export to the reports page?&#8221;</p></li><li><p><strong>Output:</strong> The agent&#8217;s generated reply: &#8220;Based on the current codebase, I&#8217;d estimate 3&#8211;5 days, including testing. The data layer is already structured for it &#8212; the main work would be the export logic and UI. Want me to write up a quick ticket with more detail?&#8221;</p></li><li><p><strong>Context:</strong> The system prompt (which instructs the agent to match workplace communication norms), the user&#8217;s role (engineer), the channel (#product), and relevant prior messages in the thread.</p></li><li><p><strong>Trace Spans:</strong> (1) A model call to classify the message type and urgency, (2) a retrieval step pulling relevant codebase context about the reports page, (3) a model call generating the draft reply.</p></li></ul><p>You pull 50 such traces. These include a mix of emails and Slack messages across scenarios such as executive requests, colleague questions, and phishing attempts. You format each one this way.</p><p>Now that you have a dataset, the next step is to label it. You need to decide, for each example, whether the output is good enough.</p><h2>Manually Label Your Dataset</h2><p>To judge whether each example is <em>&#8220;good&#8221;</em> or <em>&#8220;bad,&#8221;</em> you need to appoint one or more domain experts. The fewer, the better, because every additional labeler adds room for inconsistent judgments. The domain expert who leads the labeling process should be someone who genuinely understands your use case. </p><p><strong>This step is key! </strong>Why? Because this person&#8217;s judgment becomes the definitive source of truth. In other words, how they label your dataset will have a cascading effect on everything else. 
</p><p>For each example in your dataset, the domain expert makes a binary judgment: <strong>Pass</strong> or <strong>Fail</strong>. Do not use a 1&#8211;5 scale or letter grades. Just pass or fail.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nr0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nr0u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!Nr0u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!Nr0u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!Nr0u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nr0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png" width="1200" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149310,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nr0u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!Nr0u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!Nr0u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!Nr0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2afdab3e-32a3-4210-ad65-14c4975d2f6a_1200x978.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Step 2</figcaption></figure></div><p>Binary decisions force clarity. A score of &#8220;3.2 out of 5&#8221; is hard to interpret and even harder to act on <a href="https://www.decodingai.com/p/the-5-star-lie-you-are-doing-ai-evals">[6]</a>. Pass/fail forces you to articulate exactly what &#8220;good enough&#8221; means and creates actionable insights. If something fails, you know it needs fixing. If it passes, you move on.</p><p>For every judgment, especially failures, the domain expert must write a short critique explaining <em>why</em> it failed or passed. These critiques are gold. 
They capture the expert&#8217;s reasoning, surface patterns in what goes wrong, and later become a central piece in the few-shot examples you feed to your LLM judge.</p><p>Even though your labels are binary, critiques supply the missing detail: instead of a vague number like 3.4/5, a clear explanation highlights exactly what went well or what went wrong. They also act as indirect instructions to your LLM judge, which is why they matter so much when you add them as few-shot examples.</p><p>Do not try to catch every single mistake in a trace. Find the first thing that went wrong, the most upstream error, and move to the next example. The goal is to surface recurring failure patterns, not to write a detailed report for each trace. However, do not be too concise. Each critique should be detailed enough to later serve as a few-shot example for your LLM judge.</p><p>With <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik&#8217;s API</a> or <a href="https://www.comet.com/docs/opik/prompt_engineering/mcp_server?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">MCP Server</a>, you can easily extract traces for labeling into a spreadsheet, a simple internal tool, or your own custom annotation tool, as discussed in Article 1. Display the input, the system&#8217;s output, and all the context side by side. Make labeling as frictionless as possible.</p><p>&#128270; Consider our email/Slack assistant example. Your domain expert sits down with the first 50 traces and labels each one:</p><ul><li><p><strong>Trace #5 &#8212; FAIL:</strong> A vendor sent four specific technical questions about API integration. The agent replied: &#8220;I&#8217;ll look into these and get back to you.&#8221; <em>Critique:</em> &#8220;Failed to answer any of the four specific questions. Gave no timeline for follow-up. A vendor expecting technical answers got a vague brush-off. 
This damages the partnership.&#8221;</p></li><li><p><strong>Trace #7 &#8212; PASS:</strong> The CFO approved a $50K budget reallocation and asked the user to loop in HR. The agent replied by restating the action items, confirming it would initiate the transfer and loop in HR, and promising a confirmation. <em>Critique:</em> &#8220;Accurately restates all action items from the CFO&#8217;s approval, commits to the next steps, and matches the professional tone expected for executive communication.&#8221;</p></li><li><p><strong>Trace #6 &#8212; FAIL:</strong> An obvious advance-fee scam email from &#8220;Prince Makumba&#8221; offering $8.5M. The agent replied: &#8220;Could you provide more details about this inheritance?&#8221; <em>Critique:</em> &#8220;Engaged with a textbook scam email instead of ignoring it. Any reply validates the scammer&#8217;s target. Expected behavior: no reply.&#8221;</p></li></ul><p>You now have a labeled dataset with pass/fail judgments and critiques. The natural next step: fix the obvious problems you&#8217;ve just discovered.</p><h2>Manually Fix Errors</h2><p>The labeling process will reveal generic, often simple issues. Examples include a missing instruction in a prompt, a broken tool call, or a formatting problem. Fix these before doing anything else. Do not build an evaluator for something you can just fix right now.</p><p>As Hamel Husain recommends, address obvious errors discovered during review before building judges. The point of the flywheel is product quality, not a pretty eval suite <a href="https://youtube.com/watch?v=BsWxPI9UM4c&amp;si=Zn5CgOvM_uqtTrF6">[4]</a>. </p><p>After fixing, re-run your system with the same inputs and potentially new ones to generate fresh outputs. Label the new outputs. Did your fixes work? Did they introduce new problems? This is the inner loop of the flywheel: create, label, fix, and repeat. 
Each iteration improves your system and enriches your dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dqSm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dqSm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!dqSm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!dqSm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!dqSm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dqSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png" width="1200" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1ed4005-1671-4c17-915d-533bba04f196_1200x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149330,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dqSm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!dqSm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!dqSm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!dqSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ed4005-1671-4c17-915d-533bba04f196_1200x978.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Step 3</figcaption></figure></div><p>Every iteration adds more labeled examples to your dataset. Failed examples that you have now fixed become regression test cases. They ensure old bugs do not come back. New examples expand coverage. Over time, your dataset becomes a living artifact that captures the accumulated knowledge of what &#8220;good&#8221; and &#8220;bad&#8221; look like for your specific app. This is the foundation for everything that follows. Aim for continuous growth: start with 20&#8211;50, grow to 100+, and keep adding as you discover new failure modes in production.</p><p>&#128270; Back to our simple agent example that answers professional emails or Slack messages. During labeling, the domain expert flagged several traces where the agent replied to obvious phishing and scam emails. 
</p><p>The fix was straightforward: add an explicit instruction to the system prompt telling the agent to never reply to messages from unrecognized external senders requesting money, credentials, or personal information. It should flag them as suspicious instead. After applying the fix, the expert re-runs the same scam inputs through the updated system. The agent now correctly produces no reply for all of them. These previously-failing traces become regression test cases. This ensures this class of errors never returns, even after future prompt changes.</p><p>At some point, manually labeling every example doesn&#8217;t scale. That&#8217;s when you need an automated evaluator to do the heavy lifting for you.</p><h2>Iteratively Build Your Evaluator</h2><p>You have been labeling by hand. That worked for the first 50&#8211;100 examples, but now you want to evaluate thousands of traces automatically. You need an evaluator. This is a system that can judge outputs without a human in the loop. </p><p>The key insight is to build it <em>iteratively</em> using the human-labeled data you have already collected, rather than designing evaluation criteria from scratch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hlLz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hlLz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 424w, 
https://substackcdn.com/image/fetch/$s_!hlLz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!hlLz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!hlLz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hlLz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png" width="1200" height="978" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149310,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!hlLz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!hlLz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!hlLz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!hlLz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadc49d15-ad38-4e42-94b3-cb68e48ec2f4_1200x978.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Step 4</figcaption></figure></div><p>Before building your evaluator, split your labeled dataset into subsets:</p><ul><li><p><strong>Train:</strong> The examples you will use to build and tune your evaluator (e.g., as few-shot examples in your LLM judge prompt).</p></li><li><p><strong>Dev:</strong> The examples you will use to check whether your evaluator is working while you iterate on it.</p></li><li><p><strong>Test:</strong> The examples you set aside and never touch until you are ready for a final evaluation of your evaluator&#8217;s quality.</p></li></ul><p>We will cover the details of how to split effectively and how to evaluate the evaluator&#8217;s effectiveness in Article 5. For now, the key idea is: do not train and test on the same data, exactly as you would when training any other AI model.</p><p>Due to the nature of AI app outputs, which are non-deterministic, unstructured, and subjective, the most popular approach is to use an LLM to grade another LLM&#8217;s output. This is known as an <em>&#8220;LLM judge&#8221;</em> or <em>&#8220;LLM-as-a-judge&#8221;.</em></p><p>At this stage, build one binary LLM judge that runs across your entire dataset. Think of it as a binary classifier: for each trace, it returns <strong>Pass (1) or Fail (0)</strong> plus a <strong>critique</strong> explaining its reasoning. Not a 1&#8211;5 scale. Not a letter grade. Just pass/fail with a written justification. This mirrors exactly what your domain expert did during manual labeling.</p><p>Binary judgments are clear, actionable, and easy to aggregate. 
A score of <em>&#8220;3.2 out of 5&#8221;</em> is hard to interpret and even harder to act on. Pass/fail forces clarity and creates actionable insights.</p><p>The real power lies in your few-shot examples and dataset, not your prompt. This is a counterintuitive but critical insight: the system prompt for your LLM judge can be almost neutral. Just specify what the task is, the expected output format (pass/fail + critique), and a few core steps. Keep it simple. The real guidance comes from the <strong>few-shot examples</strong> you include in the prompt. These are the labeled examples from your dataset with their critiques. These examples encode your domain expert&#8217;s judgment, show the LLM what &#8220;good&#8221; and &#8220;bad&#8221; look like for your specific use case, and steer the judge far more effectively than elaborate prompt instructions ever could. Your dataset is the secret weapon, not your system prompt. More on this in Article 4.</p><p>To build it, start with the critiques your domain expert wrote during manual labeling. Select representative pass and fail examples. These become the few-shot examples in your judge prompt. Test the judge against your dev set and iterate until it mostly agrees with your domain expert&#8217;s labels. </p><p>Not everything needs an LLM judge. For anything that can be checked with simple logic (true/false or numeric), use code-based checks. Examples include checking whether the response included a required disclaimer, whether it is within the word limit, whether it returned valid structured output, or whether it called the right tool.</p><p>Code-based evaluators are faster, cheaper, and more reliable than LLM judges for objective checks. Reserve LLM judges for subjective or nuanced checks. These include tone, helpfulness, conversational flow, or quality of handoffs, where correctness is hard to express in code. More on this in Article 4. 
For now, the rule of thumb is: use code when you can, use an LLM judge when you must.</p><p>Your evaluator is only useful if it agrees with your domain expert. Run the evaluator on the dev set and determine how often it agrees with the human. Create an agreement matrix comparing the human label to the evaluator&#8217;s label. If the evaluator says &#8220;Pass&#8221; when the human said &#8220;Fail&#8221; (or vice versa), refine the evaluator&#8217;s prompt or logic until the agreement is high enough to trust. Be aware that raw agreement can be misleading with imbalanced datasets: if 90% of your traces pass, a judge that always answers &#8220;Pass&#8221; still reaches 90% agreement. More on this in Article 5. For now, the key idea is: always validate your automated evaluator against human judgment before trusting it.</p><p><em>&#8220;Many teams make the mistake of crafting elaborate eval criteria without first looking at the data&#8221;</em> <a href="https://hamel.dev/blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html">[2]</a>. This quote from Hamel Husain captures the core philosophy of the error analysis framework. If you design evaluation criteria in a vacuum, without first reviewing your actual traces and failure modes, you risk creating criteria that are either irrelevant, wasting resources on low-probability defects, or unrealistic. The solution is to put the data and metrics first, not preset criteria or LLMs. Let the failure modes you discover through manual review and error analysis drive what you evaluate.</p><p>&#128270; Back to our example. Let&#8217;s assume you select 15 labeled traces with their critiques from your dataset and use them as few-shot examples in your LLM judge prompt. For instance, you include the vendor email trace (Fail: &#8220;didn&#8217;t answer any of the four specific questions, gave no timeline&#8221;) and the CFO budget trace (Pass: &#8220;accurately restates action items, commits to next steps, matches professional tone&#8221;). 
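</p><p>As a rough illustration, assembling a judge prompt from labeled traces might look like the sketch below. The <code>LabeledTrace</code> structure, the <code>build_judge_prompt</code> helper, and the field layout are assumptions made for this example, not part of Opik or any specific framework, and the call to an actual LLM is left out.</p>

```python
from dataclasses import dataclass


@dataclass
class LabeledTrace:
    """One expert-labeled example (hypothetical structure, for illustration only)."""
    input_text: str   # the incoming email or Slack message
    output_text: str  # the agent's reply
    label: str        # "Pass" or "Fail", as decided by the domain expert
    critique: str     # the expert's written reasoning


# Deliberately neutral instructions: the few-shot examples carry the guidance.
JUDGE_INSTRUCTIONS = (
    "You are evaluating whether an AI email/Slack assistant produced an "
    "appropriate reply. For each trace, output Pass or Fail with a critique "
    "explaining your reasoning. Here are examples of how a domain expert "
    "judged similar traces:"
)


def build_judge_prompt(examples, new_input, new_output):
    """Assemble instructions, few-shot examples, then the trace to grade."""
    parts = [JUDGE_INSTRUCTIONS, ""]
    for number, example in enumerate(examples, start=1):
        parts += [
            f"Example {number}",
            f"Input: {example.input_text}",
            f"Output: {example.output_text}",
            f"Verdict: {example.label}",
            f"Critique: {example.critique}",
            "",
        ]
    parts += [
        "Now evaluate this trace:",
        f"Input: {new_input}",
        f"Output: {new_output}",
        "Verdict:",
    ]
    return "\n".join(parts)


few_shot = [
    LabeledTrace(
        input_text="Vendor email with four specific questions about API integration.",
        output_text="I'll look into these and get back to you.",
        label="Fail",
        critique="Didn't answer any of the four specific questions, gave no timeline.",
    ),
    LabeledTrace(
        input_text="CFO approves a $50K budget reallocation and asks to loop in HR.",
        output_text="Restates the action items and confirms the next steps.",
        label="Pass",
        critique="Accurately restates action items, commits to next steps, matches professional tone.",
    ),
]

prompt = build_judge_prompt(
    few_shot,
    new_input="Can you estimate how long it would take to add CSV export to the reports page?",
    new_output="I'd estimate 3-5 days, including testing.",
)
print(prompt)
```

<p>In a setup like this, the instructions stay short on purpose; adding or swapping few-shot examples is the main lever for changing how the judge behaves.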
</p><p>The judge's prompt simply says: <em>&#8220;You are evaluating whether an AI email/Slack assistant produced an appropriate reply. For each trace, output Pass or Fail with a critique explaining your reasoning. Here are examples of how a domain expert judged similar traces:&#8221;</em> followed by those few-shot examples. You run this judge on your dev set of 20 traces. It agrees with the domain expert&#8217;s labels on 18 out of 20. The two disagreements reveal that the judge is too lenient on vague responses to multi-part questions. You add another few-shot example covering that pattern, and agreement improves.</p><p>You now have automated evaluators aligned with human judgment. The next step is to run them on new data and analyze the errors they find.</p><h2>Doing Error Analysis</h2><p>Your evaluator is running. It is flagging failures across hundreds or thousands of traces. But a list of pass/fail results is not enough. You need to understand <em>which errors are occurring, how often, and which to fix first</em>. This is error analysis: the most important activity in evals. It is the systematic process of clustering, ranking, and acting on the failures your evaluators surface. It helps you decide what evaluators to create in the first place, allowing you to identify failure modes unique to your application and data.</p><p>Sample a fresh batch of production traces (or new synthetic data) from your observability layer, such as <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, that your evaluator hasn&#8217;t seen before. Run your evaluator on these traces. You now have a set of Pass/Fail results with critiques. 
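</p><p>The sampling step above can be sketched in a few lines. Everything here is hypothetical: <code>sample_and_evaluate</code> and <code>toy_judge</code> are illustrative names, the trace dictionaries are made up, and the toy judge is a keyword check standing in for a real LLM judge or an observability SDK.</p>

```python
import random


def sample_and_evaluate(traces, judge, sample_size, seed=0):
    """Run `judge` on a random sample of traces and split the results by verdict."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-reviewed
    batch = rng.sample(traces, min(sample_size, len(traces)))
    passed, failed = [], []
    for trace in batch:
        verdict, critique = judge(trace)
        record = {"trace": trace, "verdict": verdict, "critique": critique}
        (passed if verdict == "Pass" else failed).append(record)
    return passed, failed


def toy_judge(trace):
    """Stand-in for a real LLM judge: fails replies that give no concrete answer."""
    if "get back to you" in trace["output"]:
        return "Fail", "Vague reply with no concrete answer or timeline."
    return "Pass", "Addresses the request directly."


# Made-up traces; in practice these would come from your observability layer.
traces = [
    {"id": 1, "output": "I'll look into these and get back to you."},
    {"id": 2, "output": "I'd estimate 3-5 days, including testing."},
    {"id": 3, "output": "Confirmed. I will initiate the transfer and loop in HR."},
]

passed, failed = sample_and_evaluate(traces, toy_judge, sample_size=10)
for record in failed:
    print(record["trace"]["id"], record["verdict"], record["critique"])
```

<p>The failed records, each with its critique attached, are the raw material for the analysis that follows.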
</p><p><em>This is where the flywheel connects to production monitoring: sample live traces regularly and run your evaluators on them to track quality over time.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ZGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ZGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!5ZGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!5ZGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!5ZGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ZGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png" width="1200" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149339,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5ZGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!5ZGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!5ZGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!5ZGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b827341-efd2-44c6-b375-aee6b77e74a9_1200x978.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Step 5</figcaption></figure></div><p>If you have been writing critiques during manual labeling, you have already been practicing open coding. This section formalizes and scales that process using your automated evaluators&#8217; output on new traces.</p><p>For each failed trace, write a short, informal note describing what went wrong. Do this in your own words, free-form. These are called &#8220;open codes&#8221; in qualitative research. Keep writing until patterns emerge. Examples include &#8220;hallucinated product feature,&#8221; &#8220;wrong tool call,&#8221; &#8220;missed escalation,&#8221; &#8220;bad formatting,&#8221; or &#8220;wrong tone.&#8221;</p><p>Once you have enough open codes, use an LLM to help group them into higher-level categories (axial codes). 
For example, individual notes about &#8220;ignored user&#8217;s refund request,&#8221; &#8220;didn&#8217;t acknowledge frustration,&#8221; and &#8220;transferred too late&#8221; might cluster into a category called &#8220;human handoff issues.&#8221; Review and edit these categories yourself. Make labels specific and actionable. Merge or split until they feel right. Add a &#8220;none of the above&#8221; category so the LLM can signal gaps and help you discover new patterns. Ultimately, identify the most frequent categories. This tells you where to focus.</p><p>Not all failure categories are equally important. Your overall pass rate can be misleading. As Jason Liu warns, aggregate metrics lie <a href="https://www.decodingai.com/p/the-real-battle-tested-rag-playbook">[7]</a>. You need to look at each cluster individually.</p><p>Rank each error cluster using a <strong>2&#215;2 matrix</strong> with two dimensions:</p><ul><li><p><strong>Frequency (Volume):</strong> How often does this error occur?</p></li><li><p><strong>Severity (Impact):</strong> How bad is this error when it does occur?</p></li></ul><p>This gives you <strong>four quadrants</strong>:</p><ul><li><p><strong>High frequency + High severity:</strong> Your top priority. Fix these immediately. These are the errors that happen often and hurt the most.</p></li><li><p><strong>High frequency + Low severity:</strong> Important to address, but less urgent. They are annoying but not critical.</p></li><li><p><strong>Low frequency + High severity:</strong> Monitor closely. They don&#8217;t happen often, but when they do, the consequences are serious (e.g., safety issues, data leaks).</p></li><li><p><strong>Low frequency + Low severity:</strong> Deprioritize. These can wait.</p></li></ul><p>For a more nuanced prioritization, compute: <strong>Priority = Frequency &#215; Severity &#215; Business Value</strong>. A low-frequency error might jump to the top of the list if it directly impacts revenue or user safety. 
For example, a <em>&#8220;hallucinated pricing&#8221;</em> error might only happen 5% of the time, but its business impact is critical. This is far more important than a 30% <em>&#8220;formatting&#8221;</em> error that merely annoys users. Context matters: let business value break ties and override pure frequency counts.</p><p>This step also helps you surface problematic traces for review beyond user feedback. Your evaluators can proactively identify issues that users haven&#8217;t yet complained about. The goal is to turn vague impressions (&#8220;the app feels off&#8221;) into specific, ranked problems (&#8220;hallucination errors account for 25% of failures and have a critical business impact&#8221;).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W9W_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W9W_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 424w, https://substackcdn.com/image/fetch/$s_!W9W_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 848w, https://substackcdn.com/image/fetch/$s_!W9W_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 1272w, 
https://substackcdn.com/image/fetch/$s_!W9W_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W9W_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png" width="1201" height="1090" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1090,&quot;width&quot;:1201,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W9W_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 424w, https://substackcdn.com/image/fetch/$s_!W9W_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 848w, 
https://substackcdn.com/image/fetch/$s_!W9W_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!W9W_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f26204f-7b95-42c6-afb6-a64ec99a98b9_1201x1090.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Four-Quadrant Prioritization Framework with axes for Frequency (Volume) and Severity (User Satisfaction), and 
labels for each quadrant</figcaption></figure></div><p>For the highest-priority error categories, take action. Simple fixes might involve adjusting prompts, fixing tool configurations, or updating system instructions. Complex fixes might require redesigning the agent&#8217;s workflow, adding new tools, or restructuring the context. After fixing, add the previously failing traces to your dataset as regression test cases. Run your evaluators again to verify the fix worked and didn&#8217;t break anything else.</p><p>The first round of error analysis is a one-time investment of about 3&#8211;4 days. After the initial setup, 30 minutes per week is enough to review the latest failures and top categories, fix the easiest high-impact issues, and add or refine an evaluator only for stubborn, important problems. Re-run a full error analysis when you see a spike in failure rates, when user feedback reveals a new class of issues, or when your evaluators start feeling stale.</p><p>Over time, the flywheel converges. You fix the biggest problems first, your failure rates drop, and the remaining issues become smaller and less frequent. How often you need to run the flywheel depends on your online signals: are users satisfied, or are there anomalies?</p><p>&#128270; Based on our example, let&#8217;s suppose you run your LLM judge on 200 new production traces from the email/Slack assistant. It flags 60 failures. You write open codes for each. 
These are quick notes like <em>&#8220;replied to phishing link,&#8221;</em> <em>&#8220;mocked colleague&#8217;s achievement,&#8221;</em> <em>&#8220;leaked Annual Recurring Revenue (ARR) to external contact,&#8221;</em> or <em>&#8220;no reply to urgent CEO request.&#8221;</em> Then you use an LLM to cluster these into axial codes:</p><ul><li><p><strong>Tone &amp; Professionalism Issues</strong> (18 failures): hostile replies, dismissive responses, overly casual tone with executives.</p></li><li><p><strong>Security Awareness Failures</strong> (14 failures): engaging with phishing, falling for CEO impersonation scams, clicking malicious links.</p></li><li><p><strong>Information Leaks</strong> (10 failures): sharing confidential financials, disclosing unreleased product plans, revealing salary data to strangers.</p></li><li><p><strong>Missing/No Response</strong> (9 failures): ignoring urgent requests, leaving teammates blocked, not confirming time-sensitive deadlines.</p></li></ul><p>You rank them: Security Awareness Failures are high-frequency and high-severity (financial and safety risk), so they are the top priority. Tone issues are high-frequency but lower-severity. Information Leaks are lower-frequency but high-severity. You fix the top cluster first, then move down the list <a href="https://www.decodingai.com/p/the-real-battle-tested-rag-playbook">[7]</a>.</p><p>Sometimes your error analysis reveals that a single, generic evaluator isn&#8217;t enough. Different types of errors need different evaluators.</p><h2>Create Specialized Evaluators</h2><p>Real-world AI apps don&#8217;t do just one thing. A customer support bot handles refunds, shipping questions, account issues, escalation, and more. Each of these capabilities has its own definition of &#8220;good&#8221; and its own failure modes. </p><p>A single, generic binary evaluator (Pass/Fail on &#8220;overall quality&#8221;), like the one you built in the previous section, can catch broad issues but will miss category-specific problems. 
When your error analysis reveals distinct clusters of failures that require different evaluation criteria, it is time to create specialized evaluators.</p><p>Up to this point, your LLM judge has been a generic binary evaluator. It checks whether a trace is generally &#8220;good&#8221; or &#8220;bad.&#8221; But the error clusters from the previous section might reveal that you need:</p><ul><li><p>A judge specifically for &#8220;hallucination&#8221; (did the system make up information not in the context?)</p></li><li><p>A judge specifically for &#8220;escalation quality&#8221; (did the system hand off to a human at the right time, with the right context?)</p></li><li><p>A judge specifically for &#8220;tone&#8221; (was the system&#8217;s tone appropriate for a frustrated user?)</p></li></ul><p>Keep each specialized evaluator tightly scoped: each judge should evaluate only <strong>one specific failure mode</strong>. Do not build a single judge that tries to assess everything at once. 
This keeps each judge simple, debuggable, and accurate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dPv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dPv5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!dPv5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!dPv5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!dPv5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dPv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png" width="1200" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149268,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187935789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dPv5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 424w, https://substackcdn.com/image/fetch/$s_!dPv5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 848w, https://substackcdn.com/image/fetch/$s_!dPv5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 1272w, https://substackcdn.com/image/fetch/$s_!dPv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc69589-ab3a-4794-bbec-4da294db29b2_1200x978.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption">Step 6</figcaption></figure></div><p>Each specialized evaluator has its own rubric, its own set of few-shot examples (drawn from the critiques in your labeled dataset for that specific failure mode), and its own pass/fail definition. As with the generic judge, most of its effectiveness comes from the few-shot examples. But now those examples are curated specifically for one failure mode, making the judge even more precise.</p><p>Only create a new evaluator when error analysis shows a persistent, high-impact failure category that your generic evaluator can&#8217;t reliably detect. Do not create evaluators speculatively. Let the data tell you what you need. 
Pick 4&#8211;7 high-value failure modes that happen often enough to matter and don&#8217;t get reliably fixed by a simple prompt change <a href="https://hamel.dev/blog/posts/llm-judge/">[8]</a>. Start there and add more only when error analysis demands it.</p><p>&#128270; In our email assistant example, your generic evaluator catches broad failures, but the error analysis showed that <em>&#8220;Security Awareness Failures&#8221;</em> keep recurring even after prompt fixes. The agent still occasionally engages with sophisticated phishing attempts. You create a specialized evaluator scoped to just this failure mode: <em>&#8220;Did the agent reply to a message that shows signs of phishing, scam, or social engineering?&#8221;</em></p><p>It uses few-shot examples drawn specifically from your security-related failures (the &#8220;Prince Makumba&#8221; scam reply, the fake Google alert engagement, the CEO impersonation wire transfer). </p><p>Separately, you notice &#8220;Information Leaks&#8221; also persist. So you build a second evaluator: <em>&#8220;Did the agent disclose confidential company information (financials, roadmap, acquisitions, salaries) to an external or unverified contact?&#8221;</em> Each evaluator has its own few-shot examples, its own pass/fail definition, and checks exactly one thing.</p><p>Let&#8217;s see all of this in action with a hands-on demo.</p><h2>Demo</h2><p><a href="https://aligneval.com/">AlignEval</a> is an open-source tool created by Eugene Yan that embodies the error analysis framework we have been discussing. 
Its tagline: <em>&#8220;Making Evals Easy, Fun, and Semi-Automated&#8221;.</em> It provides a streamlined interface for the exact workflow this article teaches: look at your data, label it, evaluate outputs, and optimize your evaluators.</p><p><strong>Here is an end-to-end demo of how to label your dataset and build a binary LLM Judge with it</strong> &#8595;</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;46816ad5-acf6-4bc2-b62a-1956a44f0c3e&quot;,&quot;duration&quot;:null}"></div><p>The tool is open source and available at <a href="https://aligneval.com/">aligneval.com</a>, with the source code on GitHub (<a href="https://github.com/eugeneyan/align-app">eugeneyan/align-app</a>). You can try it for free with your own data or use the prompt below to quickly generate a CSV similar to the one from the demo:</p><pre><code>I want you to generate a CSV file with the following characteristics:
"""
* The CSV file must include the following columns:
   * id: Unique identifier for each row
   * input: Context used to generate output
   * output: Generated text to be evaluated
   * label: Ground truth (values optional but counts towards XP)
   * explanation: A one-sentence explanation on why we labeled the row as 0 (PASS) or 1 (FAIL)
* &#128680; The label column only accepts binary labels, either 0 or 1.
   * 0: Output PASSES your evaluation
   * 1: Output FAILS your evaluation
"""
that contains 100 rows

The goal of the CSV file is to implement a dataset to build an LLM Judge evaluator. 

We want to create some mock, synthetic data to conceptually show what labeling, evaluating, and optimizing the LLM judge would look like, based on this tool: https://aligneval.com/

Let's say that we collected data from a vertical assistant agent specialized in answering work emails and Slack messages. Thus, create 100 scenarios based on these dimensions:
* feature: email/slack
* scenario: executive, manager, colleague, spam email, phishing email, friend (as an exception)
* label: success/failure of properly answering the message

Where the input is a single email or Slack message or an email or Slack thread, but the output will ALWAYS be just the generated reply, whether it's email or Slack.

Make the labels a 50%/50% split between passes and fails.

Also, note that NO REPLY is the expected behavior for SPAM and phishing emails, as well as for non-essential or toxic emails and Slack messages.</code></pre><p>We used Claude Opus 4.6 within the Claude app to generate it.</p><h2>Next Steps</h2><p>Building an evals dataset is not a one-time task. It is a continuous flywheel driven by the error analysis framework. Start small, let the data guide you, and grow your dataset and evaluators iteratively.</p><p>The full flywheel is: create dataset &#8594; label with pass/fail and critiques &#8594; fix errors &#8594; build evaluators iteratively &#8594; run error analysis &#8594; create specialized evaluators &#8594; repeat. <strong>The key principle is: </strong><em><strong>&#8220;Put the data and metrics first, not preset criteria or LLMs.&#8221;</strong></em></p><p>Now that you know how to build and grow an evals dataset from real data, the next article will show you how to generate synthetic test examples, which is extremely useful before going to production or when you don't have enough users.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. 
<strong>Here is what&#8217;s ahead:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a> </p></li><li><p><strong>Build an AI Evals Dataset from Scratch</strong> &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p>
<div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Husain, H., &amp; Shankar, S. (2024, January 29). Evals Are Not All You Need. O&#8217;Reilly Radar. <a href="https://www.oreilly.com/radar/evals-are-not-all-you-need/">https://www.oreilly.com/radar/evals-are-not-all-you-need/</a></p></li><li><p>Husain, H. (2024, May 6). Why is error analysis so important in LLM evals and how is it performed?. Hamel&#8217;s Blog. 
<a href="https://hamel.dev/blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html">https://hamel.dev/blog/posts/evals-faq/why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed.html</a></p></li><li><p>Anthropic. (n.d.). Demystifying Evals for AI Agents. Anthropic. <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li><li><p>Lenny&#8217;s Podcast. (2024, June 16). Why AI evals are the hottest new skill for product builders | Hamel Husain &amp; Shreya Shankar. YouTube. <a href="https://www.youtube.com/watch?v=BsWxPI9UM4c">https://www.youtube.com/watch?v=BsWxPI9UM4c</a></p></li><li><p>Husain, H. (2024, May 14). Building Eval Systems That Improve. Lenny&#8217;s Newsletter. <a href="https://www.lennysnewsletter.com/p/building-eval-systems-that-improve?hide_intro_popup=true">https://www.lennysnewsletter.com/p/building-eval-systems-that-improve</a></p></li><li><p>Iusztin, P. (2025, February 11). The 5-Star Lie: You Are Doing AI Evals Wrong. Decoding AI Magazine. <a href="https://www.decodingai.com/p/the-5-star-lie-you-are-doing-ai-evals">https://www.decodingai.com/p/the-5-star-lie-you-are-doing-ai-evals</a></p></li><li><p>Iusztin, P. (2025, February 18). The Real Battle-Tested RAG Playbook. Decoding AI Magazine. <a href="https://www.decodingai.com/p/the-real-battle-tested-rag-playbook">https://www.decodingai.com/p/the-real-battle-tested-rag-playbook</a></p></li><li><p>Husain, H. (2024, May 22). Using LLM-as-a-Judge For Evaluation: A Complete Guide. Hamel&#8217;s Blog. <a href="https://hamel.dev/blog/posts/llm-judge/">https://hamel.dev/blog/posts/llm-judge/</a></p></li><li><p>Iusztin, P. (2025, February 25). The Mirage of Generic AI Metrics. Decoding AI Magazine. 
<a href="https://www.decodingai.com/p/the-mirage-of-generic-ai-metrics">https://www.decodingai.com/p/the-mirage-of-generic-ai-metrics</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Integrating AI Evals Into Your AI App]]></title><description><![CDATA[The holistic guide: From optimization to production monitoring]]></description><link>https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app</link><guid isPermaLink="false">https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 10 Feb 2026 15:38:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!giir!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><strong>Integrating AI Evals Into Your AI App </strong>&#8592; <em>You are here</em></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h1>Integrating AI Evals Into Your AI App</h1><p>Understanding where AI Evals and Observability fit into the broader scheme of things can be daunting. It certainly was for me. At first, it was confusing because you can use AI evals in so many places within your application. Also, everyone seemed to have a different definition. </p><p>But it does not have to be that complicated. With this article, we want to finally connect the dots on where AI Evals fit in your AI app holistically. 
</p><p>But first, let&#8217;s understand WHY AI Evals are so essential.</p><p>A few months ago, I had to completely rewrite <strong>Brown</strong>, a writer agent I built as one of the capstone projects for my <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering course</a>. The first version worked but was slow and expensive. So, I redesigned the architecture from scratch.</p><p>Immediately, I hit a wall. How would I know the new version was at least as good as the old one? I had spent months fine-tuning the original and could not afford to lose that progress silently.</p><p>That is when AI evals saved me. I wrote evaluators that scored the agent on dimensions tied to our actual business requirements. With those evals, every code change generated a clear signal indicating whether I was on track. Without them, the rewrite would have been a coin flip.</p><p>You have likely shipped the first version of your app. You got this far by <em>&#8220;vibe checking&#8221;</em> whether the app works. Up to this point, that is fine.</p><p>However, once you start adding new features, old features start to break. Once you have real users, they interact with the app in unexpected ways. If you have only 10 users, vibe checking works.</p><p>But as usage scales, you get overwhelmed. You try to improve current features, and it is incredibly hard to tell if your changes have any effect. Manually managing all of this is a living hell.</p><p>&#128161;<em> The solution is a structured way of measuring how well your app performs. 
This is known as AI Evals.</em></p><p><strong>In this article, we will cover:</strong></p><ol><li><p>A holistic view of the AI Evals lifecycle.</p></li><li><p>How to use evals for optimization during development.</p></li><li><p>How to use evals for regression testing in CI pipelines.</p></li><li><p>How to monitor production quality using sampling.</p></li><li><p>Common misconceptions regarding guardrails and benchmarks.</p></li><li><p>The recommended tech stack for implementing this system.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y_0d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" width="1200" height="1075" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1075,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100684,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 1: The holistic view of AI Evals within the AI application development lifecycle.</figcaption></figure></div><p><em>Before digging into the article, a quick word from our sponsor, Opik.</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: 
Open-Source Observability for Your Multimodal AI Agents (Sponsored)</a></h2><p>This <em><strong>AI Evals &amp; Observability series</strong></em> is brought to you by <em><strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong></em>, the open-source LLMOps platform used by Uber, Netflix, Etsy, and more.</p><p>We&#8217;re proud to partner with a tool we actually use daily across our open-source courses and real-world AI products. Why? <em>Because it makes evaluating multimodal AI apps as easy as evaluating text ones.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DvE9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 424w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 848w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DvE9!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png" width="1200" height="602.4725274725274" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:731,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DvE9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 424w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 848w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!DvE9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc89e180b-7253-4edd-91d3-5ef78afccff7_3020x1516.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Monitoring traces that contain generated videos, such as when using OpenAI Sora. 
<a href="https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Learn more about monitoring multimodal traces with Opik</a> or about <a href="https://www.comet.com/docs/opik/integrations/openai#video-generation-sora?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">hooking Opik to OpenAI Sora</a>.</figcaption></figure></div><p><em>AI apps are no longer just text-in, text-out.</em> They process images, generate videos, parse PDFs, and more. Monitoring and evaluating all of that used to be painful. With Opik, it&#8217;s not. Here is why we love it:</p><ul><li><p><strong>Trace everything</strong> &#8212; Opik renders images, videos and PDFs directly inside your traces. No more guessing what your model actually saw or generated. We use this daily, and it changed how we debug multimodal pipelines.</p></li><li><p><strong>Zero-friction multimodal evals</strong> &#8212; Add image URLs or upload files directly in the UI, then run LLM-as-a-Judge evaluations on them. Opik auto-detects vision-capable models (GPT-4o, Claude 3+, Gemini) and warns you if the model doesn&#8217;t support vision.</p></li><li><p><strong>Video generation? Traced automatically</strong> &#8212; Wrap your OpenAI client in one line, and Opik tracks the full Sora workflow: creation, polling, download, and logs the generated video as an attachment. Full visibility, minimal setup. <a href="https://www.comet.com/docs/opik/integrations/openai#video-generation-sora?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Guide here</a>.</p></li></ul><p><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> is fully open-source and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier). 
<em>Learn more about evaluating multimodal traces:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Evaluate multimodal traces&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/docs/opik/evaluation/evaluate_multimodal?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Evaluate multimodal traces</span></a></p><p>&#8595; <em>Now, let&#8217;s get back to the article.</em></p><div><hr></div><h2>The Holistic View of AI Evals</h2><p>At their heart, AI Evals are systematic data analysis applied to your LLM application. You look at the data flowing through your app, create metrics for what matters, and use those metrics to measure what is happening. This allows you to iterate, experiment, and improve with confidence rather than guessing <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[1]</a>, <a href="https://hamel.dev/blog/posts/evals-faq/">[2]</a>.</p><p>Without evals, every prompt change is a coin flip. With them, you have a concrete feedback signal to iterate against <a href="https://youtube.com/watch?v=BsWxPI9UM4c&amp;si=Zn5CgOvM_uqtTrF6">[3]</a>.</p><p>In this article, we will focus on the <em>where</em>, <em>when</em>, <em>why</em>, and <em>what</em>. In future articles from the series, we will focus on the <em>how</em>. There are <em><strong>three core scenarios</strong> where AI Evals play a central role</em>:</p><ol><li><p><strong>Optimization:</strong> During development, we use evals to measure and optimize current or new features.</p></li><li><p><strong>Regression:</strong> During development, when changing the code, we use evals to ensure our changes do not break previous features. 
This is conceptually similar to classic software tests.</p></li><li><p><strong>Production Monitoring:</strong> In production, we use evals to detect potential performance issues caused by unexpected user behavior or drift.</p></li></ol><p>Beyond these three, two complementary signals round out the picture. We touch on them briefly later, but they are not the focus of this article:</p><ol start="4"><li><p><strong>User Feedback:</strong> These are direct signals from users that bypass all our predefined datasets and evaluation strategies. They are the most valuable signal you can get.</p></li><li><p><strong>A/B Testing:</strong> This ensures new code changes perform as expected, tested on real user behavior rather than predefined datasets.</p></li></ol><p>This is illustrated in Image 1, which maps these concepts to the development lifecycle.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y_0d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png" width="1200" height="1075" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1075,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100684,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y_0d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 424w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 848w, 
https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_0d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e3e3f14-390a-4fcf-b449-d41b3e050fd8_1200x1075.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 1: The holistic view of AI Evals within the AI application development 
lifecycle.</figcaption></figure></div><p>Now that we have the big picture, let&#8217;s dig deeper into each of the three core scenarios, starting with optimization.</p><h2>Using Evals for Optimization</h2><p>The first major use case for AI Evals is optimizing your application on a specific feature during development.</p><p>To keep costs under control and run tests multiple times during development, we run the AI evals only on a <em>subset</em> of the dataset that targets the feature we want to optimize. This is usually triggered manually by the developer while testing new code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ChCe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ChCe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 424w, https://substackcdn.com/image/fetch/$s_!ChCe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 848w, https://substackcdn.com/image/fetch/$s_!ChCe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 1272w, https://substackcdn.com/image/fetch/$s_!ChCe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ChCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif" width="1200" height="773" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ChCe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 424w, https://substackcdn.com/image/fetch/$s_!ChCe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 848w, https://substackcdn.com/image/fetch/$s_!ChCe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 1272w, https://substackcdn.com/image/fetch/$s_!ChCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7176d199-2fb7-484d-9590-a07e1a631dc7_1200x773.gif 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 2: Evaluate one feature at a time during optimization.</figcaption></figure></div><p>Development is then guided by concrete numbers measured against a baseline, rather than by vibe checking.</p><p>Suppose you have a customer support bot and you want to improve how it handles refund requests. You do not run your evals on the entire dataset, which may also cover shipping questions, account issues, and technical support. 
Instead, you filter your evals dataset down to just the refund-related examples and iterate on that subset.</p><p>You tweak the prompt, run the evals, check if your &#8220;refund accuracy&#8221; metric improved compared to the baseline, tweak again, and repeat. This keeps each cycle fast and cheap, so you can iterate multiple times in a single session <a href="https://www.oreilly.com/radar/evals-are-not-all-you-need/">[4]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GmAN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GmAN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 424w, https://substackcdn.com/image/fetch/$s_!GmAN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 848w, https://substackcdn.com/image/fetch/$s_!GmAN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 1272w, https://substackcdn.com/image/fetch/$s_!GmAN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GmAN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png" width="1200" height="1047" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1047,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96307,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GmAN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 424w, https://substackcdn.com/image/fetch/$s_!GmAN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 848w, https://substackcdn.com/image/fetch/$s_!GmAN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 1272w, https://substackcdn.com/image/fetch/$s_!GmAN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcba3b70-a65a-4fca-9b71-d656112e2376_1200x1047.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 3: A flowchart detailing the iterative process of using AI Evals for optimization during development.</figcaption></figure></div><p>But what happens when your optimization work accidentally breaks something that was already working? That is where regression testing comes in.</p><h2>Using Evals for Regression</h2><p>Regression testing catches errors introduced by your changes before they reach production. 
Unlike optimization, which focuses on a subset, regression testing runs on the <strong>whole evaluation dataset</strong>.</p><p>This typically happens within the CI pipeline. Because running AI Evals on the entire dataset is costly (some evaluators rely on LLM calls to grade outputs), we avoid running them on every single commit, unlike standard software tests. A common approach is to run this suite when you think you are &#8220;done&#8221; with your feature, right before merging the feature branch, to ensure you are not introducing new bugs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!giir!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!giir!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 424w, https://substackcdn.com/image/fetch/$s_!giir!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 848w, https://substackcdn.com/image/fetch/$s_!giir!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 1272w, https://substackcdn.com/image/fetch/$s_!giir!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!giir!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png" width="1200" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70010,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!giir!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 424w, https://substackcdn.com/image/fetch/$s_!giir!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 848w, https://substackcdn.com/image/fetch/$s_!giir!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 1272w, https://substackcdn.com/image/fetch/$s_!giir!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0ebc005-6ff0-4d5d-97bf-98602b1a1bd1_1200x796.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 4: Evaluate all the features when computing regression scores.</figcaption></figure></div><p>Continuing the customer support bot example: after you optimized refund handling during the optimization phase, you now want to merge your changes. Before merging, your CI pipeline runs the full eval suite. This includes refunds, shipping questions, account issues, and escalation scenarios.</p><p>This catches regressions. Maybe your prompt change for refunds accidentally made the bot worse at routing shipping complaints to the right team. 
If any metric drops below the baseline threshold, the pipeline fails and blocks the merge until you fix it. This is similar to how Anthropic&#8217;s Claude Code team and Bolt&#8217;s AI team run separate eval suites for quality benchmarking and regression testing on each change <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">[1]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!20y0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!20y0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 424w, https://substackcdn.com/image/fetch/$s_!20y0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 848w, https://substackcdn.com/image/fetch/$s_!20y0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 1272w, https://substackcdn.com/image/fetch/$s_!20y0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!20y0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png" width="1200" height="1019" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1019,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91753,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!20y0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 424w, https://substackcdn.com/image/fetch/$s_!20y0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 848w, https://substackcdn.com/image/fetch/$s_!20y0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 1272w, https://substackcdn.com/image/fetch/$s_!20y0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7c58f2-f6eb-400c-958a-a36da6161c5a_1200x1019.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 5: A flowchart illustrating how AI Evals are used for regression testing within the CI pipeline, emphasizing an automated, pre-merge process on the entire dataset.</figcaption></figure></div><p>Optimization and regression testing happen during development, but what about after you deploy? Let&#8217;s look at how AI Evals work in production.</p><h2>Using Evals for Production Monitoring</h2><p>Production monitoring is similar to regression testing, but instead of running it offline on your AI Evals dataset, we aim to catch issues in the production environment using live traces tracked by your LLMOps platform. 
The end goal is to identify failures in the system and raise alerts or warnings.</p><p>To keep costs under control, you apply live sampling techniques within your LLMOps platform (e.g., <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>). You rarely want to evaluate 100% of production traffic with an LLM judge. Instead, you use:</p><ul><li><p><strong>Random sampling:</strong> Evaluate a fixed percentage of all traces (e.g., 5-10%) to get an unbiased baseline of overall quality.</p></li><li><p><strong>Stratified sampling:</strong> Divide traces into meaningful subgroups (by feature, user segment, query type) and sample from each proportionally, ensuring no critical category is overlooked.</p></li><li><p><strong>Signal-based sampling:</strong> Prioritize traces that show suspicious signals. These include long exchanges, repeated questions, user frustration indicators (thumbs-down, drop-offs from your user feedback pipeline), low confidence scores, or anomalous latency/cost spikes. These are the highest-value traces to review.</p></li></ul><p>You should run these evaluations as soon as practical. This can be near real-time or on a batch schedule (e.g., nightly), depending on risk tolerance and cost.</p><p>Suppose your customer support bot is live with thousands of conversations per day. Even with a 95% success rate, the remaining 5% of thousands of conversations still amounts to a hundred or more failures daily. That is far too many to review manually. Your LLMOps platform samples a percentage of live traces and automatically runs evaluators on them. For instance, you might check if the bot hallucinated a return policy that does not exist or if it escalated appropriately when the user was frustrated.</p><p>These evaluators flag problematic traces and feed dashboards that track failure rates over time. When you see a spike, you catch it within hours instead of waiting for user complaints to pile up. 
Perhaps a new product launch causes questions the bot wasn&#8217;t trained on <a href="https://humanloop.com/blog/why-your-product-needs-evals">[5]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h8bx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h8bx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 424w, https://substackcdn.com/image/fetch/$s_!h8bx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 848w, https://substackcdn.com/image/fetch/$s_!h8bx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 1272w, https://substackcdn.com/image/fetch/$s_!h8bx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h8bx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png" width="1200" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ac7959f-6130-4759-af62-8917c3437323_1200x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h8bx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 424w, https://substackcdn.com/image/fetch/$s_!h8bx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 848w, https://substackcdn.com/image/fetch/$s_!h8bx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 1272w, https://substackcdn.com/image/fetch/$s_!h8bx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac7959f-6130-4759-af62-8917c3437323_1200x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 6: A flowchart depicting the process of AI Evals for production monitoring on live traffic.</figcaption></figure></div><p>Beyond evaluators, you have two complementary signals. <strong>User Feedback</strong> (thumbs-up/down, comments) is the most valuable quality signal because it reflects real satisfaction, not proxy metrics. <strong>A/B Testing</strong> validates that improvements measured offline actually hold up under real user behavior by routing traffic to different variants.</p><p>Now that we understand where and when to run AI Evals, let&#8217;s clear up some common misconceptions that trip up most teams.</p><h2>Looking at Common Misconceptions</h2><p>There are three major areas where terminology gets confusing: guardrails, benchmarks, and software tests.</p><h3>Guardrails vs. 
Evaluators</h3><p>Guardrails run on the inputs and outputs of the LLM or other components of your AI app. These should be very fast to avoid adding extra latency. Their role is to flag inputs/outputs as valid or not, or mask sensitive data. Evaluators, on the other hand, are used to compute metrics on your AI app components.</p><p>While you can use evaluators as guardrails if they detect adherence to business outcomes, this is the exception, not the rule. Evaluators are usually designed for accuracy rather than low latency.</p><p>For example, a <strong>guardrail</strong> in your customer support bot checks every user message in real time. If the user pastes a credit card number, the guardrail masks it instantly. On the output side, another guardrail blocks any response that promises a refund above a certain threshold. These must run in milliseconds.</p><p>An <strong>evaluator</strong> runs after the fact (or offline on sampled traces). It measures whether the bot&#8217;s refund responses are actually accurate, helpful, and aligned with company policy. The evaluator can take seconds or even minutes per trace because it is not in the user&#8217;s critical path.</p><h3>App Evaluators vs. LLM Evaluators</h3><p><strong>App Evaluators</strong> measure your whole app as a unit (LLM calls + everything around them). They focus on ensuring the performance of your business use case.</p><p><strong>LLM Evaluators</strong> measure only the performance of the LLM itself, rarely considering your business use case. Popular benchmarks like the LLM arena evaluate only the LLM in isolation. That is why benchmarks are deceiving and should never be your only criterion when picking an LLM. 
They are often a marketing strategy for foundation model companies.</p><p>Benchmarks like Chatbot Arena (LMSYS) or MMLU tell you which LLM is &#8220;generally smarter.&#8221; But they say nothing about whether that LLM will handle your specific refund policy correctly, escalate frustrated users at the right moment, or respect your company&#8217;s tone of voice. You need app-level evaluators grounded in your business use case, not generic benchmark scores <a href="https://www.oreilly.com/radar/evals-are-not-all-you-need/">[4]</a>.</p><h3>Evaluator vs. Classic Software Tests</h3><p>When you run evaluators as regression tests, they are conceptually similar to classic software tests. Their purpose is to ensure everything still works after you change the code. However, the implementation is vastly different.</p><p>Classic software tests are deterministic. For a given state of the database and a given input, you almost always get the same output. They are also much cheaper and easier to run because the code itself is cheap to execute, and the outputs are structured and easy to validate.</p><p>AI Evaluators must assess the quality of LLM calls operating in a non-deterministic environment, often with unstructured data. Instead of being written as unit and integration test cases, AI eval cases are organized into eval datasets, reflecting the AI-centric approach.</p><p>With a clear mental model of what AI Evals are and aren&#8217;t, let&#8217;s look at the tools you need to put this into practice.</p><h2>So What Is the Tech Stack?</h2><p>To run AI Evals effectively, you need two core tool families: an annotation tool and an LLMOps platform.</p><p>First, <strong>should you build a custom annotation tool or use an off-the-shelf tool?</strong> Since your data is always custom, we recommend building the annotation tool from scratch. With current AI coding tools such as Claude Code, Cursor, or Lovable, doing this is extremely easy. 
You want to make annotation effortless, adding zero resistance to how your data is displayed. As your data is custom, no pre-defined tool can do that perfectly for you. Most LLMOps platforms will have a feature around this, but a custom lightweight tool often wins on speed and usability.</p><p>Second, you need an <strong>LLMOps platform</strong>. Our favorite vendor is <strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong>. It is what we recommend and use in all our products. It is open source, constantly updated with new features, works out of the box with popular LLM APIs and AI Frameworks, and offers a generous freemium plan.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ulUk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ulUk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 424w, https://substackcdn.com/image/fetch/$s_!ulUk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 848w, https://substackcdn.com/image/fetch/$s_!ulUk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ulUk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ulUk!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png" width="1200" height="740.3485254691689" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1841,&quot;width&quot;:2984,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:844784,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/187091808?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812df52-c552-4c09-bb79-75dc15e62617_2984x1930.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ulUk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 424w, https://substackcdn.com/image/fetch/$s_!ulUk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 848w, 
https://substackcdn.com/image/fetch/$s_!ulUk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 1272w, https://substackcdn.com/image/fetch/$s_!ulUk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694c3e3-3ada-4393-99c6-bbc515ec4041_2984x1841.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 7: Tracking multimodal traces with <a 
href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>.</figcaption></figure></div><p>Other strong options include <strong>LangSmith</strong>, which is best for the LangChain ecosystem, and <strong>Langfuse</strong>, another solid open-source alternative. We have also heard good things about Braintrust and Arize.</p><p>The reality is that most of the time, you should pick the best tool for your current setup. We use Opik, but most of these tools have overlapping features. Choose the one that best fits your ecosystem and connections.</p><h2>Next Steps</h2><p>AI Evals are not optional. They are a structured, repeatable way to ensure your AI app actually works, both during development and in production.</p><p>Now that we understand the <em>where</em>, <em>when</em>, <em>why</em>, and <em>what</em> of AI Evals, the <a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">next article</a> will focus on the <em>how</em>. Specifically, we will dive into how to gradually build an evals dataset.</p><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. 
<strong>Here is what&#8217;s ahead:</strong></p><ol><li><p><strong>Integrating AI Evals Into Your AI App</strong> &#8592; <em>You just finished this one</em></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><a href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability">Lessons from 6 Months of Evals on a Production AI Companion</a></p></li></ol><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p><strong>Go from zero to production-grade AI agents</strong> with the <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>. Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have <strong>built a multi-agent system</strong> and a <strong>capstone project</strong> where you apply everything you&#8217;ve learned on your own.</p><p><em>Three portfolio projects and a certificate to showcase in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 290+ early students &#8212; <em>&#8220;Every AI Engineer needs a course like this.&#8221;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>References</h2><ol><li><p>Anthropic. (n.d.). Demystifying evals for AI agents. Anthropic. <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></p></li><li><p>Husain, H. (n.d.). LLM Evals: Everything You Need to Know (Evals FAQ). Hamel&#8217;s Blog. <a href="https://hamel.dev/blog/posts/evals-faq/">https://hamel.dev/blog/posts/evals-faq/</a></p></li><li><p>Lenny&#8217;s Podcast. (n.d.). Why AI evals are the hottest new skill for product builders | Hamel Husain &amp; Shreya Shankar. YouTube. 
<a href="https://www.youtube.com/watch?v=BsWxPI9UM4c">link</a></p></li><li><p>Reganti, A. N., &amp; Badam, K. (2025, January 28). Evals Are NOT All You Need. O&#8217;Reilly. <a href="https://www.oreilly.com/radar/evals-are-not-all-you-need/">https://www.oreilly.com/radar/evals-are-not-all-you-need/</a></p></li><li><p>Habib, R. (2024, March 14). Why Your AI Product Needs Evals with Hamel Husain. Humanloop Blog. <a href="https://humanloop.com/blog/why-your-product-needs-evals">https://humanloop.com/blog/why-your-product-needs-evals</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Behind the Scenes of AI Observability in Production]]></title><description><![CDATA[What actually works after 6 months of trial and error]]></description><link>https://www.decodingai.com/p/behind-the-scenes-of-ai-observability</link><guid isPermaLink="false">https://www.decodingai.com/p/behind-the-scenes-of-ai-observability</guid><dc:creator><![CDATA[Alejandro Aboy]]></dc:creator><pubDate>Tue, 03 Feb 2026 12:00:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0pKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">AI Evals &amp; Observability series</a></strong>: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.</em></p><p>&#129488; Everyone says you need AI evals. Few explain how to actually build them and answer questions such as&#8230;</p><p>How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? 
...and most importantly, how do we stop &#8220;vibe checking&#8221; and leverage evals to actually track and optimize our app?</p><p><em>This 7-article series breaks it all down from first principles:</em></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a><strong> </strong></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch  </a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator </a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><strong>Lessons from 6 Months of Evals on a Production AI Companion</strong> &#8592; <em>You are here</em></p></li></ol><p>By the end, you&#8217;ll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!</p><p><strong>Let&#8217;s get started.</strong></p><div><hr></div><h1>Lessons from 6 Months of Evals on a Production AI Companion</h1><p><strong>Paul:</strong> Today, the stage belongs to <a href="https://substack.com/profile/22949723-alejandro-aboy?utm_source=global-search">Alejandro Aboy</a>, Senior Data Engineer at Workpath. Alejandro owns the entire data stack and is the architect behind the <em>Workpath AI Companion</em>.</p><p>When he isn&#8217;t shipping, he&#8217;s sharing engineering deep-dives at his publication, <a href="https://thepipeandtheline.substack.com/?utm_campaign=profile_chips">The Pipe &amp; The Line</a>.</p><p><em>Enough chitchat. 
Let&#8217;s get into it </em>&#128064; &#8595;</p><div><hr></div><p><strong>Alejandro:</strong> Over the last year, I&#8217;ve been working on different agent projects, but one took most of my attention.</p><p>We are talking about the main AI feature of the company&#8217;s SaaS, the companion that guides you through everything you can do and can also do it for you.</p><p>It can call around 50 different tools, run RAG searches over the documentation, and save workflows in memory so you don&#8217;t have to start from scratch. It can even get contextual messages from the frontend, so it knows what you are seeing on the platform at that moment.</p><p>As you can imagine, a lot can go wrong when you give Agents too much power. These are some of the things I noticed:</p><ul><li><p>Agent complies with formatting but hallucinates documentation links after pretending to search the docs</p></li><li><p>Agent infers the wrong user information and makes the wrong tool calls or argues no information can be found</p></li><li><p>Agent suggests follow-up actions that are out of scope, like running tool calls that don&#8217;t exist</p></li></ul><p>There are lots of Agent demos out there, and 2026 probably won&#8217;t stop the hype. But we barely discuss these nuances, or how to address them in production so that we can keep improving our Agents while knowing what actually happens after we change a prompt or add a new tool.</p><p><em><strong>To address that, here is what we will learn in this article:</strong></em></p><blockquote><ul><li><p><strong>Problems with AI Observability</strong></p><ul><li><p>Falling For Classic Metrics Or Trying To Define What&#8217;s &#8220;Good&#8221; or &#8220;Bad&#8221;</p></li><li><p>Not Going Through Manual Annotations</p></li><li><p>Not Treating Your AI Agents As A Data Product</p></li></ul></li><li><p><strong>Implementing AI Observability</strong></p><ul><li><p>Opik Overview</p></li><li><p>Opik MCP Server</p></li><li><p>Figuring Out 
Evaluation Criteria</p></li><li><p>Refining Evaluation Criteria</p></li><li><p>Running Annotation Sessions</p></li><li><p>Making Sense Of Annotated Feedback</p></li></ul></li><li><p><strong>Backstory Of This Framework</strong></p></li></ul></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7ae_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7ae_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 424w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 848w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1272w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7ae_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png" width="1456" height="1768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7ae_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 424w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 848w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1272w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>But first, a quick word from our sponsor, Opik</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Evals &amp; Observability</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. </p><p>But most importantly, we are incredibly grateful to be supported by a tool that we personally love and keep returning to for all our open-source courses and real-world AI products. 
<em>Why?</em> Because it makes escaping the PoC purgatory possible!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik 
Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 
10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p>Here is how Opik helps us ship AI workflows and agents to production:</p><ul><li><p><em>We see everything</em> - Visualize complete traces of LLM calls, including costs and latency breakdowns at each reasoning step.</p></li><li><p><em>Easily optimize our system</em> - Measure our performance using custom LLM judges, run experiments, compare results and pick the best configuration.</p></li><li><p><em>Catch issues quickly</em> - Plug in the LLM Judge metrics into production traces and receive on-demand alarms.</p></li><li><p><em>Stop manual prompt engineering</em> - Their prompt versioning and optimization features allow us to track and improve our system automatically. The future of AutoAI.</p></li></ul><p><em><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> is fully open-source and works with custom code or most AI frameworks. 
You can also use the managed version for free (w/ 25K spans/month on their generous free tier).</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><div><hr></div><h2>Problems With AI Observability</h2><p>As mentioned, there are many things you don&#8217;t see about how your Agent is working, and depending on usage and scale, you are likely to miss most of them.</p><blockquote><p><em>The problem won&#8217;t go away by reviewing everything manually or finding the perfect sample to evaluate.</em></p></blockquote><p>There are some problems I learned about while implementing observability at work. Great experts on the topic have covered them before, but here is how I experienced them:</p><h3>Falling for Classic Metrics or Trying to Define What&#8217;s &#8220;Good&#8221; or &#8220;Bad&#8221;</h3><p>Tools like Opik or Langfuse ship with the usual metrics out of the box, such as <em>Hallucination</em>, <em>AnswerRelevance</em>, <em>ContextPrecision</em> and <em>ContextRecall</em>.</p><p>You can configure them and start evaluating right away, but when you get a hallucination score of 1, then what?</p><p>You need to figure out what to evaluate for your specific use case; otherwise, you can&#8217;t really improve it.</p><p>By this point, if you have used LLMs for anything, you have probably learned about non-determinism and its nuances. 
You can run the same thing over and over and get different outputs.</p><p>The same goes for evaluations: you need a standard way of evaluating that doesn&#8217;t leave you struggling with ambiguity every single time.</p><p>That&#8217;s why trying to evaluate every nuance and flavour of a metric might cause confusion.</p><p>You need to keep it lean. The best test is to ask different people to judge the same output: if they give the same answer, you have a good metric.</p><blockquote><p><em>For example, I wanted to evaluate if the agent was hallucinating documentation links. My evaluation task was: Check if the </em><code>search_knowledge</code><em> tool was called and verify that the URL in the output matches the tool output the agent used. That&#8217;s binary and can&#8217;t be misinterpreted even if you run the LLM as Judge multiple times.</em></p></blockquote><h3>Not Going Through Manual Annotations</h3><p>Unfortunately, you can&#8217;t get away without reviewing Agent responses manually.</p><p>If you automate 100% of those reviews with an LLM as Judge, you are likely to miss the substance, because LLMs might hide interesting findings behind isolated prompts that could be phrased badly.</p><blockquote><p><em>For example, all my metric scores were mostly fine, but when I started reading some conversations manually, I noticed the agent suggesting things it can&#8217;t do at all, or answering things outside its scope. 
That surfaced more evaluation opportunities, but also more things to add to the backlog that we could not have seen otherwise.</em></p></blockquote><p>With all the MCP servers out there, you can easily cluster all the manual feedback you collected and prepare documents for further analysis, so there&#8217;s no excuse for <em>&#8220;who will review all these manual comments later?&#8221;</em></p><h3>Not Treating Your AI Agents as a Data Product</h3><p>The worst mistake you can make is to evaluate an AI Agent and take nothing away from it besides good or bad metric scores.</p><p>Each conversation is writing an invisible roadmap you need to materialise.</p><p>That could be a Jira ticket to address a bug, or a development plan for a brand new Agent capability.</p><p>I personally developed a lot of Product Analytics use cases to cover the internal analytics done by CS teams, which allowed a lot of reverse engineering to discover new use cases.</p><p>If you just create synthetic data to evaluate a prompt and leave observability at that point, you are missing out on real production data that screams improvements right in your face.</p><p>So shift your mindset and own the process of defining success for each of your AI Agent projects ;)</p><p><strong>Recommended</strong>:</p><ul><li><p><a href="https://www.decodingai.com/p/the-5-star-lie-you-are-doing-ai-evals">Decoding AI Magazine - The 5-Star Lie: You Are Doing AI Evals Wrong</a></p></li><li><p><a href="https://www.decodingai.com/p/the-mirage-of-generic-ai-metrics">Decoding AI Magazine - The Mirage of Generic AI Metrics</a></p></li></ul><div><hr></div><h2>Implementing AI Observability</h2><p>In the upcoming sections of this article, we will talk about <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, but these fundamentals are relevant for any other observability tool.</p><p>It&#8217;s quite common in Software Engineering to use tools 
such as Sentry for error logging.</p><h3>Opik Overview</h3><p>Tools like Opik follow the same principles: <strong>traces &amp; spans.</strong></p><p>You trace every single LLM interaction (<strong>trace</strong>), whether it is an AI Agent or just a simple LLM call you want to audit.</p><p>You can break it down to see everything that got used along the way (<strong>span</strong>), which translates into agent tools, such as retrieving documents with RAG.</p><p>All traces are grouped into <strong>threads</strong>, each of which corresponds to a chat or conversation and keeps its messages together. When reviewing LLMs, it&#8217;s relevant to look at the thread level to get a grasp of how the agent worked given the full context.</p><p>Beyond the trace &lt;&gt; span pair we talked about, with Opik you can:</p><ul><li><p>Do <strong><a href="https://www.comet.com/docs/opik/prompt_engineering/prompt_management?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">prompt versioning</a></strong>, which comes in handy for linking changes in metric scores to a particular prompt version.</p></li><li><p>Connect your <strong><a href="https://www.comet.com/docs/opik/configuration/ai_providers?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">AI provider</a></strong> (OpenAI, Anthropic, Azure, etc.) in order to run <strong><a href="https://www.comet.com/docs/opik/production/rules?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Online Evaluations</a></strong> every time a new trace comes in, so you get fresh metrics on how your Agent is working in production.</p></li><li><p>Use its <strong><a href="https://www.comet.com/docs/opik/contributing/guides/python-sdk?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">SDK</a></strong><a href="https://www.comet.com/docs/opik/contributing/guides/python-sdk"> &amp; </a><strong><a 
href="https://www.comet.com/docs/opik/contributing/guides/python-sdk?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">REST API</a></strong> to scale any of its processes, from trace processing to running on-demand evaluations using LLMs.</p></li></ul><p>I am just scratching the surface here; you can do a lot more and I encourage you to take a look!</p><p><strong>Recommended</strong>:</p><ul><li><p><a href="https://www.decodingai.com/p/observability-for-rag-agents">Decoding AI Magazine - Observability for RAG Agents</a></p></li></ul><h3>Opik MCP Server</h3><p>Using the MCP Server is optional, like any other MCP server, but I found it super powerful for doing things that would have taken me weeks or even months.</p><p>I found ways of leveraging it to come up with metric ideas based on my prompt and trace data, and to pick up comments to quickly write documentation to review for potential product roadmaps.</p><p>You can get creative pretty easily and I invite you to try it.</p><p>Take a look at <a href="https://github.com/comet-ml/opik-mcp">GitHub - opik-mcp</a> to get started.</p><p>For the upcoming sections we will be using some slash commands that wrap workflows to make the annotation process more streamlined and enriched.</p><p>You can find them here: <a href="https://thoracic-hellebore-9a3.notion.site/WIP-Opik-MCP-Commands-2d880c67261e803cb314c9d8185300e7?pvs=74">Opik MCP Commands</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KExG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!KExG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 424w, https://substackcdn.com/image/fetch/$s_!KExG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 848w, https://substackcdn.com/image/fetch/$s_!KExG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 1272w, https://substackcdn.com/image/fetch/$s_!KExG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KExG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png" width="724" height="661.5143843498274" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:869,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:114522,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KExG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 424w, https://substackcdn.com/image/fetch/$s_!KExG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 848w, https://substackcdn.com/image/fetch/$s_!KExG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 1272w, https://substackcdn.com/image/fetch/$s_!KExG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd6b74d-dd35-45a1-8531-810c28168e3b_869x794.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>&#9888;&#65039; <strong>Disclaimer</strong>: Opik MCP tools return a lot of data. You are likely to see this message: &#8220;&#9888; Large MCP response (~10.6k tokens), this can fill up context quickly&#8221;.</em></p><p><em>The </em><code>CLAUDE.md</code><em> instructions are meant to tell the LLM how to use tool calling to sample data gradually, so the context does not fill up too quickly. 
The ideal workflow is to sample some threads, keep going as needed, and save checkpoints of the analysis to .md files for future reference.</em></p></blockquote><h3>Figuring Out Evaluation Criteria</h3><p>On Opik, you can run evaluations at the trace or thread level. This means you can evaluate single-turn (user message &gt; agent message) or multi-turn interactions.</p><p>But first we need to understand what our Agent is actually doing.</p><p>If we don&#8217;t know the use cases it&#8217;s covering, it&#8217;s really hard to know how to evaluate it accordingly.</p><blockquote><p><em>Note: All these commands are useful with real production traces, not synthetic data. You want to know what the agent is being used for, not what you want it to be used for.</em></p></blockquote><p>To achieve this, you need a prompt created on Opik so you can use Claude Code or Cursor to reverse engineer it and figure out the best evaluation criteria principles to cluster.</p><p>When you run <code>opik-eval</code>, you get something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0pKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0pKO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 424w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 848w, 
https://substackcdn.com/image/fetch/$s_!0pKO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1272w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png" width="1456" height="1589" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5060204,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0pKO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 424w, 
https://substackcdn.com/image/fetch/$s_!0pKO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 848w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1272w, https://substackcdn.com/image/fetch/$s_!0pKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6ea9e7d-8f97-45ec-b767-391bab951a08_3680x4016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now that your prompt is fully analysed, the next step is to break down how to evaluate it based on usage.</p><p>It will then propose metrics for the trace and thread levels, which look like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1rBT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1rBT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 424w, https://substackcdn.com/image/fetch/$s_!1rBT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 848w, https://substackcdn.com/image/fetch/$s_!1rBT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 1272w, https://substackcdn.com/image/fetch/$s_!1rBT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1rBT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png" width="1456" height="1618" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1618,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4535996,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1rBT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 424w, https://substackcdn.com/image/fetch/$s_!1rBT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 848w, https://substackcdn.com/image/fetch/$s_!1rBT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 1272w, https://substackcdn.com/image/fetch/$s_!1rBT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3c28960-9138-4c01-b2e1-6bc1c42cf0c8_3372x3748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The commands are expected to propose around 5 metrics for trace evaluation, but you can tune that number to your sweet spot depending on how many evaluation metrics you want to handle simultaneously.</p><blockquote><p><em>Also, automating evaluation on the thread level can be tricky, since it&#8217;s hard to evaluate threads without too much ambiguity. Based on my experience, I only evaluate threads manually, since that&#8217;s where most of the subjective value lives.</em></p></blockquote><p>You can iterate and reuse the template as many times as you need to come up with the most suitable evaluation angles for your project. 
When you are ready, you can configure your <a href="https://www.comet.com/docs/opik/production/rules?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Online Evaluations</a>. This means deciding on:</p><ul><li><p><em><strong>scope</strong></em>: Trace or thread.</p></li><li><p><em><strong>model</strong></em>: The one you configured in your <a href="https://www.comet.com/docs/opik/configuration/ai_providers?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">AI provider</a>.</p></li><li><p><em><strong>prompt type</strong></em>: This is where you paste the output proposed by the Claude Commands, or you can use one of the default prompts Opik provides, such as Hallucination or Answer Relevance.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zsQa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zsQa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 424w, https://substackcdn.com/image/fetch/$s_!zsQa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 848w, https://substackcdn.com/image/fetch/$s_!zsQa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zsQa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zsQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png" width="728" height="563.9170984455959" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:772,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:57819,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zsQa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 424w, https://substackcdn.com/image/fetch/$s_!zsQa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!zsQa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 1272w, https://substackcdn.com/image/fetch/$s_!zsQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77a062cc-9e7d-4ca9-97eb-1d0d2dddff3a_772x598.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can add new variables, such as <code>context</code> (seen in the example output), by writing <code>{{context}}</code> in the prompt.</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!21KD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!21KD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 424w, https://substackcdn.com/image/fetch/$s_!21KD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 848w, https://substackcdn.com/image/fetch/$s_!21KD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 1272w, https://substackcdn.com/image/fetch/$s_!21KD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!21KD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png" width="724" height="346.95584415584415" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:770,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:41266,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!21KD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 424w, https://substackcdn.com/image/fetch/$s_!21KD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 848w, https://substackcdn.com/image/fetch/$s_!21KD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 1272w, https://substackcdn.com/image/fetch/$s_!21KD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842f9a64-64d4-4f8c-a44c-fb2599100ef4_770x369.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When you click on &#8220;Select a key from recent trace&#8221;, it lists the keys available in your tracked traces. These keys depend on the framework you use: LangChain will expose different keys than the OpenAI SDK or other frameworks.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AedY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!AedY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 424w, https://substackcdn.com/image/fetch/$s_!AedY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 848w, https://substackcdn.com/image/fetch/$s_!AedY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 1272w, https://substackcdn.com/image/fetch/$s_!AedY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AedY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png" width="728" height="239.4862385321101" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5167d40-3089-4e29-83ef-159332bda4ac_763x251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:251,&quot;width&quot;:763,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:44289,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AedY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 424w, https://substackcdn.com/image/fetch/$s_!AedY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 848w, https://substackcdn.com/image/fetch/$s_!AedY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 1272w, https://substackcdn.com/image/fetch/$s_!AedY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5167d40-3089-4e29-83ef-159332bda4ac_763x251.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Next, you define the score with a name, a short description of what the score is about, and its metric type (boolean or categorical).</p><div 
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ROET!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ROET!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 424w, https://substackcdn.com/image/fetch/$s_!ROET!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 848w, https://substackcdn.com/image/fetch/$s_!ROET!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 1272w, https://substackcdn.com/image/fetch/$s_!ROET!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ROET!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png" width="728" height="187.17312661498707" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:199,&quot;width&quot;:774,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:23253,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F925dcaa3-ed45-41b4-823e-b697f4e8c077_774x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ROET!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 424w, https://substackcdn.com/image/fetch/$s_!ROET!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 848w, https://substackcdn.com/image/fetch/$s_!ROET!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 1272w, https://substackcdn.com/image/fetch/$s_!ROET!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F613a04db-1ee9-492f-9128-01cf1f1391a2_774x199.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Lastly, you can configure how many traces you want to evaluate, since you might want to sample a representative amount instead of just evaluating everything. 
Filters on different trace dimensions are allowed, which can help you clean up and avoid evaluation pollution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r6Ep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r6Ep!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 424w, https://substackcdn.com/image/fetch/$s_!r6Ep!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 848w, https://substackcdn.com/image/fetch/$s_!r6Ep!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 1272w, https://substackcdn.com/image/fetch/$s_!r6Ep!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r6Ep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png" width="713" height="330.9361147327249" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:767,&quot;resizeWidth&quot;:713,&quot;bytes&quot;:37492,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!r6Ep!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 424w, https://substackcdn.com/image/fetch/$s_!r6Ep!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 848w, https://substackcdn.com/image/fetch/$s_!r6Ep!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 1272w, https://substackcdn.com/image/fetch/$s_!r6Ep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef0691b-411d-4213-b5e7-d52a9391ecb3_767x356.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>Based on all my experiments, the ideal evaluation metric focuses on ONE thing in the prompt and ONE score definition. 
Metric type can be flexible as long as you avoid subjective outcomes.</em></p></blockquote><p>Going back to the Claude Command output, you will get a summary of what cannot be evaluated based on the trace information it found. This is helpful for extra brainstorming on what to actually evaluate:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6QNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6QNn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 424w, https://substackcdn.com/image/fetch/$s_!6QNn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 848w, https://substackcdn.com/image/fetch/$s_!6QNn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 1272w, https://substackcdn.com/image/fetch/$s_!6QNn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6QNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png" width="1456" height="1127" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1127,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4059620,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6QNn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 424w, https://substackcdn.com/image/fetch/$s_!6QNn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 848w, https://substackcdn.com/image/fetch/$s_!6QNn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 1272w, https://substackcdn.com/image/fetch/$s_!6QNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b087da-bdd3-422a-b5e1-71ca94e3aa44_3680x2848.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Refining Evaluation Criteria</h3><p>Now that you have your online evaluations defined and active, your traces will start getting evaluated outputs.</p><p>Versioning Online Evaluations is not possible, so you need to document any change you make somewhere else.</p><p>For example, my &#8220;<em>Response Format Compliance</em>&#8221; metric evaluates whether the agent follows the prompt formatting standards properly.</p><p>When I saw this pattern, I checked 10-15 traces at random and realised the condition for scoring 1 was &#8220;to match at least 2 of 5 formatting standards&#8221;, so I refined it to be &#8220;to match ALL&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!yATe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yATe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 424w, https://substackcdn.com/image/fetch/$s_!yATe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 848w, https://substackcdn.com/image/fetch/$s_!yATe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 1272w, https://substackcdn.com/image/fetch/$s_!yATe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yATe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png" width="755" height="368" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:755,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39711,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yATe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 424w, https://substackcdn.com/image/fetch/$s_!yATe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 848w, https://substackcdn.com/image/fetch/$s_!yATe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 1272w, https://substackcdn.com/image/fetch/$s_!yATe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb0af29-7344-4bb5-b97d-661683f26876_755x368.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If after some days the score is fixed at 1 and, after further review, the score makes sense, I would conclude this metric is either saturated or not relevant, since it is always going to be OK.</p><p>I like to go through this manually since Opik dashboards are quick to scan and show where to deep dive, but you can also iterate through this in a conversation with your AI IDE.</p><blockquote><p><em>Once you have the rationale for these changes, I recommend leaving some notes on Google Docs, Notion or Confluence pages to track some sort of version control over time.</em></p></blockquote><h3>Running Annotation Sessions</h3><p>Annotation Sessions can be painful and boring.</p><p>After manually reviewing 500 conversations, I decided to come up with a slash command to set up a mindset on what the last days would 
look like before I even jumped to review.</p><p>This helped me a lot because I was always able to walk through the UI and filter for patterns to better conduct my reviews.</p><blockquote><p><em>Note that you need to have <a href="https://www.comet.com/docs/opik/configuration/configuration/feedback_definitions?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Feedback Definitions</a> in place, which are different from Online Evaluations.<br><br>While Online Evaluations are triggered as traces or threads arrive, Feedback Definitions are used when running manual annotation to give feedback on how the Agents are performing.</em></p></blockquote><p>Running <code>opik-weekly</code> gets you an output like this one:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!syEd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!syEd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 424w, https://substackcdn.com/image/fetch/$s_!syEd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 848w, https://substackcdn.com/image/fetch/$s_!syEd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 1272w, 
https://substackcdn.com/image/fetch/$s_!syEd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!syEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png" width="1456" height="2227" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2227,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5204817,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!syEd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 424w, https://substackcdn.com/image/fetch/$s_!syEd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 848w, 
https://substackcdn.com/image/fetch/$s_!syEd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 1272w, https://substackcdn.com/image/fetch/$s_!syEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c340ca-8411-4b04-b65e-44dda471f4dd_3332x5096.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You get a first glance at the feedback score overview, plus any latency outliers and token usage worth a deeper dive.</p><p>Then you 
get some anomaly analysis on thread level with some WoW (Week Over Week) comparisons:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7ae_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7ae_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 424w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 848w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1272w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7ae_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png" width="1456" height="1768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5309919,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7ae_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 424w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 848w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1272w, https://substackcdn.com/image/fetch/$s_!7ae_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe933f21-cabc-42a3-8663-a5e83cd57c86_3680x4468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With the truncated messages, you can go to the Opik Threads or Trace views and filter by message, so it&#8217;s quick to analyse any anomalies or outliers you found interesting.</p><h3>Making Sense of Annotated Feedback</h3><p>You can get a quick start before jumping into annotations.</p><p>Running <code>opik-annotation-review</code> gets you an output like this one:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zt-V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zt-V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 424w, https://substackcdn.com/image/fetch/$s_!zt-V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 848w, https://substackcdn.com/image/fetch/$s_!zt-V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 1272w, https://substackcdn.com/image/fetch/$s_!zt-V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zt-V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png" width="1456" height="2657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7145311,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thepipeandtheline.substack.com/i/182844040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zt-V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 424w, https://substackcdn.com/image/fetch/$s_!zt-V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 848w, https://substackcdn.com/image/fetch/$s_!zt-V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 1272w, https://substackcdn.com/image/fetch/$s_!zt-V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf425b11-41a0-4ca9-94da-eaa6b1a109fe_3680x6716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This gives you an overview of what was commented on last time. It is also useful when connected to another MCP, such as Atlassian, to create tickets on JIRA or update documents on Confluence.</p><p>It&#8217;s also helpful for quickly checking backlogs to validate whether the most concerning highlights are being considered for further development.</p><h2>Backstory of This Framework</h2><p>It&#8217;s been almost a year since I implemented the first agent projects at work.</p><p>I am using PostgreSQL for memory and session tracking of conversation history, so it&#8217;s quite easy to run a few wrapped SQL queries to get an idea of usage for Product Analytics use cases.</p><p>I did that&#8230; for around 6 months.</p><blockquote><p><em>But just saying that we have 100 messages a day or 20 new users every 
week, I wasn&#8217;t getting any input on how to improve the agents.</em></p></blockquote><p>Motivated by frustration, I started researching and experimenting with AI Observability.</p><p>I read a lot and followed a lot of projects implementing tools like Opik or Langfuse.</p><p>I saw perfect evaluation scores on top of synthetic datasets and tried sample projects with really nicely constructed conversation threads.</p><p>And then I started hitting walls:</p><ul><li><p>I built scripts (meant to run locally) to run LLM As Judge evaluations at scale.</p></li><li><p>I did Prompt Versioning manually, copy-pasting my changed prompt.</p></li><li><p>I wondered how to create golden datasets while I had a bunch of real data to actually reverse engineer the process.</p></li><li><p>I was guessing what to evaluate, without actually looking at how my prompts were reacting to user messages.</p></li><li><p>Not knowing which threads to actually annotate because I had no system in place, I ended up annotating everything.</p></li></ul><p>I got super frustrated because I was getting value out of it, but at the cost of being exhausted from a lot of manual back-and-forth.</p><blockquote><p><em>All those tutorials out there were showing 1% of running AI Observability as an iterative process, when it&#8217;s a cyclic process that continuously evolves.</em></p></blockquote><p>Then I discovered Online Evaluations, started experimenting with Claude Code and Opik MCP, and everything clicked.</p><p>At some point, I was lucky to go through the whole manual process because I got a lot of input that helped me figure out evaluation criteria faster, but today I would approach it completely differently, leveraging the right tools and saving a lot of time.</p><p><em>All these pains led me to come up with these <a href="https://thoracic-hellebore-9a3.notion.site/WIP-Opik-MCP-Commands-2d880c67261e803cb314c9d8185300e7">Opik MCP Commands in Notion</a> to:</em></p><ul><li><p>Make a data-driven, reverse-engineering process of the prompt you want 
to evaluate, so you come up with solid evaluation criteria.</p></li><li><p>Have an annotation agenda before sitting down to review manually, so the process is streamlined and not easily sidetracked by distractions.</p></li><li><p>Know more about what your agent does, without having to be everywhere, and get time back to draw your roadmap, treating it as a product and not just another thing you have to maintain.</p></li></ul><blockquote><p><em>Nothing about this framework is written in stone, and it&#8217;s meant to be super flexible. Before getting it to the Notion page, I changed it a bunch of times, and it will keep evolving as the AI landscape gets more powerful.</em></p></blockquote><p>Also, remember that this article is part of a <strong><a href="https://www.decodingai.com/t/ai-evals-and-observability">7-piece series on AI Evals &amp; Observability</a></strong>. <strong>Here is the full series:</strong></p><ol><li><p><a href="https://www.decodingai.com/p/integrating-ai-evals-into-your-ai-app">Integrating AI Evals Into Your AI App</a></p></li><li><p><a href="https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis">Build an AI Evals Dataset from Scratch</a></p></li><li><p><a href="https://www.decodingai.com/p/generate-synthetic-datasets-for-ai-evals">Generate Synthetic Datasets for AI Evals</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-design-ai-evaluators-that-catch-failures">How to Design Evaluators</a></p></li><li><p><a href="https://www.decodingai.com/p/how-to-evaluate-the-evaluator-validate-llm-judge">How to Evaluate the Evaluator</a></p></li><li><p><a href="https://www.decodingai.com/p/rag-evaluation-6-metrics-framework">RAG Evaluation: The Only 6 Metrics You Need</a></p></li><li><p><strong>Lessons from 6 Months of Evals on a Production AI Companion</strong> &#8592; <em>You just finished this one</em></p></li></ol><p>&#8216;Till next time</p><p><a 
href="https://substack.com/@alejandroaboy?utm_source=global-search">Alejandro Aboy</a></p><div><hr></div><h3>Do you want more articles like this from Alejandro?</h3><p>If yes, consider subscribing to his amazing Substack, where he talks about data and AI from his real-world experience &#8595; </p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:1196229,&quot;name&quot;:&quot;The Pipe &amp; The Line&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!vmrQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d5b2131-da28-4621-ad6f-9574cbc41a1e_500x500.png&quot;,&quot;base_url&quot;:&quot;https://thepipeandtheline.substack.com&quot;,&quot;hero_text&quot;:&quot;Hands-on guides, tools, and experiments to sharpen your Data &amp; AI Engineering skills from someone who learned it all in the wild.&quot;,&quot;author_name&quot;:&quot;Alejandro Aboy&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#131826&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://thepipeandtheline.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" src="https://substackcdn.com/image/fetch/$s_!vmrQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d5b2131-da28-4621-ad6f-9574cbc41a1e_500x500.png" width="56" height="56" style="background-color: rgb(19, 24, 38);"><span class="embedded-publication-name">The Pipe &amp; The Line</span><div class="embedded-publication-hero-text">Hands-on guides, tools, and experiments to sharpen your Data &amp; AI Engineering skills from someone who learned it all in the wild.</div><div class="embedded-publication-author-name">By Alejandro 
Aboy</div></a><form class="embedded-publication-subscribe" method="GET" action="https://thepipeandtheline.substack.com/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/behind-the-scenes-of-ai-observability/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/behind-the-scenes-of-ai-observability?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.decodingai.com/p/behind-the-scenes-of-ai-observability?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p>Everything you learned in this article, from building evals datasets to evaluators, comes from the AI Evals &amp; Observability module of our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>.</p><p><strong>Your path to agentic AI for production. 
</strong>Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have built a multi-agent system that orchestrates <strong>Nova</strong> (a deep research agent) and <strong>Brown</strong> (a full writing workflow), plus a <strong>capstone project</strong> where you apply everything on your own. </p><p><em>Three portfolio projects and a certificate to show off in interviews. Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 190+ early students &#8212; &#8220;Every AI Engineer needs a course like this.&#8221;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. 
</strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month 
free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[Your Agent's Reasoning Is Fine - Its Memory Isn't]]></title><description><![CDATA[Using GraphRAG to build a Production Engineer agent that knows dependencies, incidents, and ownership.]]></description><link>https://www.decodingai.com/p/designing-production-engineer-agent-graphrag</link><guid isPermaLink="false">https://www.decodingai.com/p/designing-production-engineer-agent-graphrag</guid><dc:creator><![CDATA[Anca Ioana Muscalagiu]]></dc:creator><pubDate>Tue, 20 Jan 2026 12:02:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!99Cg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The pager goes off at 02:13.</p><p>One service is down. Then another 10 follow. Restarts begin. Logs and dashboards come up side by side.</p><p>The graph looks familiar. The error rate spikes, rolls back, spikes again. Someone pastes a dashboard link into Slack. Someone else replies with a half-sentence: <em>&#8220;Didn&#8217;t we see this two weeks ago?&#8221;</em> No ticket is linked. No postmortem is found. 
The incident feels known, but undocumented.</p><p>A workaround exists. Everyone knows that much. Nobody knows <strong>why</strong>.</p><p>It lives in a shell script, wrapped in a cron job, guarded by a comment that says <em>&#8220;DO NOT REMOVE&#8221;</em>. The person who wrote it left two companies ago. The context left with them.</p><p>Slack then becomes an archaeological dig. You scroll past emojis, past renamed channels, past a debate that ends mid-thread. Somewhere in 2021, an engineer explains the underlying problem, but no action is taken for now. </p><p>This is how enterprise systems decay. Not through broken code, but through forgotten understanding.</p><p>And this is how production engineers start carrying all this history in their heads.</p><p>They know which alert is real. Which one needs a manual nudge. Where each dependency lies. Which rollback will make things worse. Over time, the entire system is held together not by documentation or dashboards, but by the accumulated memory of our production engineers.</p><p>This works until it doesn&#8217;t. When the wrong person is asleep. When someone leaves. 
When the system grows just large enough that no single mind can hold it all anymore.</p><p>In this article, I want to walk through a very concrete use case and design it together with you.</p><p>We will design <strong>a Production Engineer agent </strong>that reacts to alerts by <strong>identifying the affected services and teams</strong>, <strong>understanding how the issue propagates </strong>through the system, and <strong>surfacing the context </strong>that<strong> </strong>engineers need to act quickly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JtZg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JtZg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 424w, https://substackcdn.com/image/fetch/$s_!JtZg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 848w, https://substackcdn.com/image/fetch/$s_!JtZg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 1272w, https://substackcdn.com/image/fetch/$s_!JtZg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!JtZg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png" width="1456" height="1332" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1332,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:393908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/183048385?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JtZg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 424w, https://substackcdn.com/image/fetch/$s_!JtZg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 848w, https://substackcdn.com/image/fetch/$s_!JtZg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 1272w, https://substackcdn.com/image/fetch/$s_!JtZg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0329c0f2-a267-4386-bacd-2e825e94c8d6_1614x1476.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Architecting a Production Engineer Agent using GraphRAG</figcaption></figure></div><p>By the end, you will have a clear understanding of how to design an agent for monitoring and reasoning about production systems, and more importantly,<strong> how to design the right kind of memory</strong> to support it. 
</p><p>The twist is that the agent&#8217;s real superpower isn&#8217;t reasoning&#8212;it&#8217;s<strong> GraphRAG</strong>.</p><p><strong>Let&#8217;s unpack why.</strong></p><p><em>But first, a quick word from our sponsor, Opik</em> &#8595;</p><div><hr></div><h2><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">AI Agents Virtual Hackathon With $30,000 in Prizes (Sponsored)</a></h2><p>Want to get motivated to build that AI agent you had in mind in the past 12 months while having fun, meeting cool people and potentially earning up to $10,000 (in cash per team)?</p><p><em><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Hackathons</a> are the best way to do that.</em> </p><p><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> is hosting a free one, together with Google DeepMind and Vercel, offering $30,000 in prizes with a single goal: building and shipping AI agents.</p><p><em>But wait.</em> That sounds like a scam. <em>Not really.</em> The catch is that you have to put in the work during the hackathon to convince the judge that your AI app is worth the prize.</p><p>The worst that can happen? You have the chance, <strong>for free</strong>, to access:</p><ul><li><p><strong>Expert Workshops</strong>: Learn observability, evaluation, and agent optimization from Comet&#8217;s team</p></li><li><p><strong>Premium Tools</strong>: Credits and support from Google, Vercel, and other partners</p></li><li><p><strong>Direct Mentorship</strong>: Technical support throughout the hackathon via Discord</p></li></ul><p><strong>Prizes</strong> will be awarded based on the 6 challenges below. 
You can win one category plus the best use of Opik, <strong>totaling $10,000</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t_B1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t_B1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg" width="1200" height="307.4175824175824" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:373,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:380760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/183048385?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t_B1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">&#8594; <a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Apply to Hackathon</a> &#8592;</figcaption></figure></div><p><strong>My take?</strong> The money is just a bonus, as this is a fantastic opportunity to learn for free from industry experts while building:</p><ul><li><p>LLM judges to evaluate your custom business use case.</p></li><li><p>Agentic RAG applications or coding agents.</p></li><li><p>Automated prompt tuning loops and guardrails.</p></li></ul><p>Or whatever AI agent moves and motivates you!</p><h3>Who Should Join?</h3><p>Developers with:</p><ul><li><p>Basic understanding of LLMs and AI agents</p></li><li><p>Experience building software applications</p></li><li><p>Python 
or TypeScript knowledge</p></li></ul><p><em>Sounds like this is for you? Then register here:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon&quot;,&quot;text&quot;:&quot;Apply to Hackathon&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon"><span>Apply to Hackathon</span></a></p><p><strong>Be quick!</strong> You can register at any time during the event, up to the project submission on February 9.</p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to our article.</em></p><h2>1. Our use case:  The Production Engineer</h2><p>As companies grow, their systems rarely fail in isolation. Services depend on other services, teams depend on teams, and ownership is spread across layers that are hard to see from any single place.</p><p>When something breaks upstream, the downstream impact is often unclear. Engineers are left stitching together context from dashboards, Slack threads, old announcements, and workarounds buried in Confluence.</p><p><strong>The problem is clear.</strong> Most issues are fixable. What slows teams down is figuring out what is really happening and how far it reaches before time is lost.</p><p>This is where the Production Engineer agent comes in.</p><p>The feature we want to design is simple, but powerful. 
When an alert is triggered, the agent should be able to:</p><ul><li><p>understand how the failure propagates across the organization, including which services are affected and which teams are responsible</p></li><li><p>diagnose likely causes using known dependencies and patterns from past incidents</p></li><li><p>surface the relevant context that engineers usually spend hours reconstructing from dashboards, Slack threads, and internal documentation</p></li></ul><p><strong>Why is this important?</strong></p><p>Because most production incidents are not slowed down by the lack of a fix. They are slowed down by the lack of clarity. Engineers need to understand what is happening, how far the issue reaches, and who needs to be involved before they can act with confidence.</p><p>The goal is to shorten the time between detection and action, especially in large enterprises where context is fragmented and knowledge is scattered across teams and tools.</p><p>To build something like this reliably, we need to carefully design the system and how information flows through it. That starts with <strong>the architecture.</strong></p><h2>2. Designing the architecture</h2><p>Now, let&#8217;s break down exactly how this system would work when that 02:13 pager goes off.</p><p>At that point, the goal is not just to raise an alert, but to immediately attach the right context so engineers don&#8217;t have to reconstruct it manually. The architecture is designed to do exactly that.</p><p>The interface of our <strong>Production Engineer Agent</strong> is intentionally straightforward:</p><ul><li><p><strong>Input</strong> comes from the monitoring stack via a webhook: Prometheus alerts, routed through Alertmanager, arrive as POST requests at our FastAPI endpoint, with each payload including the service name, error type, severity, and timestamp. 
The webhook fires the moment something crosses a threshold.</p></li><li><p><strong>Output</strong> goes to all affected teams via Slack: Structured incident reports are posted to relevant team channels, tagged with the right on-call engineers and linked to runbooks, past incidents, and related documentation.</p></li></ul><p>The diagram below shows how GraphRAG and MCP wire this together:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IPOT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IPOT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 424w, https://substackcdn.com/image/fetch/$s_!IPOT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 848w, https://substackcdn.com/image/fetch/$s_!IPOT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 1272w, https://substackcdn.com/image/fetch/$s_!IPOT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IPOT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png" width="1456" 
height="1332" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1332,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:393908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/183048385?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IPOT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 424w, https://substackcdn.com/image/fetch/$s_!IPOT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 848w, https://substackcdn.com/image/fetch/$s_!IPOT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 1272w, https://substackcdn.com/image/fetch/$s_!IPOT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65ee18b8-6440-4cee-b021-8b8462ae7ba0_1614x1476.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: Architecting a Production Engineer using GraphRAG</figcaption></figure></div><p>At a high level, the system is decomposed into five components, each with a clearly defined responsibility and a well-scoped interface:</p><p><strong>#1. Alerting System<br><br></strong>This is a standard component in any enterprise environment. Prometheus detects threshold breaches and emits alerts, which are routed through Alertmanager. From there, alerts are delivered via webhook to our FastAPI server. This webhook is the <strong>entry point for every incident</strong> and the only way external signals enter the agent system.</p><p><strong>#2. Agent Component</strong></p><p>The Agent Component orchestrates the entire flow. 
The FastAPI server receives the alert payload and forwards it to the <strong>Agent Controller</strong>, which runs the agent loop:</p><ul><li><p>invoking tools via the MCP Client (used to communicate with external systems),</p></li><li><p>querying the GraphRAG component to retrieve contextual knowledge about teams, system dependencies, and ownership,</p></li><li><p>preparing the prompt and additional context, then sending them to the LLM through the LLM Gateway, which handles all direct interaction with the Gemini API.</p></li></ul><p>The Agent Controller serves as the coordination layer between context retrieval, tool execution, and model inference.</p><p><strong>#3. GraphRAG Component</strong></p><p>GraphRAG serves as the agent&#8217;s long-term, structured memory. It is built on a Neo4j graph database that models services, teams, and their dependencies as a property graph, with vector embeddings attached to nodes.</p><p>The Graph Query Engine performs graph traversals to retrieve the most relevant entities and their dependencies based on the incident context. The graph itself is populated offline by a Graph Extractor, which ingests organizational data from sources such as Confluence documentation.</p><p><strong>#4. MCP Servers</strong></p><p>MCP servers provide real-time, external context. A global MCP router forwards requests to specialized servers:</p><ul><li><p><strong>Confluence MCP</strong> retrieves documentation,</p></li><li><p><strong>GitHub MCP</strong> fetches recent code changes and release information,</p></li><li><p><strong>Slack MCP</strong> searches historical discussions and posts updates to support channels,</p></li><li><p><strong>Prometheus MCP</strong> retrieves live metrics.</p></li></ul><p>This separation allows each data source to evolve independently while presenting a uniform interface to the agent.</p><p><strong>#5. Observability</strong></p><p>Observability is handled through Opik. 
Prompt Monitoring tracks the agent&#8217;s questions, tool usage, and retrieval strategies, while Trace Logging records executed queries and their latency.</p><p>Together, these signals provide visibility into how the agent reasons, what it accesses, and where time is spent.</p><p>Now that we understand each component, let&#8217;s walk through <strong>the data flow:</strong></p><ol><li><p>Prometheus detects a threshold breach and sends an alert to Alertmanager with the service name, error type, severity, and timestamp.</p></li><li><p>Alertmanager delivers the alert via webhook to our FastAPI server endpoint, which hands it to the Agent Controller.</p></li><li><p>The Agent Controller queries the GraphRAG Component for related context. The Graph Query Engine runs a semantic search to find the nodes (&#8220;communities&#8221;) closest to the query and fetches them along with their dependencies.</p></li><li><p>With the graph context assembled, the Agent Controller sends a planning request to Gemini, asking which MCP servers to call for real-time data.</p></li><li><p>Gemini returns the required tools. 
The Agent Controller invokes the Global MCP Server with the list.</p></li><li><p>Each requested MCP server is called: GitHub fetches recent code changes and releases, Slack searches incident discussions, Confluence retrieves documentation, and Prometheus pulls current metrics.</p></li><li><p>The Agent Controller sends the graph context and the fresh MCP data to Gemini.</p></li><li><p>Gemini synthesizes everything into a structured incident report with an impact summary, pattern recognition, recent changes, current state, and recommended actions.</p></li><li><p>The report goes to the Slack MCP Server, which posts it to the affected team channels.</p></li></ol><blockquote><p><em><strong>Note:</strong> Gemini may request additional MCP tool calls as needed, meaning steps 5-7 can loop until the complete context is gathered.</em></p></blockquote><p>And just like that, the incident is contextualized and actionable in seconds, before the engineer opens their laptop.</p><p>But this speed does not come from the agent loop alone.</p><p>What makes this possible is the memory behind it: the way organizational knowledge is stored, connected, and retrieved at the moment it is needed.</p><p>This is where GraphRAG enters the picture.</p><div><hr></div><h2>3. 
Graph RAG as our organization's knowledge</h2><p>Before we talk about <em>GraphRAG</em>, we need to clarify the foundation it builds upon: <strong>the knowledge graph</strong>.</p><h3>What Is a Knowledge Graph?</h3><p>A knowledge graph is a structured way of representing information as a system of connected components:</p><ul><li><p><strong>Nodes</strong> represent entities such as documents, services, concepts, teams, incidents, or decisions</p></li><li><p><strong>Edges</strong> represent relationships between those entities</p></li><li><p><strong>Properties</strong> on nodes and edges store metadata like summaries, timestamps, ownership, or importance</p></li></ul><p>Unlike flat document stores or pure vector databases, a knowledge graph explicitly captures <strong>how information is connected</strong>, not just where it appears.</p><p>From a production perspective, this matters because it enables things we rely on every day:</p><ul><li><p>Tracing relationships (&#8220;what depends on this service?&#8221;)</p></li><li><p>Aggregating information across multiple sources</p></li><li><p>Preserving institutional knowledge beyond individual files, tickets, or dashboards</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!99Cg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!99Cg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 424w, 
https://substackcdn.com/image/fetch/$s_!99Cg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 848w, https://substackcdn.com/image/fetch/$s_!99Cg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 1272w, https://substackcdn.com/image/fetch/$s_!99Cg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!99Cg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png" width="980" height="986" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:986,&quot;width&quot;:980,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/183048385?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!99Cg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 424w, 
https://substackcdn.com/image/fetch/$s_!99Cg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 848w, https://substackcdn.com/image/fetch/$s_!99Cg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 1272w, https://substackcdn.com/image/fetch/$s_!99Cg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07aa4d04-2e5a-4fa4-b735-a01d5a3f5ff6_980x986.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3: Example of a knowledge graph</figcaption></figure></div><p>In practice, a knowledge graph becomes a <strong>shared semantic layer for the organization</strong>: a living map of how systems, documentation, decisions, and operational knowledge relate to one another.</p><h3>What Is GraphRAG?</h3><p><strong>GraphRAG is Retrieval-Augmented Generation (RAG) using a Knowledge Graph.</strong></p><p>At a high level, the difference is simple:</p><ul><li><p><strong>Traditional RAG</strong> retrieves the most semantically similar text chunks from a vector database.</p></li><li><p><strong>GraphRAG</strong> retrieves <em>connected knowledge</em> by traversing relationships in a graph.</p></li></ul><p>For us, the most useful way to think about GraphRAG is this:</p><blockquote><p><strong>GraphRAG is a set of RAG patterns where retrieval is guided by graph structure, not just similarity scores.</strong></p></blockquote><p>Each pattern depends on having the right graph representation in place.</p><h4>Why GraphRAG for Organizational Knowledge?</h4><p>In production environments, we rarely ask questions that can be answered by a handful of similar text snippets.</p><p>Instead, we ask questions like:</p><ul><li><p><em>&#8220;What do we know about this issue across teams and services?&#8221;</em></p></li><li><p><em>&#8220;Summarize everything related to this initiative and its downstream dependencies.&#8221;</em></p></li></ul><p>These are <strong>coverage and synthesis questions</strong>, not similarity questions.</p><p>A similarity-based retriever might return a few highly relevant chunks, but still miss:</p><ul><li><p>Entire systems</p></li><li><p>Related incidents</p></li><li><p>Important but differently worded documentation</p></li></ul><p>GraphRAG is designed specifically for this class of problems.</p><h3>The GraphRAG Approach</h3><p>At a high level, GraphRAG works 
in <strong>two phases</strong>:</p><ol><li><p><strong>Graph Generation</strong> &#8211; turning raw organizational knowledge into a structured, navigable graph</p></li><li><p><strong>Query Answering</strong> &#8211; using that structure to retrieve a <em>complete and connected context</em>, not just similar text</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i7wz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i7wz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 424w, https://substackcdn.com/image/fetch/$s_!i7wz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 848w, https://substackcdn.com/image/fetch/$s_!i7wz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 1272w, https://substackcdn.com/image/fetch/$s_!i7wz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i7wz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png" width="1182" height="1196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1196,&quot;width&quot;:1182,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163296,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/183048385?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i7wz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 424w, https://substackcdn.com/image/fetch/$s_!i7wz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 848w, https://substackcdn.com/image/fetch/$s_!i7wz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 1272w, https://substackcdn.com/image/fetch/$s_!i7wz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dad049a-15ac-406a-bc11-4b616b8779c5_1182x1196.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: Pipelines in the GraphRAG approach</figcaption></figure></div><h3>Phase 1: Graph Generation</h3><h4>1. Source Documents &#8594; Text Chunks</h4><p>GraphRAG starts by breaking source documents into smaller text chunks. This is a practical necessity. Large documents are difficult to index, reason over, and retrieve from reliably.</p><p>In our case, these source documents include internal architecture documentation, runbooks, postmortems, and operational notes. Chunking ensures that knowledge about a single service, failure mode, or mitigation strategy can be surfaced independently.</p><p>At this stage, we still have unstructured text. The graph does not exist yet.</p><h4>2. 
Text Chunks &#8594; Element Instances (Entities &amp; Relationships)</h4><p>Each chunk is then analyzed to extract what GraphRAG calls <em>elements</em>. An element is not yet a node or an edge in the graph. It is an intermediate representation that captures either:</p><ul><li><p>an entity, such as a service, team, incident, or concept</p></li><li><p>or a relationship between two entities, such as &#8220;depends on,&#8221; &#8220;owned by,&#8221; or &#8220;caused by&#8221;</p></li></ul><p>You can think of elements as structured facts derived from text.</p><p>For example, from a paragraph in a postmortem, the system might extract:</p><ul><li><p>an entity element for <code>Service A</code></p></li><li><p>an entity element for <code>Service B</code></p></li><li><p>a relationship element expressing that <code>Service A depends on Service B</code></p></li></ul><p>At this point, these are still logical units, not graph objects. They are normalized, deduplicated, and reconciled across chunks.</p><h4>3. Element Instances &#8594; Element Summaries</h4><p>Once elements are extracted, the LLM generates concise summaries for each one. These summaries give semantic meaning to otherwise low-level identifiers.</p><p>In practice, this means that a service or incident is no longer just a name, but a short, human-readable description that captures its role, behavior, and operational significance.</p><p>Only after this step do elements become concrete graph objects.</p><p>Entity elements become nodes.<br>Relationship elements become edges.<br>Summaries and metadata become node and edge properties.</p><p>This is the point where the knowledge graph is actually formed.</p><h4>4. Element Summaries &#8594; Graph Communities</h4><p>As the graph grows, individual nodes become less useful in isolation. </p><p>GraphRAG addresses this by clustering the graph into communities using graph algorithms such as hierarchical Leiden. 
Each community represents a tightly connected subgraph that corresponds to a coherent topic or domain.</p><p>For our organization, these communities tend to align with real operational boundaries: a platform area, a group of interdependent services, or a recurring class of incidents. This clustering emerges from the data itself rather than being manually defined, which is important in production environments where systems evolve faster than documentation.</p><h4>5. Graph Communities &#8594; Community Summaries</h4><p>After communities are formed, the LLM generates a summary for each one. These summaries describe what the community is about and how its elements relate to each other, effectively creating a higher-level index over the graph.</p><p>In our use case, these <strong>community summaries become the primary unit of retrieval.</strong></p><h3>Phase 2: Answering the Query</h3><p>When a query is issued, our Agentic GraphRAG does the following:</p><ul><li><p>identifies the relevant communities</p></li><li><p>generates intermediate answers from each</p></li><li><p>merges them into a single global response</p></li></ul><p>For production engineers, this means questions like &#8220;What do we know about failures related to service X?&#8221; or &#8220;What systems are involved in this incident class?&#8221; are answered using a complete, connected view of organizational knowledge.</p><p>Now, to make GraphRAG work for our use case, <strong>we need a graph that reflects how on-call engineers already think</strong>: services, dependencies, ownership, incidents, and the artifacts that explain them.</p><p>A simple schema is enough to start:</p><p><strong>Nodes</strong></p><ul><li><p><strong>Service</strong> (name, domain, tier, repo, tags, embedding)</p></li><li><p><strong>Team</strong> (name, on-call channel, owners, embedding)</p></li><li><p><strong>Incident</strong> (id, timestamp, severity, summary, embedding)</p></li><li><p><strong>Runbook</strong> (url, title, steps 
summary, embedding)</p></li><li><p><strong>Doc</strong> (source, url, title, embedding)</p></li><li><p><strong>Release/PR</strong> (id, timestamp, author, summary, embedding)</p></li></ul><p><strong>Relationships</strong></p><ul><li><p><code>DEPENDS_ON</code> (Service &#8594; Service)</p></li><li><p><code>OWNED_BY</code> (Service &#8594; Team)</p></li><li><p><code>AFFECTED</code> (Incident &#8594; Service)</p></li><li><p><code>RESPONDED_BY</code> (Incident &#8594; Team)</p></li><li><p><code>HAS_RUNBOOK</code> (Service &#8594; Runbook)</p></li><li><p><code>DOCUMENTED_IN</code> (Service/Incident &#8594; Doc)</p></li><li><p><code>RELATED_TO</code> (Incident &#8596; Incident)</p></li><li><p><code>INTRODUCED_BY</code> (Incident/Service &#8594; Release/PR)</p></li></ul><p>Each node carries a vector embedding derived from its LLM-generated summary.</p><ul><li><p>A <code>Service</code> node is embedded from its service summary.</p></li><li><p>An <code>Incident</code> node is embedded from its incident description.</p></li><li><p>A <code>Runbook</code> node is embedded from its condensed operational steps.</p></li></ul><p>Embeddings are not created from raw documents, but from these normalized, human-readable representations stored on the nodes.</p><p>At query time, the agent first uses these embeddings to locate the most relevant nodes in the graph. From there, Neo4j expands outward through edges like DEPENDS_ON and OWNED_BY, assembling the full dependency radius around the incident.</p><p>In Neo4j, this gives us two retrieval modes: semantic search over embeddings to find the right entry points, and graph traversal to expand through dependencies and ownership until we have the full dependency radius and sufficient context within the loop.</p><blockquote><p><strong>Note</strong>: This schema is also known as an <strong>ontology</strong>. It defines the vocabulary of the system and the rules by which concepts relate to each other. 
A well-designed ontology makes the graph predictable, extensible, and aligned with how engineers already think about production systems.</p></blockquote><p>Now that we have the graph schema in place, let&#8217;s see how to maintain it.</p><h4>How often does the graph need updating?</h4><p>To keep things simple, build the graph once at the start by scraping your documentation sources: Confluence pages with runbooks, incident postmortems, architecture diagrams, and service dependencies. Wherever your operational knowledge lives, pull it in during initial setup.</p><p>After that, update the graph daily. Run a scheduled job each night to catch any new documentation, updated runbooks, or organizational changes. Documented topology changes slowly compared to a live incident, so a daily sync is usually sufficient.</p><h4>What if the graph and the MCP servers disagree?</h4><p>Real-time data, such as current metrics and active incidents, does not live in the graph; it comes through the MCP servers. The graph holds structure and history. The MCP layer holds what&#8217;s happening right now.</p><p>When the two conflict, MCP data takes priority for the current incident. If the GitHub MCP server reports a deployment from 10 minutes ago that is not in the graph yet, the agent uses the GitHub data.</p><p>This priority can be encoded in the system prompt:</p><pre><code><code>When assembling incident context, treat information sources in this order:
1. MCP servers provide current state (deployments, metrics, discussions)
2. Graph provides historical patterns and documented structure
3. If they conflict, use MCP data and note the discrepancy in your report

Flag discrepancies explicitly: "The graph shows no dependency between 
service A and B, but recent deployments suggest otherwise."</code></code></pre><h3>An end-to-end example of applying GraphRAG</h3><p>So far, GraphRAG can still feel abstract. Let&#8217;s make it concrete with a small example.</p><h4>Step 1: Start with a real piece of operational text</h4><p>Imagine this snippet lives in Confluence under a runbook page:</p><blockquote><p><strong>Payments API &#8212; 5xx spike after deploy</strong><br>Symptoms: increased 5xx on <code>payments-api</code>, elevated latency.<br>Recent incidents suggest the downstream cause is usually <code>auth-service</code> throttling.<br><code>payments-api</code> depends on <code>auth-service</code> and <code>ledger-service</code>.<br>Owner: Payments Platform team.<br>Mitigation: rollback <code>payments-api</code> to previous release; if error rate persists, check <code>auth-service</code> rate limits.</p></blockquote><p>This is exactly the kind of &#8220;human memory&#8221; we want to keep outside people&#8217;s heads.</p><h4>Step 2: Extract entities and relationships</h4><p>From this text, the Graph Extractor would produce a minimal set of elements:</p><p><strong>Entities</strong></p><ul><li><p>Service: <code>payments-api</code></p></li><li><p>Service: <code>auth-service</code></p></li><li><p>Service: <code>ledger-service</code></p></li><li><p>Team: <code>Payments Platform</code></p></li><li><p>Runbook: <code>Payments API &#8212; 5xx spike after deploy</code></p></li></ul><p><strong>Relationships</strong></p><ul><li><p><code>payments-api</code> <code>DEPENDS_ON</code> <code>auth-service</code></p></li><li><p><code>payments-api</code> <code>DEPENDS_ON</code> <code>ledger-service</code></p></li><li><p><code>payments-api</code> <code>OWNED_BY</code> <code>Payments Platform</code></p></li><li><p><code>payments-api</code> <code>HAS_RUNBOOK</code> <code>Payments API &#8212; 5xx spike after deploy</code></p></li></ul><p>Each entity also gets a short LLM-generated summary, which is what we 
embed.</p><h4>Step 3: Materialize it into the graph</h4><p>At this point, the &#8220;elements&#8221; become real graph objects.</p><p>In Neo4j, the result looks like this (simplified):</p><ul><li><p><code>(:Service {name: "payments-api", summary: "...", embedding: [...]})</code></p></li><li><p><code>(:Team {name: "Payments Platform", oncall: "#payments-oncall", ...})</code></p></li><li><p><code>(:Runbook {title: "Payments API &#8212; 5xx spike after deploy", url: "...", ...})</code></p></li></ul><p>Connected by explicit edges like <code>DEPENDS_ON</code>, <code>OWNED_BY</code>, and <code>HAS_RUNBOOK</code>.</p><h4>Step 4: What retrieval looks like at query time</h4><p>Now, say an alert arrives:</p><ul><li><p>service = <code>payments-api</code></p></li><li><p>symptom = <code>5xx spike</code></p></li><li><p>timestamp = now</p></li></ul><p>GraphRAG typically runs in two phases:</p><h4>1) Semantic &#8220;entry point&#8221; lookup</h4><p>We first find the most relevant nodes using embeddings (e.g., &#8220;payments-api 5xx spike&#8221;).</p><p>Even if the wording differs, embeddings anchor us to the right place in the graph: <code>payments-api</code>, its runbook, and nearby incidents.</p><h4>2) Graph expansion through dependencies and ownership</h4><p>Once we have an entry node, we expand outward to capture blast radius and context.</p><p>A simple Cypher query to pull dependencies and ownership might look like:</p><pre><code><code>MATCH (s:Service {name: "payments-api"})
OPTIONAL MATCH (s)-[:DEPENDS_ON]-&gt;(dep:Service)
OPTIONAL MATCH (s)-[:OWNED_BY]-&gt;(t:Team)
OPTIONAL MATCH (s)-[:HAS_RUNBOOK]-&gt;(r:Runbook)
RETURN s, collect(dep) AS dependencies, t AS owner, collect(r) AS runbooks
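// Hedged sketch, not part of the original query: run this as a separate statement.
// It illustrates the semantic entry-point lookup, followed by a one-hop expansion.
// Assumes Neo4j 5.11+ and a vector index over Service.embedding; the index name
// "service_embeddings" and the $queryEmbedding parameter are hypothetical.
CALL db.index.vector.queryNodes("service_embeddings", 5, $queryEmbedding)
YIELD node AS entry, score
OPTIONAL MATCH (entry)-[:DEPENDS_ON]-&gt;(dep:Service)
RETURN entry.name AS entry_point, score, collect(dep.name) AS dependencies
ORDER BY score DESC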
</code></code></pre><p>If we want to bound expansion by hops (to avoid exploding the subgraph), we can do:</p><pre><code><code>MATCH (s:Service {name: "payments-api"})-[:DEPENDS_ON*1..2]-&gt;(dep:Service)
RETURN s, collect(DISTINCT dep) AS deps_2_hops
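// Hedged sketch, not part of the original query: run this as a separate statement.
// It reuses only the Service, Team, DEPENDS_ON, and OWNED_BY elements from the schema
// to list who owns each dependency inside the same 2-hop radius.
MATCH (s:Service {name: "payments-api"})-[:DEPENDS_ON*1..2]-&gt;(dep:Service)
OPTIONAL MATCH (dep)-[:OWNED_BY]-&gt;(t:Team)
RETURN DISTINCT dep.name AS dependency, t.name AS owner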
</code></code></pre><p>This is the point where GraphRAG stops behaving like a search system and starts behaving like an operational model of the organization.</p><p>We are no longer pulling &#8220;relevant documents.&#8221; We are reconstructing a slice of the system: which services are involved, how far the blast radius extends, who owns what, and which operational knowledge applies.</p><p>The retrieval step becomes an act of <em>structural reasoning</em> over the organization itself.</p><div><hr></div><h2>4. Deep dive into our tech stack</h2><p>By now, the behavior of the system should be clear.</p><p>The difference between a diagram and a working system is the tooling. This section covers the tools we chose to implement the agent, and the tradeoffs behind them.</p><p>The guiding principle is simple:</p><blockquote><p>Each component should solve one problem well and expose a stable boundary to the rest of the system.</p></blockquote><h3>Application Serving &amp; Orchestration</h3><p>The agent runs inside a <strong>FastAPI</strong> application, which serves as the entry point for all incoming alerts.</p><p>We use FastAPI because it supports async request handlers natively, which suits I/O-heavy workloads like this one, where most of the time is spent waiting on graph queries, tool calls, and model responses. Other frameworks like Flask or Django could work, but FastAPI gives us async handling and request validation with very little setup.</p><p>The application itself is intentionally thin.</p><p>It handles request validation and transport, then immediately hands execution to the Agent Controller. <strong>No business logic lives in the web layer</strong>.</p><p>Agent behavior is coordinated by a custom Agent Controller, rather than a general-purpose agent framework.</p><p>This is a deliberate choice. 
Frameworks like LangChain or LangGraph are useful for prototyping, but they often hide execution order and error handling behind abstractions that become liabilities in production.</p><p>Here, <strong>the agent loop is explicit.</strong></p><p>The controller decides when to retrieve context, when to call tools, when to invoke the model, and when to stop. Retries and limits are owned by the application itself, making behavior predictable and easier to debug during incidents.</p><h3>GraphRAG Retrieval</h3><p>Context retrieval happens through a <strong>graph query engine</strong> built on <strong>Neo4j</strong>. Instead of returning isolated documents, the system retrieves connected subgraphs: clusters of services and teams along with their dependencies.</p><p>The retrieval layer can be implemented using <strong>LlamaIndex&#8217;s PropertyGraphIndex</strong>, which provides building blocks for graph extraction and retrieval. This gives a solid starting point, while still allowing you to customize the graph schema and retrieval logic to fit incident response workflows.</p><blockquote><p><strong>Check out the reference implementation</strong> &#8594; <a href="https://developers.llamaindex.ai/python/examples/property_graph/agentic_graph_rag_vertex/">GraphRAG with LlamaIndex</a></p></blockquote><h3>Memory and storage</h3><p>Long-term memory is implemented using a <strong>Neo4j vector store</strong>, with vector embeddings attached to graph nodes.</p><p>As described in the previous section, nodes and relationships only become graph objects after they are summarized. Those summaries, such as service descriptions, incident summaries, and condensed runbook steps, are what we embed and store in Neo4j.</p><p>Choosing a graph database reflects the nature of the data. Production knowledge is relational. Services depend on services. Teams own systems. Incidents recur in patterns. 
A graph database models this directly.</p><p>A pure vector database would retrieve similar text, but it cannot express ownership or dependency chains. A relational database would require rigid schemas for information that evolves constantly. Neo4j provides the right balance of flexibility and structure for this domain.</p><p>In practice, this gives us two retrieval modes:</p><ul><li><p>semantic search over embeddings to find the right entry points</p></li><li><p>graph traversal to expand through dependencies and ownership</p></li></ul><p>The embeddings tell the system <em>where to start</em>.<br>The graph tells it <em>what else matters</em>.</p><h3>The Language Model </h3><p>The language model used is <strong>Gemini</strong>, accessed through the <strong>LLM Gateway</strong>.</p><p>Gemini was chosen primarily for pragmatic reasons. The free tier provides enough room for experimentation, and the stateful chat API makes it well-suited for multi-step workflows that involve tool calls and iterative reasoning.</p><p>The gateway abstracts away model-specific details, handling prompt construction, retries, and configuration. This makes it possible to swap Gemini for another provider later without rewriting the agent logic.</p><h3>Observability &amp; Evaluation</h3><p>As usual, we recommend using <strong><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a></strong> for observability and evaluation.</p><p>Opik captures prompt traces, retrieval steps, tool calls, and model outputs as a single execution trace, making agent behavior inspectable end to end.</p><p>For a GraphRAG-based on-call agent, this is essential. It allows us to see what context was retrieved from the graph, which tools were invoked, and how the final incident report was produced. 
Opik also supports replaying and comparing runs, which helps evaluate changes to prompts, retrieval strategies, or graph structure.</p><p>This makes Opik a natural fit for operating and iterating on agents in production.</p><div><hr></div><h2>Conclusion</h2><p>On-call is not hard because engineers cannot fix things.</p><p>It is hard because context is scattered. Ownership is unclear. Dependencies are implicit. The &#8220;why&#8221; lives in old Slack threads and half-written runbooks.</p><p>This is exactly the gap a Production Engineer agent can close.</p><p>Not by becoming a smarter reasoner, but by being grounded in the right kind of context. GraphRAG turns organizational knowledge into a connected system the agent can traverse: services, teams, dependencies, incidents, and the artifacts that explain them.</p><p>The rest is engineering discipline.</p><p>Keep orchestration explicit. Use a graph database when the domain is relational. Instrument everything with LLMOps from day one.</p><p>Start simple. Build the graph. Wire retrieval. Add only the tools you need.</p><p>That is how you get from a 02:13 pager to an actionable incident report&#8212;before the on-call engineer opens their laptop.</p><p>See you next Tuesday.</p><p>Anca Muscalagiu</p><p>The views expressed are my own and do not represent my employer.</p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> (by Comet) for sponsoring the series and keeping it free!</em></p><div><hr></div><h2>References</h2><ol><li><p>Zilliz Learn. (n.d.). <em>GraphRAG explained: Enhancing RAG with knowledge graphs</em>. Medium.<br><a href="https://medium.com/%40zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1">https://medium.com/%40zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1</a></p></li><li><p>LlamaIndex. (n.d.). <em>Agentic GraphRAG with property graphs</em>. LlamaIndex Documentation.<br><a href="https://developers.llamaindex.ai/python/examples/property_graph/agentic_graph_rag_vertex/">https://developers.llamaindex.ai/python/examples/property_graph/agentic_graph_rag_vertex/</a></p></li><li><p>JingleMind. 
(n.d.). <em>Mastering advanced RAG methods: GraphRAG with Neo4j implementation using LangChain</em>. Medium.<br><a href="https://medium.com/@jinglemind.dev/mastering-advanced-rag-methods-graphrag-with-neo4j-implementation-with-langchain-42b8f1d05246">https://medium.com/@jinglemind.dev/mastering-advanced-rag-methods-graphrag-with-neo4j-implementation-with-langchain-42b8f1d05246</a></p></li><li><p>Comet ML. (n.d.). Evaluate your LLM application | Opik Documentation. Comet. <a href="https://www.comet.com/docs/opik/evaluation/evaluate_your_llm">https://www.comet.com/docs/opik/evaluation/evaluate_your_llm</a></p></li><li><p>Comet ML. (n.d.). Open&#8209;source LLM Evaluation Platform | Opik by Comet. Comet. <a href="https://www.comet.com/site/products/opik/">https://www.comet.com/site/products/opik/</a></p></li></ol><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[How to Design Python AI Projects That Don't Fall Apart]]></title><description><![CDATA[A framework-agnostic approach to modular agents, workflows, and LLM apps using the pragmatic clean architecture design.]]></description><link>https://www.decodingai.com/p/how-to-design-python-ai-projects</link><guid isPermaLink="false">https://www.decodingai.com/p/how-to-design-python-ai-projects</guid><dc:creator><![CDATA[Paul Iusztin]]></dc:creator><pubDate>Tue, 13 Jan 2026 12:18:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7d11d17d-0235-40f8-8e87-d97cd2e60634_1196x1082.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was a true believer in the Clean Architecture, thinking that would help me figure out how to structure Python AI projects. I spent countless hours obsessively dividing my projects into <code>domain</code>, <code>application</code>, <code>infrastructure</code>, and <code>interface</code> folders. 
I forced every single piece of code to fit into these four buckets. It felt &#8220;correct.&#8221; But in reality, it was a nightmare to maintain. Only when I realized that these layers should be <strong>virtual concepts</strong> did I finally crack it. </p><p>Python is incredibly flexible: it lets you build anything. That is also why it is so hard to know how to design and structure your Python code properly. All the responsibility is on the developer, and that flexibility often leads to <em>&#8220;spaghetti code&#8221;</em> when building complex AI apps such as agents and workflows.</p><p>Most recommendations on how to design Python projects fall into two extremes. Either they are highly specific to a tool, such as a FastAPI template or a LangGraph starter kit, or they follow the Clean Architecture pattern too rigidly (it is usually presented with Java-style examples and doesn&#8217;t map one-to-one to Python). I have been a victim of this dogmatism as well.</p><p>Still, we need a middle ground that avoids spaghetti code while keeping our code simple. We need a tool- and framework-agnostic approach that provides structure without bloat. Even in a world where code is generated entirely by AI, knowing how to design your codebase is probably one of the most important skills.</p><p>In this article, I present a pragmatic, &#8220;loose&#8221; version of Clean Architecture for building AI projects such as AI agents, workflows, or LLM apps. We won&#8217;t follow the book letter-by-letter. 
Instead, we will adopt only the principles that make code modular, flexible, testable, and maintainable.</p><p><em>We call this <strong>the pragmatic clean architecture.</strong></em></p><p>Here is what we will cover:</p><ul><li><p>Define the four virtual layers required for modularity.</p></li><li><p>Structure an AI project to separate business logic from infrastructure and serving layers.</p></li><li><p>Implement a scalable folder structure.</p></li><li><p>Avoid the three biggest mistakes engineers make when structuring Python apps.</p></li></ul><p><em>But first, a quick word from our sponsor, Opik</em> &#8595;</p><div><hr></div><h2><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">AI Agents Virtual Hackathon With $30,000 in Prizes (Sponsored)</a></h2><p>Want to get motivated to build that AI agent you have had in mind for the past 12 months while having fun, meeting cool people, and potentially earning up to $10,000 (in cash per team)?</p><p><em><a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Hackathons</a> are the best way to do that.</em> </p><p><a href="https://www.comet.com/site/products/opik/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> is hosting a free one, together with Google DeepMind and Vercel, offering $30,000 in prizes with a single goal: building and shipping AI agents.</p><p><em>But wait.</em> That sounds like a scam. <em>Not really.</em> The catch is that you have to put in the work during the hackathon to convince the judge that your AI app is worth the prize.</p><p>The worst that can happen? 
You have the chance, <strong>for free</strong>, to access:</p><ul><li><p><strong>Expert Workshops</strong>: Learn observability, evaluation, and agent optimization from Comet&#8217;s team</p></li><li><p><strong>Premium Tools</strong>: Credits and support from Google, Vercel, and other partners</p></li><li><p><strong>Direct Mentorship</strong>: Technical support throughout the hackathon via Discord</p></li></ul><p><strong>Prizes</strong> will be awarded based on the 6 challenges below. You can win one category plus the best use of Opik, <strong>totaling $10,000</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t_B1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!t_B1!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg" width="1200" height="307.4175824175824" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:373,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:380760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/183048385?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t_B1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 424w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 848w, https://substackcdn.com/image/fetch/$s_!t_B1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!t_B1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2700a07a-99a2-4964-b580-fff1a8f0460f_1562x400.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">&#8594; <a href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon">Apply to Hackathon</a> &#8592;</figcaption></figure></div><p><strong>My take?</strong> The money is just a bonus, as this is a fantastic opportunity to learn for free from industry experts while building:</p><ul><li><p>LLM judges to evaluate your custom 
business use case.</p></li><li><p>Agentic RAG applications or coding agents.</p></li><li><p>Automated prompt tuning loops and guardrails.</p></li></ul><p>Or whatever AI agent moves and motivates you!</p><h3>Who Should Join?</h3><p>Developers with:</p><ul><li><p>Basic understanding of LLMs and AI agents</p></li><li><p>Experience building software applications</p></li><li><p>Python or TypeScript knowledge</p></li></ul><p><em>Sounds like this is for you? Then register here:</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon&quot;,&quot;text&quot;:&quot;Apply to Hackathon&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.encodeclub.com/programmes/comet-resolution-v2-hackathon"><span>Apply to Hackathon</span></a></p><p><strong>Be quick!</strong> Registration opens on Jan 13 (Today!), and the project submission is on Feb 9. You can register at any time during the event, up to Feb 9.</p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to our article.</em></p><h2>What You Need to Know From the Clean Architecture</h2><p>Before we dive into folders and files, we need to understand the mental model. The Clean Architecture organizes a system into four concentric circles, or layers.</p><ol><li><p><strong>The Domain Layer (The &#8220;What&#8221;):</strong> This is the inner core. It defines your business objects and the specific units of work your AI performs. It doesn&#8217;t know about databases, Application Programming Interfaces (APIs), or LLMs.</p></li><li><p><strong>The Application Layer (The &#8220;How&#8221;):</strong> This layer orchestrates the domain elements. It defines the workflows and use cases. 
It connects the steps required to achieve a business result.</p></li><li><p><strong>The Infrastructure Layer (The &#8220;External Dependencies&#8221;):</strong> This contains the concrete implementations of your external dependencies. It includes the specific LLM providers like OpenAI or Gemini, database dependencies (Postgres, MongoDB, Qdrant), storage strategies (local, S3) or any other API or tooling that your system depends on.</p></li><li><p><strong>The Serving Layer (The &#8220;Interface&#8221;):</strong> This is the outermost layer. It defines how the outside world interacts with your application. This could be through a Command Line Interface (CLI), a REST API, or a Model Context Protocol (MCP) server.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3rsv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3rsv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 424w, https://substackcdn.com/image/fetch/$s_!3rsv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 848w, https://substackcdn.com/image/fetch/$s_!3rsv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3rsv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3rsv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png" width="935" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:935,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69748,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/184308448?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3rsv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 424w, https://substackcdn.com/image/fetch/$s_!3rsv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 848w, https://substackcdn.com/image/fetch/$s_!3rsv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3rsv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90f6306d-8b15-4188-b7e9-00c409af00b1_935x815.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 1: Clean Architecture Diagram showing inward-only dependencies</figcaption></figure></div><p>The most important rule to remember is the <strong>Dependency Rule</strong>: dependencies must always point inward.</p><p>The outer layers know about the inner layers. 
But the <strong>application and domain layers must never be aware of the infrastructure and serving layers</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QGPK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QGPK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 424w, https://substackcdn.com/image/fetch/$s_!QGPK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 848w, https://substackcdn.com/image/fetch/$s_!QGPK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 1272w, https://substackcdn.com/image/fetch/$s_!QGPK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QGPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png" width="684" height="649" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:684,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/184308448?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QGPK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 424w, https://substackcdn.com/image/fetch/$s_!QGPK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 848w, https://substackcdn.com/image/fetch/$s_!QGPK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 1272w, https://substackcdn.com/image/fetch/$s_!QGPK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c28d7db-872e-48a3-819e-3e1e99103b8e_684x649.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 2: Diagram illustrating inward-only dependencies in clean architecture, showing flow from outer to inner layers.</figcaption></figure></div><p>This isolation is the <strong>key advantage</strong>. By keeping your core AI logic &#8220;pure&#8221; in the app and domain layers, you can reuse the exact same AI agent across different platforms. You can run it on a CLI, a web application, or a VS Code extension. You can also swap document storage from your local disk to S3, or your model from Gemini to a local model running on your device, without changing a single line of business logic.</p><p>Another advantage comes from properly separating your domain and application layers. Every class or function in your domain layer should work on its own, as a self-contained unit of &#8220;work&#8221;. 
In fancy words, it should be orthogonal. Meanwhile, your app layer should compose components from your domain layer into different business use cases. In other words, your domain layer is your LEGO blocks, while your app layer sticks them together to solve real-world business problems.</p><p>I love comparing the clean architecture to running a <strong>professional kitchen</strong>. The <strong>domain</strong> represents your ingredients. The <strong>application layer</strong> is the recipe. It is the step-by-step process of cooking. The <strong>infrastructure</strong> is your equipment, like the stove or blender. Finally, the <strong>serving layer</strong> is how the customer gets the food. It could be plated in a dining room, packed in a takeout box, or served at a buffet. You can swap a gas stove for an electric one without changing the recipe. You can switch from dining in to takeout without changing how you cook the food.</p><h2>The Project Structure of an AI App</h2><p>Now let&#8217;s apply these abstract layers to a concrete AI application.</p><h3>The Four-Layer Architectural Model in AI</h3><ol><li><p><strong>The Domain Layer (The &#8220;What&#8221;):</strong> In an AI project, this layer holds your <strong>Entities</strong> and <strong>Nodes</strong>. We typically define Entities using Pydantic models, such as a <code>Context</code> or an <code>Article</code>. Nodes are the specific units of AI logic. For example, an <code>ArticleWriterNode</code> contains the prompt and the logic required to generate text. A <code>ReviewerNode</code> contains the logic to evaluate a given piece of content. These nodes can be reused in different business use cases.</p></li><li><p><strong>The Application Layer (The &#8220;How&#8221;):</strong> <strong>Orchestration</strong> happens here. We use tools like <strong>LangGraph</strong>, <strong>DBOS</strong>, or <strong>Prefect</strong> to stitch the Nodes together into a coherent workflow. This layer dictates the sequence. 
For example: <em>&#8220;First research, then write, then review.&#8221;</em> We isolate this layer from how the data is stored or how the user triggers the workflow.</p></li><li><p><strong>The Infrastructure Layer (The &#8220;External Dependencies&#8221;):</strong> This layer houses the <strong>External Dependencies</strong>. It contains the concrete code that talks to Gemini or OpenAI. It connects to SQLite or Postgres databases to store the memory. It also loads documents and images from local disks or S3 buckets.</p></li><li><p><strong>The Serving Layer (The &#8220;Interface&#8221;):</strong> This exposes your AI logic to the world. In our writing workflow project, we serve it as both an MCP server and a CLI app. This allows the same backend to be used directly from the CLI or by MCP Clients like Cursor and Claude.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a16g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a16g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!a16g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!a16g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!a16g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a16g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144475,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/184308448?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a16g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!a16g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 848w, 
https://substackcdn.com/image/fetch/$s_!a16g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!a16g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e6f3cae-874a-46e6-bdb3-38909f962609_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 3: A four-layer architectural diagram for an AI application.</figcaption></figure></div><h3>Polymorphism and 
Decoupling Infrastructure</h3><p>The <strong>core benefit</strong> of this architecture is <strong>Polymorphism</strong>. Instead of hard-coding <code>GeminiModel</code> directly into your application layer, have that layer interact with an <strong>interface</strong>, an abstraction like <code>BaseLLM</code>.</p><p>This allows for effortless experimentation. You can switch from a live Gemini model to a <em>&#8220;Fake Model&#8221;</em> that returns static text for debugging, simply by changing a configuration file. The core workflow code doesn&#8217;t know the difference. In our writing agent example, we use this to easily swap between:</p><ul><li><p>A local markdown file loader and a cloud-based S3 loader.</p></li><li><p>An in-memory storage class and SQLite or PostgreSQL.</p></li><li><p>A live Gemini model and a fake model during testing.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!15-g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!15-g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 424w, https://substackcdn.com/image/fetch/$s_!15-g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 848w, https://substackcdn.com/image/fetch/$s_!15-g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!15-g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!15-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png" width="755" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:755,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217356,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/184308448?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F688981de-78f8-4a5b-9708-979d4945c9b5_755x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!15-g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 424w, https://substackcdn.com/image/fetch/$s_!15-g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 848w, https://substackcdn.com/image/fetch/$s_!15-g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 1272w, 
https://substackcdn.com/image/fetch/$s_!15-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe22469fb-e7b2-4370-a1a9-114989a52e40_755x264.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 4: Diagram illustrating polymorphism and decoupling infrastructure in an AI project.</figcaption></figure></div><h3>The Data Flow</h3><p>To see how these layers communicate, let&#8217;s look at a single tool request. 
When a user in Cursor asks the AI to <em>&#8220;generate a report,&#8221;</em> the request hits the <strong>MCP Server</strong> (Serving Layer). The server triggers a <strong>Builder</strong>. This builder looks at the config and instantiates the necessary infrastructure components: the Gemini model, the Markdown loader, and the memory client. The builder injects these concrete tools into the <strong>Orchestrator</strong> (Application Layer). The Orchestrator then runs the workflow. It passes data through the <strong>Domain Nodes</strong>. These nodes use the injected infrastructure to do the work. Finally, the result bubbles back up to the user.</p><p>Now, let&#8217;s quickly see how this looks within a Python folder structure, then design a concrete Python AI app using these principles.</p><h2>The Folder Structure of an AI App</h2><p>A clean architecture is useless if the folder structure is messy. We advocate for a structure that balances modern tooling with the separation of concerns we just discussed.</p><p>First, we use <strong>uv</strong> for managing virtual environments and dependencies. This keeps environment setup fast and its definition clean. We define this environment in <code>pyproject.toml</code>. We also use a <code>Makefile</code> for command shortcuts and a <code>configs</code> directory for YAML configurations. We place any documents and guidance files in <code>inputs</code>.</p><p>For the directory structure itself, we strongly recommend using the <code>src/&lt;package_name&gt;/</code> layout. This ensures that we install the application code as a proper Python package. It prevents import errors and makes the code extensible. We place CLI entry points in <code>scripts/</code> and experimentation files in <code>notebooks/</code>. We keep unit and integration tests in <code>tests/</code>. 
These folders import the core package but contain no business logic themselves.</p><p>Here is how the writing agent project implements this:</p><pre><code><code>writing-agent/
&#9500;&#9472;&#9472; pyproject.toml         # Dependency management (uv)
&#9500;&#9472;&#9472; Makefile               # Command shortcuts (e.g., brown generate)
&#9500;&#9472;&#9472; configs/               # YAML configurations for models/debug
&#9500;&#9472;&#9472; inputs/                # Markdown research and guidance files
&#9500;&#9472;&#9472; scripts/               # CLI entry points (import brown)
&#9500;&#9472;&#9472; notebooks/             # Exploratory Notebooks (import brown)
&#9500;&#9472;&#9472; tests/                 # Unit and integration tests
&#9492;&#9472;&#9472; src/
    &#9492;&#9472;&#9472; brown/             # Main Python package
        &#9500;&#9472;&#9472; entities/      # Domain: Pydantic data models
        &#9500;&#9472;&#9472; nodes/         # Domain: Actionable AI units (Prompt + Logic)
        &#9500;&#9472;&#9472; workflows/     # Application: Orchestrators (LangGraph logic)
        &#9500;&#9472;&#9472; models/        # Infrastructure: Gemini/OpenAI implementations
        &#9500;&#9472;&#9472; memory/        # Infrastructure: Memory/storage implementations
        &#9500;&#9472;&#9472; mcp/           # Serving: MCP server interface
        &#9500;&#9472;&#9472; evals/         # Application: Evaluation logic
        &#9500;&#9472;&#9472; observability/ # Infrastructure: Monitoring and tracing
        &#9500;&#9472;&#9472; utils/         # Shared utility functions
        &#9500;&#9472;&#9472; base.py        # Interfaces: Abstract base classes
        &#9500;&#9472;&#9472; builders.py    # Application: Dependency injection
        &#9500;&#9472;&#9472; loaders.py     # Infrastructure: File/data loaders
        &#9500;&#9472;&#9472; renderers.py   # Infrastructure: Content renderers
        &#9500;&#9472;&#9472; config.py      # Configuration
&#9492;&#9472;&#9472; config_app.py  # Application configuration</code></code></pre><p>Now, let&#8217;s design a concrete Python app using the principles we have learnt so far.</p><h2>Designing a Python AI App</h2><p>To bring this all together, let&#8217;s look at the design of a concrete AI app, such as a <strong>writing agent</strong>. We want to build a system that takes a topic, researches it, and writes an article.</p><p>In the <strong>domain layer</strong>, we define our <code>Article</code> and <code>Research</code> entities using Pydantic. We also define our <code>ArticleWriterNode</code>. This node is a self-contained unit. It holds the system prompt for the writer persona and the logic to call the LLM. It doesn&#8217;t know <em>which</em> LLM it is calling. It just knows it needs to generate text.</p><p>In the <strong>application layer</strong>, we use LangGraph to define the <code>generate_article</code> workflow. This orchestrator connects the nodes. It says: <em>&#8220;Take the output from the Loader, pass it to the Research Node, then pass that result to the Writer Node.&#8221;</em> It expects an object that satisfies the <code>LLM</code> interface, but it doesn&#8217;t care whether that object is Gemini or a mock.</p><p>In the <strong>infrastructure layer</strong>, we implement a <code>GeminiModel</code> class that wraps the Google GenAI SDK. We also implement a <code>MarkdownLoader</code> that reads guidance files from the disk. These implementations adhere to the abstract interfaces in <code>base.py</code> that the inner layers depend on.</p><p>Finally, in the <strong>serving layer</strong>, we build an <strong>MCP Server</strong>. This server imports the <code>generate_article</code> workflow. When a client connects, the server reads the configuration. It instantiates the <code>GeminiModel</code> and <code>MarkdownLoader</code>. Then it injects them into the workflow. 
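</p><p>Here is a minimal sketch of that injection step, under loud assumptions: the <code>generate</code> method on <code>BaseLLM</code>, the <code>FakeModel</code> stand-in, the <code>MODEL_REGISTRY</code> dict, and the <code>build_writer</code> helper are all hypothetical names chosen for illustration, not the actual signatures of the writing agent. It only shows the shape of the idea: the node depends on an interface, and a builder picks the concrete class from configuration.</p><pre><code>
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class BaseLLM(ABC):
    """Interface the inner layers depend on (hypothetical signature)."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class FakeModel(BaseLLM):
    """Infrastructure stand-in that returns static text for debugging."""

    def generate(self, prompt: str) -> str:
        return "static draft"


@dataclass
class ArticleWriterNode:
    """Domain node: owns the prompt and the logic, not the provider."""

    llm: BaseLLM
    system_prompt: str = "You are a technical writer."

    def run(self, research: str) -> str:
        return self.llm.generate(f"{self.system_prompt}\n\nResearch:\n{research}")


# Hypothetical registry; a real GeminiModel class would be registered here too.
MODEL_REGISTRY: dict[str, type[BaseLLM]] = {"fake": FakeModel}


def build_writer(config: dict[str, str]) -> ArticleWriterNode:
    """Builder: turns configuration into a node with injected dependencies."""
    model_cls = MODEL_REGISTRY[config["model"]]
    return ArticleWriterNode(llm=model_cls())


# Serving-layer style usage: read config, build, run.
writer = build_writer({"model": "fake"})
draft = writer.run("Notes on clean architecture.")
```
</code></pre><p>In this sketch, switching providers would mean registering another <code>BaseLLM</code> subclass and changing the config value, while <code>ArticleWriterNode</code> stays untouched.</p><p>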
This injection process connects the application's abstract needs with the infrastructure's concrete tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jt1y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jt1y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Jt1y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Jt1y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Jt1y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jt1y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png" width="1200" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144685,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/184308448?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jt1y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Jt1y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Jt1y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Jt1y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6ac679c-75d1-4be8-9ee9-6e3778323e7e_1200x1200.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 5: Applying the pragmatic clean architecture design for designing the writing agent app.</figcaption></figure></div><p>Here is the data flow of how a request travels through these layers in the writing agent example:</p><ol><li><p><strong>Client sends a request.</strong> The <strong>MCP Client</strong> sends a tool request (e.g., <code>generate_article</code>) to the <strong>MCP Server</strong> (Serving Layer).</p></li><li><p><strong>Serving Layer receives the request and builds the infrastructure.</strong> The MCP Server receives the request and triggers the <strong>Builders</strong> to instantiate the necessary infrastructure components (like the <code>GeminiModel</code> and <code>MarkdownLoader</code>) based on the configuration.</p></li><li><p><strong>Serving instantiates Application.</strong> The Serving 
Layer instantiates the <strong>Application</strong> (specifically, the <code>GenerateArticleWorkflow</code>) and injects the infrastructure instances built in step 2 into the Orchestrator.</p></li><li><p><strong>Application executes workflow.</strong> The Orchestrator (Application Layer) executes the workflow logic, which involves calling <strong>Domain Nodes</strong> in sequence.</p></li><li><p><strong>Domain Nodes perform logic using infrastructure.</strong> The <strong>Domain Nodes</strong> perform their core logic, utilizing the injected infrastructure via interfaces (e.g., <code>loader.load()</code>, <code>model.invoke()</code>).</p></li><li><p><strong>Final Article returned to Client.</strong> The final <strong>Article</strong> (Entity) is returned up the stack: from Domain Nodes to the Orchestrator, then to the MCP Server, and finally back to the <strong>Client</strong>.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_x3e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_x3e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 424w, https://substackcdn.com/image/fetch/$s_!_x3e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 848w, 
https://substackcdn.com/image/fetch/$s_!_x3e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!_x3e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_x3e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png" width="1456" height="598" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3863397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.decodingai.com/i/184308448?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_x3e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 424w, 
https://substackcdn.com/image/fetch/$s_!_x3e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 848w, https://substackcdn.com/image/fetch/$s_!_x3e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!_x3e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aab8741-dda6-4d53-86f1-6a7f944c2430_2446x1004.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 6: The data flow of how a request travels through these layers in the writing agent example (by NotebookLM).</figcaption></figure></div><p>Before wrapping up, let&#8217;s look at the three biggest mistakes I keep seeing, both on the projects I&#8217;ve worked on and on social media.</p><h2>The 3 Biggest Mistakes</h2><p>Even with good intentions, it is easy to misapply these principles. Here are the three most common traps.</p><h3>1. Keeping a Flat Hierarchy vs. Rigid Layers</h3><p>The biggest mistake is interpreting the four layers as physical folders that you <em>must</em> create. I have seen projects where developers create top-level directories named <code>domain</code>, <code>application</code>, <code>infrastructure</code>, and <code>interface</code>. They force every file into one of them.</p><p>These layers are <strong>virtual concepts</strong> to help you manage dependencies. They are not a rigid filing system. Structuring your project this way often leads to circular imports. It also creates confusion about where a file belongs. For example, if you have a <code>User</code> object, does it go in <code>domain</code>? But if it also carries database annotations, does it belong in <code>infrastructure</code> instead? This ambiguity slows you down. Instead, keep a flatter hierarchy scoped to your app&#8217;s requirements.</p><p>Here is an example of a rigid, layer-based structure that you should avoid:</p><pre><code>writing-agent/
&#9500;&#9472;&#9472; domain/
&#9474;   &#9500;&#9472;&#9472; entities.py
&#9474;   &#9492;&#9472;&#9472; nodes.py
&#9500;&#9472;&#9472; application/
&#9474;   &#9492;&#9472;&#9472; workflows.py
&#9500;&#9472;&#9472; infrastructure/
&#9474;   &#9500;&#9472;&#9472; models.py
&#9474;   &#9492;&#9472;&#9472; memory.py
&#9492;&#9472;&#9472; interface/
    &#9492;&#9472;&#9472; mcp.py</code></pre><p>And here is a flatter structure that works better. Notice how each module sits at the same level, and we respect the layer boundaries only virtually, not through folder nesting:</p><pre><code>writing-agent/
&#9500;&#9472;&#9472; entities/       # Domain
&#9500;&#9472;&#9472; nodes/          # Domain
&#9500;&#9472;&#9472; workflows/      # Application
&#9500;&#9472;&#9472; evals/          # Application
&#9500;&#9472;&#9472; models/         # Infrastructure
&#9500;&#9472;&#9472; memory/         # Infrastructure
&#9500;&#9472;&#9472; observability/  # Infrastructure
&#9500;&#9472;&#9472; mcp/            # Serving
&#9492;&#9472;&#9472; utils/          # Shared utilities</code></pre><h3>2. Organizing by &#8220;Actionability&#8221;</h3><p>Another common fallacy is the &#8220;Folder-per-Type&#8221; structure. This occurs when we create folders like <code>/prompts</code>, <code>/nodes</code>, and <code>/chains</code>. We then scatter the logic for a single feature across all of them.</p><pre><code>code-agent/
&#9500;&#9472;&#9472; prompts/
&#9474;   &#9500;&#9472;&#9472; code_reviewer.py
&#9474;   &#9492;&#9472;&#9472; code_generator.py
&#9500;&#9472;&#9472; nodes/
&#9474;   &#9500;&#9472;&#9472; code_reviewer.py
&#9474;   &#9492;&#9472;&#9472; code_generator.py
&#9492;&#9472;&#9472; chains/
    &#9500;&#9472;&#9472; code_reviewer.py
    &#9492;&#9472;&#9472; code_generator.py</code></pre><p>This makes the code hard to read and modify. If you want to change the <code>CodeReviewer</code>, you have to open three different files in three different folders. This increases cognitive load and makes debugging harder.</p><p>Instead, organize by <strong>Actionability</strong>. We keep everything related to a specific task in one place. This includes the class, the system prompt, and the utility methods. A good sanity check is: <strong>Can I copy-paste this module into another project and have it still make sense?</strong> If your <code>CodeReviewer</code> logic is self-contained in a single module or package, you have designed it well. This modularity allows you to treat your code like &#8220;Lego bricks&#8221; that can be reused across different agents.</p><pre><code>src/
&#9492;&#9472;&#9472; brown/
    &#9492;&#9472;&#9472; chains/
        &#9500;&#9472;&#9472; code_reviewer/
        &#9474;   &#9500;&#9472;&#9472; prompts.py
        &#9474;   &#9500;&#9472;&#9472; nodes.py
        &#9474;   &#9492;&#9472;&#9472; chain.py
        &#9492;&#9472;&#9472; code_generator/
            &#9500;&#9472;&#9472; prompts.py
            &#9500;&#9472;&#9472; nodes.py
            &#9492;&#9472;&#9472; chain.py</code></pre><h3>3. The &#8220;Pragmatic&#8221; Middle Ground</h3><p>Finally, avoid over-engineering. Python is not Java. Strictly following Clean Architecture <em>&#8220;by the book&#8221;</em> can lead to a nightmare of duplicated code and unnecessary abstractions.</p><p>You need to pick the wins that add real value. For example, if you don&#8217;t plan to change your database, say from Postgres to MySQL, applying the clean architecture pattern to write generic Object-Relational Mapping (ORM) abstractions will be a complete waste of time.</p><p>Engineers tend to write abstractions for &#8220;just in case&#8221; scenarios, anticipating future reuse. I am sure you often say to yourself, &#8220;What if I need to reuse this in the future? Let&#8217;s refactor it.&#8221; I certainly do.</p><p>In some big-tech environments, engineers even use <strong>Tags</strong> on large entities to handle different layer requirements, rather than duplicating their data structures as separate Data Transfer Objects (DTOs) for every layer. This couples the layers tightly, but if you don&#8217;t plan to change them, it works!</p><p>Decouple only what is worth decoupling! For example, in our writing agent, we knew we wanted to swap between multiple file operators that write either to disk or to S3 buckets.</p><p>To conclude, if you never plan to swap a particular piece of infrastructure, don&#8217;t bother decoupling it and making your code more complicated than it should be. Be pragmatic. If an abstraction makes the code harder to read without adding immediate value, delete it.</p><h2>Conclusion</h2><p>Clean Architecture is a powerful mental model, but it makes for a terrible, rigid rulebook. If you try to force every Python script into a strict four-folder hierarchy, you will end up hating your codebase.</p><p>Instead, treat these patterns as a tool- and framework-agnostic mind map. 
Whether you are building with LangGraph, FastAPI, or vanilla Python, these principles provide a solid foundation that outlasts any specific library or trend.</p><p>Ultimately, the goal is not to have a &#8220;perfect&#8221; architecture. The goal is to have a system that is easy to change, easy to test, and easy to understand. Start with these principles, but always prioritize simplicity over purity.</p><p>See you next Tuesday.</p><p><a href="https://www.pauliusztin.ai/">Paul Iusztin</a></p><div><hr></div><p><em>What&#8217;s your opinion? Do you agree, disagree, or is there something I missed?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-design-python-ai-projects/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/how-to-design-python-ai-projects/comments"><span>Leave a comment</span></a></p><div><hr></div><p><em>Enjoyed the article? 
The most sincere compliment is to share our work.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.decodingai.com/p/how-to-design-python-ai-projects?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.decodingai.com/p/how-to-design-python-ai-projects?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><div><hr></div><h2>Go Deeper</h2><p>Everything you learned in this article, from building evals datasets to evaluators, comes from the AI Evals &amp; Observability module of our <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Agentic AI Engineering self-paced course</a>.</p><p><strong>Your path to agentic AI for production. </strong>Built in partnership with <a href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31">Towards AI</a>.</p><p>Across <strong>34 lessons</strong> (articles, videos, and a lot of code), you&#8217;ll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you&#8217;ll have built a multi-agent system that orchestrates <strong>Nova</strong> (a deep research agent) and <strong>Brown</strong> (a full writing workflow), plus a <strong>capstone project</strong> where you apply everything on your own. </p><p><em>Three portfolio projects and a certificate to show off in interviews. 
Plus a Discord community where you have direct access to other industry experts and me.</em></p><p>Rated 4.9/5 &#11088;&#65039; by 190+ early students &#8212; &#8220;Every AI Engineer needs a course like this.&#8221;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31&quot;,&quot;text&quot;:&quot;Learn more&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/courses/agent-engineering?ref=b3ab31"><span>Learn more</span></a></p><p><em>Not ready to commit?</em> We also prepared a free 6-day email course to reveal the <em><strong>6 critical mistakes that silently destroy agentic systems. </strong><a href="https://email-course.towardsai.net/?ref=b3ab31">Get the free email course.</a></em></p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> (by Comet) for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item><item><title><![CDATA[The Realistic Guide to Mastering AI Agents in 2026]]></title><description><![CDATA[From zero to production in 6&#8211;9 months. 
What to learn, what to skip, and why most tutorials fail.]]></description><link>https://www.decodingai.com/p/realistic-guide-to-ai-agents-in-2026</link><guid isPermaLink="false">https://www.decodingai.com/p/realistic-guide-to-ai-agents-in-2026</guid><dc:creator><![CDATA[Paolo Perrone]]></dc:creator><pubDate>Tue, 06 Jan 2026 12:02:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zA4h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Paul:</strong> <em>Today&#8217;s spotlight: Paolo Perrone, master of turning tech into scroll-stopping content. </em>This one&#8217;s packed, let&#8217;s go &#128064; &#8595;</p><div><hr></div><p>I&#8217;m going to be honest with you.</p><p>Most AI agent tutorials are garbage.</p><p>They show you how to copy-paste LangChain code, build a demo that breaks the moment you try anything real, and leave you feeling like you learned something. Three months later, you try to build something that wasn&#8217;t in the tutorial, and you&#8217;re completely stuck.</p><p>I&#8217;ve watched people waste years this way. Chasing frameworks. Collecting certificates. Building toy projects nobody cares about.</p><p>This guide is different.</p><p>What you&#8217;re reading is the result of a lot of trial and error, figuring out what matters when learning agentic AI. I&#8217;ll tell you how long each phase takes and what &#8220;good enough&#8221; looks like before moving on. Every resource you need is here. And this path produces people who can actually ship &#8212; not just follow along.</p><p>Here&#8217;s my promise: if you follow this roadmap seriously for 6&#8211;9 months, you&#8217;ll be able to build and deploy AI agents that work in the real world. Not demos. 
Systems that solve problems.</p><p>What you&#8217;ll get:</p><ul><li><p>The 8 phases from zero to deployed agents (with realistic timelines)</p></li><li><p>Which resources are actually worth your time</p></li><li><p>The specialization paths that lead to real jobs</p></li><li><p>What &#8220;good enough&#8221; looks like before moving on</p></li><li><p>The mistakes I made so you don&#8217;t have to</p></li></ul><p>But you have to do the work. Not skim. Not bookmark for later. Not tell yourself you&#8217;ll get to the math eventually.</p><p>If that&#8217;s you, let&#8217;s go.</p><p><em>But first, a quick word from our sponsor, Opik</em> &#8595;</p><div><hr></div><h2><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik: Open-Source LLMOps Platform (Sponsored)</a></h2><p>This <strong>AI Agents Foundations</strong> series is brought to you by <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more. </p><p>But most importantly, we are incredibly grateful to be supported by a tool that we personally love and keep returning to for all our open-source courses and real-world AI products. 
<em>Why?</em> Because it makes escaping the PoC purgatory possible!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png" width="1200" height="400" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik 
Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!yeD8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 424w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 848w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1272w, https://substackcdn.com/image/fetch/$s_!yeD8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaa636a-b1a6-429b-83d8-080bc218e596_1200x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p>Here is how Opik helps us ship AI workflows and agents to production:</p><ul><li><p><em>We see everything</em> - Visualize complete traces of LLM calls, including costs and latency breakdowns at each reasoning step.</p></li><li><p><em>Easily optimize our system</em> - Measure our performance using custom LLM judges, run experiments, compare results and pick the best configuration.</p></li><li><p><em>Catch issues quickly</em> - Plug the LLM Judge metrics into production traces and receive on-demand alarms.</p></li><li><p><em>Stop manual prompt engineering</em> - Their prompt versioning and optimization features allow us to track and improve our system automatically. The future of AutoAI.</p></li></ul><p><em><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> is fully open-source and works with custom code or most AI frameworks. 
You can also use the managed version for free (w/ 25K spans/month on their generous free tier).</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><p><em>&#8595;</em>  <em>Now, let&#8217;s move back to the article.</em></p><h2>Why Agentic AI Matters Right Now</h2><p>Traditional AI is reactive. Input in, output out. You ask a question, you get an answer. That&#8217;s it.</p><p>Agentic AI pursues goals. It looks at its environment, makes plans, takes actions, sees what happened, and adjusts. It can use tools, call APIs, search the web, write code, and work with other agents to get things done.</p><p>Here&#8217;s a concrete example.</p><p>Ask traditional AI to help you research competitors and it&#8217;ll summarize what it already knows.</p><p>Ask an agentic system and it will search the web for your competitors&#8217; recent moves, pull their press releases and funding announcements, analyze their positioning, cross-reference with industry reports, write up a strategic brief, save it to your drive, and email you when it&#8217;s done. While you&#8217;re asleep.</p><p>This is why companies want people who can build these systems. By 2026, agents won&#8217;t be impressive anymore &#8212; they&#8217;ll be expected. 
The question is whether you&#8217;ll know how to build them.</p><h2>The Roadmap</h2><p>Here&#8217;s the full picture before we get into details:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zA4h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zA4h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 424w, https://substackcdn.com/image/fetch/$s_!zA4h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 848w, https://substackcdn.com/image/fetch/$s_!zA4h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 1272w, https://substackcdn.com/image/fetch/$s_!zA4h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zA4h!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png" width="1200" height="696.4285714285714" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:845,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zA4h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 424w, https://substackcdn.com/image/fetch/$s_!zA4h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 848w, https://substackcdn.com/image/fetch/$s_!zA4h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 1272w, https://substackcdn.com/image/fetch/$s_!zA4h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ecfb9c-c928-4311-ad7f-6b54ac4fafa4_4290x2490.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Phase 1: Math</h2><p><strong>4&#8211;6 weeks</strong></p><p>You want to skip this. I know.</p><p>But without linear algebra, calculus, and probability, you won&#8217;t understand why your agents do what they do. You&#8217;ll copy code that works until it doesn&#8217;t, and then you&#8217;ll have no idea how to fix it.</p><p>You don&#8217;t need to become a mathematician. You need working fluency in three areas.</p><ol><li><p><strong>Linear Algebra.</strong> Vectors, matrices, eigenvalues, SVD. Neural networks are matrix math. Embeddings are vectors. This is the foundation.</p></li><li><p><strong>Calculus.</strong> Derivatives, gradients, optimization. This is how models learn.</p></li><li><p><strong>Probability &amp; Statistics.</strong> Bayes&#8217; theorem, distributions, hypothesis testing. Agents reason under uncertainty. 
This is how.</p></li></ol><h3>Resources</h3><p><strong>Linear Algebra:</strong></p><ul><li><p><a href="https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab">3Blue1Brown: Essence of Linear Algebra</a>. The best visual explanation of linear algebra ever made. Grant Sanderson has a gift for making abstract concepts feel intuitive. Start here.</p></li><li><p><a href="https://www.khanacademy.org/math/linear-algebra">Khan Academy Linear Algebra</a>. More traditional, more comprehensive. Good for filling gaps after 3Blue1Brown.</p></li><li><p><a href="https://www.youtube.com/watch?v=Gv9_4yMHFhI">Machine Learning Foundations: Welcome to the Journey</a>. Specifically designed for ML, so you&#8217;re learning math with a purpose.</p></li><li><p><a href="https://www.youtube.com/watch?v=uZeDTwWcnuY">Math for Machine Learning</a>. Another ML-focused option. Pick whichever style clicks for you.</p></li></ul><p><strong>Calculus:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=5yfh5cf4-0w">Calculus for Machine Learning</a>. Targeted and practical. Skips the stuff you won&#8217;t need.</p></li><li><p><a href="https://www.khanacademy.org/math/calculus-1">Khan Academy Calculus 1</a>. The classic. Thorough and reliable.</p></li><li><p><a href="https://www.youtube.com/watch?v=HfACrKJ_Y2w">Calculus 1 Full College Course</a>. If you want the full university experience without the tuition.</p></li></ul><p><strong>Probability:</strong></p><ul><li><p><a href="https://www.khanacademy.org/math/statistics-probability">Khan Academy Statistics and Probability</a>. Covers everything you need at a good pace.</p></li><li><p><a href="https://www.youtube.com/playlist?list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9">StatQuest Statistics Fundamentals</a>. Josh Starmer explains stats like he&#8217;s talking to a friend. Genuinely enjoyable.</p></li><li><p><a href="https://www.youtube.com/watch?v=9wCnvr7Xw4E">StatQuest: Bayes&#8217; Theorem</a>. Bayes is everywhere in AI. 
This video makes it click.</p></li><li><p><a href="https://ocw.mit.edu/courses/6-041-probabilistic-systems-analysis-and-applied-probability-fall-2010/">MIT OpenCourseWare: Introduction to Probability</a>. University-level rigor if you want to go deeper.</p></li></ul><p><strong>Textbook:</strong></p><ul><li><p><em>Mathematics for Machine Learning</em> (<a href="https://mml-book.github.io/">free PDF</a>). The comprehensive reference. Dense but complete. Good to have on hand when you need to look something up.</p></li></ul><h3>When to move on</h3><p>You can explain what matrix multiplication does geometrically. You can compute a gradient by hand. You can explain Bayes&#8217; theorem with an example. You don&#8217;t need mastery. Just enough that you&#8217;re not lost when these concepts show up later.</p><h2>Phase 2: Programming</h2><p><strong>3&#8211;4 weeks</strong></p><p>Python. There&#8217;s no alternative.</p><p>But knowing Python syntax isn&#8217;t the same as being comfortable writing code. You need to be able to read other people&#8217;s code without struggling, write your own without constantly Googling, and debug when things go wrong.</p><p>You also need the libraries for working with data.</p><h3>Core Python</h3><p>Functions, classes, decorators, error handling, async. You&#8217;ll need all of it.</p><ul><li><p><a href="https://www.youtube.com/watch?v=rfscVS0vtbw">Learn Python: Full Course for Beginners</a>. 4+ hours, covers everything. Good if you&#8217;re starting from zero.</p></li><li><p><a href="https://www.youtube.com/watch?v=JJmcL1N2KQs">Python Crash Course for Beginners</a>. Faster paced. Better if you&#8217;ve programmed in other languages.</p></li><li><p><em>Learn Python the Hard Way</em> (textbook). Old-school approach: learn by typing code until it sticks. Works for some people.</p></li></ul><h3>Data Libraries</h3><ul><li><p><strong>NumPy</strong> for arrays and numerical computing. 
The foundation everything else builds on.</p></li><li><p><strong>Pandas</strong> for data manipulation. You&#8217;ll use this every single day.</p></li><li><p><strong>Matplotlib &amp; Seaborn</strong> for visualization. You can&#8217;t debug what you can&#8217;t see.</p></li></ul><p>Resources:</p><ul><li><p><a href="https://www.youtube.com/watch?v=r-uOLxNrNk8">Data Analysis with Python (NumPy, Pandas, Matplotlib, Seaborn)</a>. All-in-one tutorial that covers the whole stack.</p></li><li><p><a href="https://python-course.eu/numerical-programming/">NumPy, Matplotlib and Pandas tutorials by Bernd Klein</a>. Written format with good depth. Nice complement to video.</p></li></ul><h3>Optional: R</h3><p>If you&#8217;re coming from statistics or want research roles, R is useful. Otherwise skip it.</p><ul><li><p><a href="https://www.youtube.com/watch?v=_V8eKsto3Ug">R Programming in One Hour</a>. Exactly what it sounds like. Quick orientation.</p></li><li><p><em>R for Data Science</em> (<a href="https://r4ds.had.co.nz/">free online</a>). The definitive R book. Hadley Wickham knows what he&#8217;s doing.</p></li></ul><h3>When to move on</h3><p>You can write a script from scratch without looking up basic syntax. You can load a CSV, clean the data, run some analysis, and plot the results. When you see NumPy code, you understand what it&#8217;s doing.</p><h2>Phase 3: Machine Learning</h2><p><strong>6&#8211;8 weeks</strong></p><p>This is where people get stuck forever.</p><p>They watch course after course, never feeling ready to move on. Don&#8217;t do that. The goal isn&#8217;t to become an ML researcher. It&#8217;s to understand the main approaches well enough to know which one fits which problem.</p><h3>The Three Types</h3><ol><li><p><strong>Supervised Learning.</strong> You show the model examples with correct answers, it learns the pattern. <strong>Algorithms to know</strong>: linear regression, logistic regression, decision trees, SVMs, k-nearest neighbors, neural networks. 
<strong>When to use it</strong>: classification, prediction, anywhere you have labeled data.</p></li><li><p><strong>Unsupervised Learning.</strong> The model finds patterns in data without being told what to look for. <strong>Algorithms to know</strong>: k-means clustering, hierarchical clustering, DBSCAN, PCA. <strong>When to use it</strong>: grouping similar things, reducing dimensions, finding structure.</p></li><li><p><strong>Reinforcement Learning.</strong> An agent takes actions, gets rewards or penalties, and learns from experience. <strong>Concepts to know</strong>: states, actions, rewards, policies, Q-learning. <strong>When to use it</strong>: sequential decisions, games, robotics, planning. This matters a lot for agents.</p></li></ol><h3>Resources</h3><p><strong>The main course:</strong></p><ul><li><p><a href="https://www.coursera.org/specializations/machine-learning-introduction">Machine Learning Specialization by Andrew Ng</a>. This is the one. Ng is the best teacher in the field. Clear explanations, good pacing, covers what matters. Worth paying for the certificate if you want it on your resume.</p></li><li><p><a href="https://www.youtube.com/playlist?list=PLkDaE6sCZn6FNC6YRfRQc_FbeQrF8BwGI">Same course on YouTube</a>. If you just want to learn without the certificate.</p></li></ul><p><strong>Other options:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=i_LwzRVP7bg">Machine Learning for Everybody</a>. More accessible if Ng feels too academic.</p></li><li><p><a href="https://www.kaggle.com/learn/intro-to-machine-learning">Kaggle: Intro to Machine Learning</a>. Short, hands-on, gets you building fast. Good supplement.</p></li><li><p><a href="https://www.youtube.com/watch?v=GwIo3gDZCVQ">Edureka: Machine Learning Full Course</a>. Comprehensive alternative if Ng&#8217;s style doesn&#8217;t work for you.</p></li><li><p><a href="https://course.fast.ai/">Fast.ai Practical Deep Learning</a>. 
Top-down approach: start building, learn theory as needed. Some people swear by this.</p></li></ul><p><strong>For practice:</strong></p><ul><li><p><a href="https://www.youtube.com/playlist?list=PLQVvvaa0QuDd0flgGphKCej-9jp-QdzZ3">Scikit-learn tutorials</a>. Implement what you&#8217;re learning. Theory without code is useless.</p></li></ul><h3>When to move on</h3><p>You can explain supervised vs unsupervised vs reinforcement learning and give examples of when you&#8217;d use each. You&#8217;ve trained a model with scikit-learn and can explain what the metrics mean.</p><h2>Phase 4: How Agents Work</h2><p><strong>4&#8211;6 weeks</strong></p><p>Agents have memory, use tools, and plan ahead. That&#8217;s different from a chatbot.</p><p>Understanding these parts is what separates people who can glue APIs together from people who can design systems that hold up.</p><h3>The Basic Loop</h3><p>Every agent does some version of this:</p><ol><li><p><strong>Perceive.</strong> Take in information (user input, search results, API responses).</p></li><li><p><strong>Reason.</strong> Process it, figure out what matters.</p></li><li><p><strong>Plan.</strong> Decide what to do.</p></li><li><p><strong>Act.</strong> Do it (call a tool, generate text, hit an API).</p></li><li><p><strong>Learn.</strong> See what happened, adjust.</p></li></ol><h3>Concepts You Need</h3><p><strong>Memory:</strong></p><ul><li><p>Short-term: what&#8217;s in the context window right now</p></li><li><p>Long-term: vector databases, stored knowledge</p></li><li><p>Episodic: records of past interactions</p></li></ul><p><strong>Reasoning patterns:</strong></p><ul><li><p>Chain-of-thought: thinking step by step</p></li><li><p>Tree-of-thought: exploring multiple paths</p></li><li><p>ReAct: alternating between reasoning and acting</p></li></ul><p><strong>Tool use:</strong></p><ul><li><p>How agents call external tools</p></li><li><p>How to handle failures</p></li><li><p>Orchestrating multiple 
tools</p></li></ul><p><strong>Planning:</strong></p><ul><li><p>Breaking goals into steps</p></li><li><p>Search algorithms (A*, etc.)</p></li><li><p>Hierarchical planning</p></li></ul><p><strong>Multi-agent systems:</strong></p><ul><li><p>Multiple agents working together</p></li><li><p>How they communicate</p></li><li><p>Specialization</p></li></ul><h3>Resources</h3><p><strong>Concepts:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=sal78ACtGTc">The Power of AI Agents and Agentic AI Explained</a>. Good starting point. Covers the landscape without getting too technical.</p></li><li><p><a href="https://medium.com/data-science-collective/ai-agents-in-5-levels-of-difficulty-with-full-code-implementation-15d794becfb8">AI Agents in 5 Levels of Difficulty</a>. Starts simple, gets complex. Full code for each level. Great for seeing how agent complexity scales.</p></li><li><p><a href="https://medium.com/data-science-collective/the-complete-guide-to-building-your-first-ai-agent-its-easier-than-you-think-c87f376c84b2">The Complete Guide to Building Your First AI Agent</a>. Hands-on walkthrough. Good if you want to build something today.</p></li></ul><p><strong>Reinforcement learning (important for agents):</strong></p><ul><li><p><a href="https://huggingface.co/learn/deep-rl-course">Hugging Face Deep RL Course</a>. Excellent and free. This is how agents learn to make decisions over time. Don&#8217;t skip it.</p></li><li><p><a href="https://github.com/azminewasi/Curated-Reinforcement-Learning-Resources">Curated RL Resources on GitHub</a>. Massive collection if you want to go deeper.</p></li></ul><h3>When to move on</h3><p>You can draw the agent loop on a whiteboard and explain each part. You can describe different memory architectures. 
You understand ReAct and why it works.</p><h2>Phase 5: Building With Frameworks</h2><p><strong>6&#8211;8 weeks</strong></p><h3>Patterns</h3><ol><li><p><strong>ReAct.</strong> The agent thinks about what to do, does it, observes the result, thinks again. Simple and works for a lot of cases.</p></li><li><p><strong>Plan-and-Execute.</strong> The agent makes a full plan first, then executes step by step. Better for complex multi-step tasks.</p></li><li><p><strong>Multi-Agent.</strong> Multiple specialized agents working together. One researches, one writes, one reviews.</p></li></ol><h3>Frameworks</h3><ol><li><p><strong>LangChain / LangGraph.</strong> The current standard. LangChain for simpler stuff, LangGraph when you need complex state management.</p></li><li><p><strong>AutoGen.</strong> Microsoft&#8217;s multi-agent framework. Good when agents need to have back-and-forth discussions.</p></li><li><p><strong>CrewAI.</strong> Higher-level multi-agent orchestration. Faster to prototype, less flexible.</p></li></ol><h3>Resources</h3><p><strong>Courses:</strong></p><ul><li><p><a href="https://learn.deeplearning.ai/courses/agentic-ai/information">DeepLearning.AI: Agentic AI</a>. Andrew Ng teaching agent design patterns. Covers reflection, tool use, planning, and multi-agent. Free and worth your time.</p></li><li><p><a href="https://www.youtube.com/watch?v=e2zIr_2JMbE">Master ALL 20 Agentic AI Design Patterns</a>. Covers patterns you&#8217;ll use constantly. Bookmark it.</p></li></ul><p><strong>LangChain:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=lG7Uxts9SXs">LangChain Crash Course</a>. Quick start. Gets you building in an afternoon.</p></li><li><p><a href="https://www.youtube.com/watch?v=Cyv-dgv80kE">LangChain Mastery: Full 5 Hour Course</a>. The deep dive. Watch this when you&#8217;re ready to get serious.</p></li><li><p><a href="https://python.langchain.com/docs/">LangChain docs</a>. The source of truth. 
You&#8217;ll live here.</p></li><li><p><a href="https://langchain-ai.github.io/langgraph/">LangGraph docs</a>. For when your agents need real state management.</p></li></ul><p><strong>Multi-agent:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=Zz8GAza49kg">Simplilearn: How to Build a Multi-Agent System</a>. Practical walkthrough. Good for your first multi-agent project.</p></li></ul><h3>When to move on</h3><p>You&#8217;ve built at least three agents: a simple ReAct agent, something with RAG, and a multi-step workflow. You know when to use LangChain vs LangGraph.</p><h2>Phase 6: Pick a Specialization</h2><p><strong>8&#8211;12 weeks, then ongoing</strong></p><p>At some point you need to go deep in one area. Generalists can talk about agents. Specialists get hired to build them.</p><p>Pick one of these and commit for at least 3 months before deciding it&#8217;s not for you.</p><h3>Path A: Business Automation</h3><p>The biggest market right now. Agents that handle research, support, operations. Real budgets, real jobs.</p><p><strong>What to focus on:</strong></p><ul><li><p>RAG (retrieval-augmented generation). Teaching agents to search and use knowledge.</p></li><li><p>API integrations. Connecting agents to the tools businesses already use.</p></li><li><p>Multi-step workflows. Complex processes with handoffs and error handling.</p></li><li><p>Human-in-the-loop patterns. Knowing when to escalate to a person.</p></li></ul><p><strong>Projects to build:</strong></p><ul><li><p>Email assistant that drafts responses based on context</p></li><li><p>Competitor research agent that monitors news and summarizes changes</p></li><li><p>Customer support bot that knows when it&#8217;s out of its depth</p></li><li><p>Report generator that pulls from multiple data sources</p></li></ul><h3>Path B: Robotics</h3><p>Higher barrier, less competition. 
Agents that operate in the physical world.</p><p><strong>What to focus on:</strong></p><ul><li><p>ROS (Robot Operating System). The standard for robotics software.</p></li><li><p>Computer vision. How robots see.</p></li><li><p>Path planning. How robots navigate.</p></li><li><p>Simulation. Test without breaking expensive hardware.</p></li></ul><p><strong>Resources:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=rDnzS5w5oQk">Introduction to Autonomous Robotics</a>. Solid starting point for the field.</p></li><li><p><a href="https://www.sciencedirect.com/journal/robotics-and-autonomous-systems">Robotics and Autonomous Systems journal</a>. Academic but useful for seeing what&#8217;s cutting edge.</p></li><li><p><a href="https://github.com/bulletphysics/bullet3">PyBullet</a> for physics simulation, <a href="https://gazebosim.org/docs/latest/getstarted/">Gazebo</a> for robot environments, <a href="https://github.com/openai/gym">OpenAI Gym</a> for RL training loops.</p></li></ul><h3>Path C: Research &amp; Model Development</h3><p>Fewer jobs, higher ceiling. This is for people who want to work on the models themselves, not just use them.</p><p><strong>What to focus on:</strong></p><ul><li><p>Fine-tuning LLMs with LoRA and PEFT. Making models better at specific tasks.</p></li><li><p>RLHF and reward modeling. Training models from human feedback.</p></li><li><p>Evaluation and benchmarking. Measuring what actually works.</p></li><li><p>Reading and implementing papers. Staying at the frontier.</p></li></ul><p><strong>Projects to build:</strong></p><ul><li><p>Fine-tune a model for a specific domain</p></li><li><p>Build an evaluation pipeline for agent outputs</p></li><li><p>Implement a recent paper from scratch</p></li><li><p>Contribute to an open source model</p></li></ul><p><strong>Resources:</strong></p><ul><li><p><a href="https://huggingface.co/docs/transformers">Hugging Face Transformers docs</a>. 
The reference for working with models.</p></li><li><p><a href="https://www.youtube.com/watch?v=2MBJOuVq380">RLHF Course</a>. How to train models from human preferences.</p></li><li><p><a href="https://arxiv.org/list/cs.AI/recent">arXiv AI papers</a> and <a href="https://arxiv.org/list/cs.LG/recent">ML papers</a>. Where new ideas show up first.</p></li></ul><h2>Phase 7: Deployment</h2><p><strong>3&#8211;4 weeks</strong></p><p>Your agent works in your notebook. Then a user sends an input you didn&#8217;t expect and the whole thing crashes. Production exposes every shortcut you took.</p><p>Most self-taught people skip this part, which is exactly why learning it makes you stand out.</p><h3>What you need</h3><ol><li><p><strong>APIs.</strong> Expose your agent as a service. FastAPI is the standard.</p></li><li><p><strong>Containers.</strong> Package everything so it runs the same anywhere. Docker.</p></li><li><p><strong>Cloud.</strong> Run at scale. Pick AWS, GCP, or Azure and learn one well.</p></li><li><p><strong>Monitoring.</strong> Track what your agent is doing in production. You&#8217;ll be surprised how often it misbehaves.</p></li><li><p><strong>Cost management.</strong> LLM calls add up fast. Caching, model selection, prompt efficiency all matter.</p></li></ol><h3>Resources</h3><p><strong>Overview:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=ozwEXoepf2A">Deploying Agentic AI in Production</a>. Big picture of what deployment involves.</p></li><li><p><a href="https://medium.com/data-science-collective/why-most-ai-agents-fail-in-production-and-how-to-build-ones-that-dont-f6f604bcd075">Why Most AI Agents Fail in Production</a>. Learn from other people&#8217;s mistakes. Covers the common failure modes and how to avoid them.</p></li><li><p><a href="https://medium.com/data-science-collective/beyond-the-prototype-15-hard-earned-lessons-to-ship-production-ready-ai-agents-e58139d80299">Beyond the Prototype: 15 Hard-Earned Lessons</a>. 
Real lessons from shipping agents. Read this before you deploy.</p></li></ul><p><strong>Hands-on:</strong></p><ul><li><p><a href="https://www.youtube.com/watch?v=KC8HT0eWSGk&amp;feature=youtu.be">Build and Deploy AI Agents with Docker, FastAPI, LangChain</a>. Full walkthrough from code to deployed service.</p></li><li><p><a href="https://fastapi.tiangolo.com/">FastAPI docs</a>. Well-written docs. You can learn FastAPI just from these.</p></li><li><p><a href="https://www.youtube.com/watch?v=fqMOX6JJhGo">Docker for Beginners</a>. Containers aren&#8217;t optional anymore. Learn this.</p></li></ul><p><strong>Cloud:</strong></p><ul><li><p><a href="https://www.freecodecamp.org/news/deploy-an-ai-agent-with-amazon-bedrock/">Deploy an AI Agent with Amazon Bedrock</a>. AWS-specific but shows the managed service approach.</p></li><li><p><a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agents-deploy.html">AWS Bedrock Agent docs</a>. Reference for when you&#8217;re actually doing it.</p></li></ul><p><strong>Monitoring:</strong></p><ul><li><p><a href="https://smith.langchain.com/">LangSmith</a>. Built specifically for LangChain. Shows you exactly what your agent is doing.</p></li><li><p><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a>. Observability platform. Similar to LangSmith, but open-source.</p></li><li><p><a href="https://wandb.ai/">Weights &amp; Biases</a>. More general ML tracking. Good if you&#8217;re doing custom training.</p></li></ul><h3>When to move on</h3><p>You&#8217;ve deployed one agent end-to-end. Containerized, served via API, running in the cloud, with some monitoring. You can explain your setup and why you made the choices you did.</p><h2>Phase 8: Portfolio and Staying Current</h2><p><strong>Ongoing</strong></p><p>The field moves fast. What&#8217;s new today is standard in six months. You need habits, not a one-time study session. 
If you&#8217;re not learning continuously, you&#8217;re falling behind. Simple as that.</p><h3>Your portfolio</h3><p>Your portfolio is proof you can build. Not certificates. Proof.</p><p><strong>What makes it strong:</strong></p><ul><li><p>Deployed projects. Running systems, not just repos. Anyone can push code to GitHub.</p></li><li><p>Real problems solved. Not tutorial recreations. Something you actually needed or that solves a real pain.</p></li><li><p>Documented decisions. Why you built it that way. What tradeoffs you made.</p></li><li><p>Clean code. Shows you can work on a team.</p></li></ul><p>Aim for 2&#8211;3 solid projects, at least one deployed and accessible.</p><h3>Contributing</h3><p>Nothing shows competence like merged PRs. Contribute to LangChain, AutoGen, or smaller projects. Documentation fixes count. They&#8217;re undervalued and maintainers appreciate them.</p><h3>Staying current</h3><p>Set aside a few hours a week for this.</p><p><strong>Where to look:</strong></p><ul><li><p><a href="https://paperswithcode.com/">Trending Papers</a>. See what&#8217;s getting attention in the research community right now.</p></li><li><p><a href="https://spinningup.openai.com/en/latest/spinningup/keypapers.html">OpenAI: Key Papers in Deep RL</a>. Curated list of foundational papers. Good for building depth.</p></li></ul><p><strong>Who to follow:</strong></p><ul><li><p><a href="https://www.youtube.com/@AndrejKarpathy">Andrej Karpathy</a>. His YouTube tutorials explain complex things clearly.</p></li><li><p><a href="https://twitter.com/DrJimFan">Jim Fan</a>. Posts about embodied AI and agents.</p></li><li><p><a href="https://lilianweng.github.io/">Lilian Weng</a>. Her blog posts are better than most courses.</p></li><li><p><a href="https://simonwillison.net/">Simon Willison</a>. Builds with LLMs constantly, shares what works.</p></li><li><p><a href="https://twitter.com/swyx">Swyx</a>. 
Tracks what&#8217;s actually useful in AI engineering.</p></li><li><p><a href="https://www.anthropic.com/research">Anthropic&#8217;s research blog</a>. How frontier models actually work.</p></li></ul><h2>That&#8217;s It</h2><p>This is everything you need.</p><p>Not everything that exists. There&#8217;s always more. But everything you need to go from zero to building production agents.</p><p>A few things to keep in mind as you go:</p><ol><li><p><strong>Build constantly.</strong> Every phase should include projects. Watching and reading isn&#8217;t learning. Building is.</p></li><li><p><strong>Confusion is normal.</strong> If you&#8217;re never confused, you&#8217;re not pushing hard enough. The discomfort means you&#8217;re learning.</p></li><li><p><strong>Teach what you learn.</strong> Blog posts, videos, explaining to others. This is how you find out what you actually understand.</p></li><li><p><strong>Find people.</strong> Discord servers, meetups, LinkedIn. Learning alone is harder and lonelier.</p></li><li><p><strong>Be patient.</strong> 6&#8211;9 months is realistic. Some weeks you&#8217;ll feel great, others you&#8217;ll feel stuck. Both are normal. Keep going.</p></li></ol><p>The gap between people who &#8220;want to learn AI agents&#8221; and people who actually build them comes down to one thing: starting.</p><p>Not starting perfectly. Not having the ideal setup. Just starting.</p><p>Six months from now, you could have a portfolio of deployed agents, real skills that companies pay for, and the confidence that comes from actually building things. Or you could still be collecting bookmarks and waiting for the right moment.</p><p>The right moment is now. The field is young. The opportunities are real. And you have everything you need right here.</p><p>Scroll to Phase 1. Open the first resource. Start today.</p><div><hr></div><p><em>What&#8217;s your take on today&#8217;s topic? 
Do you agree, disagree, or is there something I missed?</em></p><div><hr></div><p>If you enjoyed this article, the ultimate compliment is to share our work.</p><div><hr></div><h2>Struggling to grow your audience as a Tech Professional?</h2><p>The Tech Audience Accelerator is the go-to newsletter for tech creators serious about growing their audience. 
You&#8217;ll get the proven frameworks, templates, and tactics behind my 30M+ impressions (and counting).</p><p><a href="https://techaudienceaccelerator.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web">The Tech Audience Accelerator</a>, by Paolo Perrone</p><div><hr></div><p><em>Thanks again to <a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Opik</a> for sponsoring the series and keeping it free!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Opik Banner&quot;,&quot;title&quot;:&quot;Opik Banner&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Opik Banner" title="Opik Banner" srcset="https://substackcdn.com/image/fetch/$s_!oSDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!oSDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c21863-4ee6-4026-91c7-74650eb16dac_3168x792.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://www.comet.com/site/?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul">Try Opik for free here</a> (25k spans/month free)</figcaption></figure></div><p><strong>If you want to monitor, evaluate and optimize your AI workflows and agents:</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul&quot;,&quot;text&quot;:&quot;Try Opik for free&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.comet.com/signup?utm_source=newsletter&amp;utm_medium=partner&amp;utm_campaign=paul"><span>Try Opik for free</span></a></p><div><hr></div><h2>Images</h2><p>If not otherwise stated, all images are created by the author.</p>]]></content:encoded></item></channel></rss>