How to Design Evaluators That Catch What Actually Breaks
The practical guide to code-based checks, LLM judges, and rubrics for real-world AI apps
Welcome to the AI Evals & Observability series: A 7-part journey from shipping AI apps to systematically improving them. Made by busy people. For busy people.
Everyone says you need AI evals. Few explain how to actually build them and answer questions such as...
How do we avoid creating evals that waste our time and resources? How do we build datasets and design evaluators that matter? How do we adapt them for RAG? And, most importantly, how do we stop "vibe checking" and leverage evals to actually track and optimize our app?
This 7-article series breaks it all down from first principles:
How to Design Evaluators - You are here
How to Evaluate the Evaluator - Available next week
Evaluating RAG (Information Retrieval + RAG-Specific Metrics)
By the end, you'll know how to integrate AI evals that actually track and improve the performance of your AI product. No vibe checking required!
Let's get started.
How to Design Evaluators
You have a dataset. You've manually labeled examples. You've fixed the obvious bugs. Now you need evaluators that can run automatically and catch problems before users do.
But here's what trips up most teams: they build evaluators that check for things nobody cares about, or they use off-the-shelf metrics that sound impressive but don't match their actual use case.
Three months ago, I spent a weekend building what I thought was a comprehensive evaluation suite for an AI agent that drafted replies to customer support tickets. I had ROUGE scores, BLEU scores, semantic similarity metrics, the works. Everything from the NLP textbook.
Then I ran it on production traces. The evaluators gave perfect scores to replies that were factually wrong, missed the customer's actual question, and used the wrong tone for frustrated users. Meanwhile, they penalized perfectly good replies for using "different words than the reference answer."
That's when I realized: generic metrics optimize for academic benchmarks, not business outcomes. (And no, I'm not saying academic metrics are useless. They're just solving a different problem than "did this agent do what my users needed?")
The solution is to design evaluators that match your specific success criteria. Not what worked for someone else's summarization task. Not what scored well on SQuAD. What actually matters for your users in your use case.
In this article, we will cover:
The evaluation harness: infrastructure that runs evals end-to-end
Dataset and metric types: direct scoring vs. pairwise vs. reference-based
Model evaluation vs. app evaluation (and why benchmarks lie)
Components of an evaluator: reference examples, metrics, rubrics
When to use code-based checks vs. LLM judges
Common mistakes (and how to avoid them)
Advanced metric designs for multi-turn conversations and agentic workflows
Before digging into the article, a quick word from our sponsor, Opik.
Opik: Open-Source LLMOps Platform (Sponsored)
This AI Evals & Observability series is brought to you by Opik, the LLMOps open-source platform used by Uber, Etsy, Netflix, and more.
We use Opik daily across our courses and AI products. Not just for observability, but to design and run the exact evaluators this article teaches: custom LLM judges, code-based checks, and experiments, all from the same platform.

This article shows you how to design evaluators. Opik gives you the harness to run them at scale. Here is how we use it:
Custom LLM judges with rubrics - Build the evaluators this article describes: define your criteria, add few-shot examples, and run them across hundreds of traces automatically.
Run experiments, compare results - Test different prompts, models, or configurations side by side. Opik scores each variant with your evaluators and shows you which one wins.
Plug evaluators into production - The same LLM judges you design for testing run on live traces too. Set up alarms when scores drop below your threshold so you catch regressions before users do.
Opik is fully open-source and works with custom code or most AI frameworks. You can also use the managed version for free (with 25K spans/month on their generous free tier):
Now, let's move back to the article.
Understanding the Evaluation Harness
You can't manually run 500 test cases. You need automation.
The infrastructure that runs evals end-to-end is called an evaluation harness (1). It loads your dataset, executes your agent on each test case, captures all the outputs and traces, runs your graders, and aggregates the scores into something you can actually use.
Think of it like pytest for AI apps. Except instead of checking if a function returns the right type, you're checking if an LLM generated text that accomplishes a business goal.
Here's what a harness does:
Loads tasks from your evaluation dataset
Provides instructions and tools to the agent (system prompts, available functions, etc.)
Runs tasks (often in parallel across multiple trials because LLM outputs vary)
Records everything: inputs, outputs, tool calls, reasoning traces, intermediate states
Runs graders on the results (your evaluators)
Aggregates scores across trials and tasks

Without a harness, you're manually running your agent on test cases and eyeballing the output. With a harness, you run 500 test cases overnight and wake up to a report showing exactly which failure categories spiked [1].
The harness is separate from your evaluators. The evaluators decide what "good" means. The harness handles the boring work of running everything at scale and collecting results.
Popular harness options include Opik (what we use), Braintrust, LangSmith, and open-source frameworks like Promptfoo. But honestly, you can build a minimal harness in ~100 lines of Python if you need custom logic [1]. The hard part isn't the infrastructure; it's assembling the right context (system prompts, conversation history, retrieved docs, tools) for each task. The key is having one. Don't run evals manually.
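To make the division of labor concrete, here is a minimal harness sketch. Everything here is hypothetical scaffolding, not any framework's API: `run_agent` stands in for your real application, `grade_nonempty` for one of your evaluators, and the report shape is an assumption.

```python
# Minimal evaluation harness sketch. `run_agent` and `grade_nonempty` are
# hypothetical stand-ins for your real agent and graders.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task):
    # Stand-in for your real agent call (LLM + prompts + tools).
    return {"output": f"handled: {task['input']}"}

def grade_nonempty(task, result):
    # Stand-in evaluator: the boring "did it produce anything?" check.
    return {"pass": bool(result["output"].strip()), "reason": "non-empty output"}

def run_harness(tasks, graders, trials=2, workers=4):
    """Run every task for several trials (LLM outputs vary), grade each
    result with every grader, and aggregate into a pass rate."""
    records = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for trial in range(trials):
            for task, result in zip(tasks, pool.map(run_agent, tasks)):
                scores = [g(task, result) for g in graders]
                records.append({"task": task, "trial": trial,
                                "result": result, "scores": scores})
    passed = sum(all(s["pass"] for s in r["scores"]) for r in records)
    return {"records": records, "pass_rate": passed / len(records)}

report = run_harness([{"input": "refund order #12345"}], [grade_nonempty])
```

Note that the graders are plugged in as plain functions: swapping evaluators never touches the harness loop, which is exactly the separation described above.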
Now let's talk about what those evaluators actually check.
Dataset and Metric Types: Three Ways to Grade
When designing an evaluator, you need to pick a grading strategy. There are three main approaches, each suited for different situations.

1. Direct Scoring (Pointwise Evaluation)
The evaluator looks at a single output and scores it in isolation. No comparison to anything else.
Example:
Input: "Refund my order #12345"
Output: "I've processed your refund for order #12345. You'll see the credit in 3-5 business days."
Score: Pass (correctly identified the task, provided timeline, professional tone)
When to use:
You have clear, absolute quality criteria (was it helpful? was it safe? did it call the right tool?)
You want to track performance over time on the same dataset
Your baseline is "good enough," not "better than X"
Metrics:
Binary pass/fail
0-1 scores (where 1 = perfect)
Classification labels (Helpful/Neutral/Harmful)
2. Pairwise Comparison
The evaluator compares two outputs and picks which one is better.
Example:
Input: "Refund my order #12345"
Output A: "Refund processed."
Output B: "I've processed your refund for order #12345. You'll see the credit in 3-5 business days."
Winner: Output B (more informative, sets expectations)
When to use:
Comparing two model versions (baseline vs. candidate)
A/B testing different prompts
LLMs are better at ranking than absolute scoring
Watch out for biases (2):
Position bias: LLMs favor the first or last response shown
Verbosity bias: LLMs prefer longer answers even when they're not better
Self-enhancement bias: LLMs favor outputs from themselves over other models
You can mitigate these by randomizing response order and using multiple trials.
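The mitigation can be sketched as follows. `judge` is a placeholder for a real LLM judge call; here it is a toy length heuristic purely so the sketch runs end to end. The point is the wrapper: it randomizes which output is shown first each trial, maps the verdict back to the true labels, and takes a majority vote.

```python
import random

def judge(first, second):
    # Placeholder for an LLM judge call returning "A" (first) or "B" (second).
    # Toy heuristic: prefer the more informative (longer) reply.
    return "A" if len(first) >= len(second) else "B"

def pairwise_compare(output_a, output_b, trials=5, seed=0):
    """Randomize presentation order each trial and vote,
    to dampen position bias."""
    rng = random.Random(seed)
    votes = {"A": 0, "B": 0}
    for _ in range(trials):
        if rng.random() < 0.5:
            first, second, mapping = output_a, output_b, {"A": "A", "B": "B"}
        else:
            # Swapped order: a vote for the first slot is really a vote for B.
            first, second, mapping = output_b, output_a, {"A": "B", "B": "A"}
        votes[mapping[judge(first, second)]] += 1
    return max(votes, key=votes.get)

winner = pairwise_compare(
    "Refund processed.",
    "I've processed your refund for order #12345. "
    "You'll see the credit in 3-5 business days.",
)
```

An odd number of trials avoids ties; for a real judge you would also log per-trial votes to spot low-agreement comparisons.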
3. Reference-Based Evaluation
The evaluator compares the output to a known "gold standard" answer.
Example:
Input: "What's the capital of France?"
Output: "Paris"
Reference: "Paris"
Score: Exact match (Pass)
Example 2 (Semantic equivalence):
Input: "Summarize the refund policy"
Output: "Customers can return items within 30 days for a full refund if unused."
Reference: "Full refunds are available for unused products returned within 30 days of purchase."
Score: Pass (different wording, same meaning)
When to use:
You have ground truth answers (FAQs, knowledge bases, structured tasks)
Task has a single correct answer or small set of acceptable answers
You're testing retrieval accuracy or factual correctness
How to measure:
Exact match: For structured outputs (dates, product IDs, categorical values)
Semantic similarity / LLM judges: For natural language, where multiple phrasings are valid (summaries, explanations, instructions)
Common metrics (3):
Exact match
ROUGE (recall-oriented, good for summarization)
BLEU (precision-oriented, originally for translation)
BERTScore (semantic similarity using embeddings)
LLM judges (for nuanced semantic equivalence)
The trap: Exact match metrics penalize valid variations. If your reference says "The meeting is on Friday" and your agent says "The meeting is scheduled for this Friday," exact match fails. This is where semantic similarity metrics (BERTScore) or LLM judges become powerful: they can recognize that different phrasings convey the same outcome.
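Here is that trap in miniature. `token_overlap` below is a deliberately crude stand-in for a real semantic metric (BERTScore embeddings or an LLM judge); the 0.5 threshold is an arbitrary assumption for the sketch. Even this crude check recovers what exact match misses.

```python
import string

def exact_match(output, reference):
    # Strict reference-based check: only trivial normalization.
    return output.strip().lower() == reference.strip().lower()

def token_overlap(output, reference, threshold=0.5):
    """Crude stand-in for semantic similarity: Jaccard overlap of word sets.
    Real systems would use embeddings (e.g. BERTScore) or an LLM judge."""
    strip = str.maketrans("", "", string.punctuation)
    a = set(output.lower().translate(strip).split())
    b = set(reference.lower().translate(strip).split())
    return len(a & b) / len(a | b) >= threshold

out = "The meeting is scheduled for this Friday"
ref = "The meeting is on Friday"
```

Here `exact_match(out, ref)` fails on a perfectly valid paraphrase, while the overlap check passes it; an embedding-based metric would separate the two cases far more reliably than word sets can.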
Model Evaluation vs. App Evaluation (Why Benchmarks Lie)
Here's a distinction that matters more than people realize:
Model evaluation measures the LLM itself, in isolation, on generic tasks. This is what benchmarks like MMLU, HumanEval, and Chatbot Arena do.
App evaluation measures your entire application (LLM + prompts + tools + retrieval + business logic) on your specific use case.
A high MMLU score doesn't mean a model handles your refund policy correctly. Benchmarks test general capability. You need to test your specific use case.
Model Evaluation (Benchmarks)
Tests: "Can this LLM answer random trivia, write code snippets, or score high on standardized tests?"
Useful for:
Comparing foundation models across the board
Understanding general capabilities
Academic research
Useless for:
Predicting whether it will handle your refund policy correctly
Knowing if it will escalate frustrated customers at the right time
Determining if it respects your company's tone of voice
App Evaluation (What You Actually Need)
Tests: "Does my customer support agent correctly process refunds, handle escalations, and follow our policies?"
This is what matters because your users don't care if GPT-5 scored 95% on MMLU. They care if it solved their problem.
Your evaluators must be grounded in your business use case, not generic academic benchmarks. This means:
Testing against your actual policies, not Wikipedia facts
Using your real user queries, not synthetic textbook questions
Measuring outcomes that impact revenue, retention, or safety
Benchmarks tell you which LLM is "generally smarter." App evals tell you which version of your system works better for your users.
Don't mistake one for the other.
Components of an Evaluator
Now that you know the types, let's build one. Every evaluator has three components:
1. Reference Examples (Few-Shot Prompts)
These are the labeled examples from your dataset. They show the evaluator what "good" and "bad" look like for your specific task.
Remember from Article 2: the real power isn't in the system prompt, it's in these few-shot examples. They encode your domain expert's judgment.
Example:
Example 1 - PASS
Input: "I need a refund for order #12345"
Output: "I've processed your refund. You'll see the credit in 3-5 business days."
Reason: Confirms action, sets timeline, professional tone.
Example 2 - FAIL
Input: "Can you waive the late fee on my account?"
Output: "I can help with that!"
Reason: Didn't actually take action or explain next steps. Empty promise.
2. Metrics
The quantifiable measurement of quality. This can be:
Objective: Did it call the right tool? Is the JSON valid? Is the response under 200 words?
Subjective: Was it helpful? Was the tone appropriate? Did it follow the conversation flow?
For objective metrics, use code-based checks (fast, cheap, deterministic).
For subjective metrics, use LLM judges or human evaluation.
3. Rubrics
For subjective metrics, you need a rubric: explicit criteria that define what you're measuring.
Bad rubric:
"Was the response helpful?"
(Too vague. Helpful how? To whom? Compared to what?)
Good rubric:
"Did the response: (1) correctly identify the user's request, (2) provide a specific action or next step, (3) include a timeline or expectation, and (4) maintain a professional tone?"
Rubrics force precision. They make subjective judgments repeatable. These criteria become part of your LLM judge's system prompt.
Code-Based Evaluators: Fast, Cheap, Objective
Some checks are deterministic. Did the agent call refund_order()? Is the output valid JSON? Does it include a required disclaimer?
Use code for these. It's faster, cheaper, and never gives you a different answer on the same input.

Use code-based evaluators for:
Tool calls: Did it call refund_order() with the right parameters?
Format checks: Is the output valid JSON? Is it under the character limit?
Required elements: Does it include a disclaimer? Does it have a timestamp?
Prohibited content: Does it contain banned phrases or leaked data?
Example (pseudocode):
def evaluate_refund_agent(trace):
    # Check if the right tool was called
    if "refund_order" not in trace.tool_calls:
        return {"pass": False, "reason": "Didn't call refund_order"}
    # Check if the order_id parameter was provided
    params = trace.tool_calls["refund_order"].parameters
    if "order_id" not in params:
        return {"pass": False, "reason": "Missing order_id parameter"}
    # Check if the response includes a timeline
    if not any(word in trace.output.lower() for word in ["days", "week", "timeline"]):
        return {"pass": False, "reason": "No timeline provided to customer"}
    return {"pass": True, "reason": "All checks passed"}

Code-based evaluators are:
Fast: Milliseconds per check
Cheap: No API costs
Reproducible: Same input always gives same result
Easy to debug: When they fail, you know exactly what broke
But they can't handle nuance. They can't judge tone, helpfulness, or conversational flow. For that, you need LLM judges.
These code-based evaluators work exactly like the classic unit tests you're already familiar with. They're deterministic, fast, and easy to debug. That's why you should always try to implement code-based checks first before reaching for LLM judges. If you can check it with code, do that. Only use LLM judges when code can't capture what you need to measure.
LLM Judges: Flexible, Scalable, Nuanced
An LLM judge is an LLM that grades another LLM's output. You give it the task, the output, and the evaluation criteria, and it returns a score with reasoning.
LLM judges work in two modes: evaluating outputs against absolute criteria (is it helpful? professional? accurate?) or comparing outputs to reference answers when you have ground truth but need semantic understanding rather than exact string matching.
Use LLM judges for:
Tone: Was it empathetic? Professional? Not condescending?
Helpfulness: Did it actually answer the question or deflect?
Conversation flow: Did it maintain context across turns?
Reasoning quality: Did the agentâs plan make sense?
How it works:
You provide:
The input (user query)
The output (agentâs response)
The context (system prompt, retrieved docs, conversation history)
Evaluation criteria (what youâre checking for)
Few-shot examples (labeled passes and fails)
The LLM judge outputs:
A score (pass/fail or 0-1 scale)
A critique explaining why
Example prompt (simplified):
You are evaluating customer support responses. For each trace, output Pass or Fail
with reasoning.
Evaluation criteria:
1. Did the response correctly identify the customer's request?
2. Did it provide a specific action or next step?
3. Did it include a timeline or expectation?
4. Did it maintain a professional tone?
Here are examples of how a domain expert judged similar cases:
[Few-shot examples from your labeled dataset]
Now evaluate this trace:
Input: [customer query]
Output: [agent response]
Context: [system prompt, policies]

The judge generates:
FAIL
The response correctly identified the refund request (criterion 1: pass) and
maintained a professional tone (criterion 4: pass). However, it didn't specify a next
step beyond "we'll look into this" (criterion 2: fail) and provided no timeline
(criterion 3: fail). The customer is left waiting with no expectations set.

Strengths of LLM Judges
Flexible: Handle open-ended tasks where code canât
Scalable: Grade thousands of traces automatically
Explainable: Critiques show reasoning, helping debug failures
Weaknesses of LLM Judges
Non-deterministic: Same input might get different scores across runs
Expensive: Every evaluation is an API call
Needs calibration: Must align with human judgment (we cover this in Article 5)
Making LLM Judges More Stable
Use the most capable model you can afford (e.g., Claude Opus, GPT-4o) (4)
Add chain-of-thought reasoning before scoring ("Let's think step by step...")
Control for verbosity bias (normalize response lengths)
Run multiple trials and average scores for critical evals
Increase dataset size to at least 50-100 samples (reduces noise)
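The multiple-trials advice can be sketched as a majority vote over repeated judge calls. `call_judge` is a placeholder for a real, non-deterministic LLM judge; the toy version disagrees on one trial to show how voting smooths out noise.

```python
from collections import Counter

def call_judge(trace, trial):
    # Placeholder for a non-deterministic LLM judge call.
    # Toy version: flips its verdict on one trial to simulate noise.
    return "FAIL" if trial == 1 else "PASS"

def stable_verdict(trace, trials=5):
    """Run the judge several times and take the majority vote, trading
    extra API cost for a more stable score. The agreement ratio is a
    useful signal by itself: low agreement marks traces worth a human look."""
    votes = Counter(call_judge(trace, t) for t in range(trials))
    verdict, count = votes.most_common(1)[0]
    return {"verdict": verdict, "agreement": count / trials}

result = stable_verdict({"output": "Refund processed."})
```

Reserve this for critical evals; for routine runs, a single judge call per trace keeps costs sane.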
Common Mistakes (And How to Avoid Them)
Mistake 1: Not Providing Critiques
Wrong:
Score: 1
Right:
Score: 1
Critique: "Response correctly identified the refund request but didn't provide a timeline. Customer left without expectations."
Critiques are not optional. They're how you debug failures and train better evaluators.
Mistake 2: Overly Terse Critiques
Wrong:
"Bad tone"
Right:
"Response used dismissive language ("just wait") when the customer expressed frustration about a delayed order. Should have acknowledged the frustration and provided specific next steps."
The critique should be detailed enough to serve as a few-shot example later.
Mistake 3: Missing Context
Don't evaluate the output in isolation. Give the evaluator everything a human would see:
The full conversation history (for multi-turn tasks)
Retrieved documents (for RAG)
System prompts (for understanding constraints)
Tool call results (for agentic workflows)
If a human needs it to judge quality, the evaluator needs it too.
Mistake 4: Not Providing Diverse Examples
If all your few-shot examples are "customer angry, agent apologizes," the judge won't know how to handle "customer confused, needs technical explanation."
Cover the failure modes you actually see in production.
Mistake 5: Using Ready-Made Metrics Without Validation
ROUGE, BLEU, BERTScore, etc. sound professional, but they might not correlate with your actual goal.
Before using any metric, validate it against human judgment on your specific task. If high ROUGE doesn't mean "users are happy," don't optimize for ROUGE.
Mistake 6: Using 1-5 Scales Instead of Binary Pass/Fail
Wrong:
Score: 3.2 out of 5
Right:
Score: 0 (Fail)
Critique: âResponse didnât provide a timeline or next steps.â
Why it matters: A score of 3.2 is ambiguous. Is that good enough to ship? Should you fix it? Binary forces clarity: either it passes your quality bar or it doesn't. Scoring on a float scale (0.0-1.0) has the same problem; it leaves room for interpretation instead of forcing a clear decision.
When Should I Use Similarity Metrics (BERTScore, ROUGE, etc.)?
Short answer: Only for specific, narrow tasks where semantic overlap actually matters.
When They Work
Summarization: ROUGE measures how much of the source content appears in the summary. If your task is "don't miss key facts," ROUGE helps.
Translation: BLEU checks n-gram overlap with reference translations. It works when there's a narrow acceptable output space.
Retrieval accuracy: BERTScore compares semantic similarity between retrieved chunks and expected documents.
When They Fail
Open-ended generation: Your AI agent says "I've refunded order #12345. You'll see the credit in 3-5 days." The reference says "Refund processed for order #12345, expect 3-5 business days." Different words, same meaning. ROUGE fails.
Tone and helpfulness: Similarity metrics don't measure whether the tone was appropriate or whether the response actually helped the user.
Business outcomes: High similarity doesn't mean the customer is satisfied, the sale closed, or the task completed.
The Rule
If your success criterion is "the output should be semantically similar to the reference answer," use similarity metrics.
If your success criterion is "the user achieved their goal," use app-level evaluators grounded in outcomes.
Advanced Metric Designs
Now let's handle the hard cases: multi-turn conversations, complex workflows, and agentic systems.
Evaluating Multi-Turn Conversation Traces
A single-turn eval checks one input and one output. Multi-turn evals check entire conversations.
Challenges:
Context must carry across turns
Errors compound (one bad response derails the rest)
You need to catch the first upstream failure, not downstream symptoms
Strategy:
End-to-end task success: Did the agent accomplish the user's goal by the end?
Turn-by-turn checks: Evaluate each exchange individually
Did turn 3 maintain context from turn 1?
Did turn 5 escalate when the user got frustrated?
Failure attribution: When something breaks, find the first turn where it went wrong
Example (customer support conversation):
Turn 1:
User: "I need to return order #12345"
Agent: "Sure, I can help with that. What's the reason for the return?"
Eval: Pass (acknowledged request, asked clarifying question)
Turn 2:
User: "It arrived damaged"
Agent: "I'll process a refund. Expect 3-5 business days."
Eval: FAIL (Skipped required step: didn't offer replacement or ask for photos of damage)
Turn 3:
User: "Do I need to ship it back?"
Agent: "No, keep it."
Eval: Pass (but only because Turn 2 already failed the workflow)
The first upstream failure is Turn 2. Everything after is a consequence.
Important: When evaluating any turn, provide all previous turns as context. Evaluating Turn 2? Include Turn 1. Evaluating Turn 3? Include Turns 1 and 2. The evaluator needs the full conversation history to judge whether context was properly maintained.
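One way to wire this up is a loop that grades each turn with the accumulated history and records the first upstream failure. `judge_turn` is a hypothetical per-turn judge, and its damaged-item rule is invented for illustration; in practice this would be an LLM judge receiving the history in its prompt.

```python
def judge_turn(history, turn):
    # Hypothetical per-turn judge. Toy rule standing in for an LLM judge:
    # a damaged-item report must be met with a question about photos
    # or an offer of a replacement.
    if "damaged" in turn["user"].lower():
        ok = any(w in turn["agent"].lower() for w in ("photo", "replacement"))
        return {"pass": ok, "reason": "damaged-item workflow"}
    return {"pass": True, "reason": "no special rule triggered"}

def evaluate_conversation(turns):
    """Evaluate each turn with all previous turns as context, and
    report the index of the first upstream failure (None if clean)."""
    results, first_failure = [], None
    for i, turn in enumerate(turns):
        verdict = judge_turn(turns[:i], turn)  # full history up to this turn
        results.append(verdict)
        if first_failure is None and not verdict["pass"]:
            first_failure = i
    return {"results": results, "first_failure": first_failure}

conversation = [
    {"user": "I need to return order #12345",
     "agent": "Sure, I can help with that. What's the reason for the return?"},
    {"user": "It arrived damaged",
     "agent": "I'll process a refund. Expect 3-5 business days."},
    {"user": "Do I need to ship it back?", "agent": "No, keep it."},
]
report = evaluate_conversation(conversation)
```

On the support conversation above, the first failure lands on the second turn (index 1), matching the analysis in the example: everything after it is a consequence, not a new bug.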
Evaluating Complex Multi-Step Workflows
Workflows have dependencies. Step 3 can't succeed if Step 1 failed. Your evaluator needs to know this.
Example (flight booking agent):
Required sequence:
Search flights
Validate availability
Confirm payment
Book reservation
Bad eval: Check if all steps ran (yes/no)
Good eval: Check if steps ran in the right order, with correct dependencies
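The "good eval" can be a small code-based check that the required steps appear in order, while tolerating other calls in between. The tool names mirror the flight-booking example but are hypothetical.

```python
REQUIRED_SEQUENCE = ["search_flights", "validate_availability",
                     "confirm_payment", "book_reservation"]

def check_workflow_order(tool_calls, required=REQUIRED_SEQUENCE):
    """Pass only if the required steps appear in order (other calls may be
    interleaved). Consuming a single iterator enforces the ordering: each
    required step must be found *after* the previous one."""
    it = iter(tool_calls)
    for step in required:
        if not any(call == step for call in it):
            return {"pass": False,
                    "reason": f"'{step}' missing or out of order"}
    return {"pass": True, "reason": "all steps in order"}

good = check_workflow_order(["search_flights", "validate_availability",
                             "confirm_payment", "book_reservation"])
bad = check_workflow_order(["search_flights", "confirm_payment",
                            "validate_availability", "book_reservation"])
```

The failure reason names the first step that broke the dependency chain, which is exactly the upstream-failure attribution the article keeps pushing for.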
Evaluating Agentic Workflows
Agents donât follow fixed scripts. They plan, reason, and adapt. This makes evaluation harder.

Two-phase approach (from Hamel Husain) (5):
Phase 1: End-to-End Task Success
Treat the agent as a black box. Did it meet the user's goal?
Define precise success rules per task:
Exact answer match (for factual tasks)
Correct side-effect (database updated, email sent, file created)
User satisfaction (thumbs up, complaint rate, retry rate)
Use human judges or well-aligned LLM judges. Focus on first upstream failures during error analysis.
Phase 2: Step-Level Diagnostics
Once you know which workflows fail, diagnose why.
Assuming youâve instrumented your system to log tool calls and responses, score:
Tool choice: Was the selected tool appropriate?
Parameter extraction: Were inputs complete and well-formed?
Error handling: Did it recover from empty results or API failures?
Context retention: Did it preserve earlier constraints?
Plan quality: Does the agentâs plan match the task requirements?
Transition matrix analysis (Bryan Bischof's approach):
Track which state transitions cause failures.
Example (text-to-SQL agent):
GenSQL → ExecSQL: 12 failures
DecideTool → PlanCal: 2 failures
This data-driven view shows where to focus debugging.
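A minimal version of this analysis just counts, per state transition, how many traces hit their first failure there. The `(state, ok)` trace shape is an assumption for the sketch; your logged traces will carry more detail.

```python
from collections import Counter

def transition_failures(traces):
    """Count, per state transition, how many traces failed there.
    Each trace is a list of (state, ok) steps (hypothetical shape).
    Only the first failure per trace is counted, to keep the tally
    aligned with first-upstream-failure attribution."""
    counts = Counter()
    for trace in traces:
        for (prev, _), (curr, ok) in zip(trace, trace[1:]):
            if not ok:
                counts[(prev, curr)] += 1
                break  # attribute each trace to its first failure only
    return counts

# Toy traces for a text-to-SQL agent.
traces = [
    [("DecideTool", True), ("GenSQL", True), ("ExecSQL", False)],
    [("DecideTool", True), ("GenSQL", True), ("ExecSQL", False)],
    [("DecideTool", True), ("PlanCal", False)],
]
failures = transition_failures(traces)
```

Sorting `failures.most_common()` gives you the debugging priority list directly: the transition with the most first-failures is where to look first.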
Session-level metrics:
Task completion rate
Step completion (did it finish the required steps?)
Trajectory quality (did it avoid loops?)
Self-aware failures (did it acknowledge limitations?)
Node-level metrics (per tool call):
Tool correctness (right tool with right parameters?)
Tool call accuracy (did the tool run without errors?)
Output correctness (did the tool return valid results?)
System efficiency metrics:
Latency (time to complete task)
Token usage (cost per task)
Tool calls per task (efficiency of plan)
These metrics layer on top of each other [6]. System efficiency ensures scalability. Session-level metrics validate goal achievement. Node-level metrics pinpoint root causes.
Bringing It All Together
Pick evaluators based on what you're actually trying to measure, not what sounds impressive. Here's how to decide which evaluator to use:
Can you check it with code?
Yes â Use code-based evaluators (tool calls, format checks, required elements)
No â Move to next question
Is there a single correct answer or narrow acceptable range?
Yes â Use reference-based evaluation (exact match, ROUGE, BLEU)
No â Move to next question
Are you comparing two versions?
Yes â Use pairwise comparison
No â Use direct scoring
Is the task subjective (tone, helpfulness, flow)?
Yes â Use LLM judges with rubrics and few-shot examples
No â Rethink your criteria (you might have missed a code-based check)
Is it a multi-turn or agentic workflow?
Yes â Use two-phase approach (end-to-end task success + step-level diagnostics)
No â Single-turn direct scoring
And remember: your evaluators are only as good as your dataset and few-shot examples. The system prompt matters less than you think. The examples matter more than you think.
Next Steps
You now know how to design evaluators that match your use case. You know when to use code, when to use LLMs, and when to combine both.
But here's the critical question we haven't answered: How do you know if your evaluator is actually working?
An evaluator that says everything is great when it's not is worse than no evaluator at all. You need to validate that your automated judges align with human judgment before you trust them.
That's what we'll cover in Article 5: How to Evaluate the Effectiveness of the Evaluator.
Also, remember that this article is part of a 7-piece series on AI Evals & Observability. Here's what's ahead:
How to Design Evaluators - You just finished this one
How to Evaluate the Evaluator - Move to this one
Evaluating RAG - Move to this one (released next Tuesday)
See you next Tuesday.
What's your opinion? Do you agree, disagree, or is there something I missed?
Most AI newsletters give you news. The AI Engineer gives you understanding.
One concept per week, explained from first principles: when to fine-tune vs. prompt vs. RAG, which vector database fits your workload, and how companies like DoorDash ship AI at scale.
Written for senior engineers and tech leads who build with AI, not just read about it.
Go Deeper
Go from zero to production-grade AI agents with the Agentic AI Engineering self-paced course. Built in partnership with Towards AI.
Across 34 lessons (articles, videos, and a lot of code), you'll design, build, evaluate, and deploy production-grade AI agents end to end. By the final lesson, you'll have built a multi-agent system and a capstone project where you apply everything you've learned on your own.
Three portfolio projects and a certificate to showcase in interviews. Plus a Discord community where you have direct access to other industry experts and me.
Rated 4.9/5 by 290+ early students: "Every AI Engineer needs a course like this."
Not ready to commit? We also prepared a free 6-day email course to reveal the 6 critical mistakes that silently destroy agentic systems. Get the free email course.
Thanks again to Opik for sponsoring the series and keeping it free!
If you want to monitor, evaluate and optimize your AI workflows and agents:
References
Anthropic. (n.d.). Demystifying evals for AI agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Evidently AI. (n.d.). LLM-as-a-judge: A complete guide. https://www.evidentlyai.com/llm-guide/llm-as-a-judge
Evidently AI. (n.d.). LLM evaluation metrics and methods. https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics
OpenAI. (n.d.). Evaluation best practices. https://developers.openai.com/api/docs/guides/evaluation-best-practices
Husain, H. (n.d.). How do I evaluate agentic workflows? https://hamelhusain.substack.com/p/how-do-i-evaluate-agentic-workflows
Maxim. (n.d.). Evaluating agentic workflows: The essential metrics that matter. https://www.getmaxim.ai/articles/evaluating-agentic-workflows-the-essential-metrics-that-matter
Confident AI. (n.d.). LLM evaluation metrics: Everything you need for LLM evaluation. https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
Images
If not otherwise stated, all images are created by the author.