Paul: Today, the scene is owned by two of the sharpest minds in applied AI: Hugo and Hamel.
Hugo advises and teaches teams building LLM-powered systems, including engineers from Netflix, Meta, and the U.S. Air Force. Joining him is Hamel Husain, an AI engineer who has spent over a decade building and evaluating AI systems at giants like GitHub and Airbnb.
They’re not just practitioners; they’re also leading educators. Hugo runs a course on the complete LLM software development lifecycle, from retrieval and evaluation to agent design, while Hamel co-teaches a top-rated course on AI Evals.
So when they got together on Hugo’s podcast, Vanishing Gradients, the conversation shifted. Instead of championing rigorous evals, they focused on what actually goes wrong during development: the common mistakes, flawed approaches, and bad habits that hold AI projects back.
Enough talking, I’ll let them dig into today’s topic. ↓🎙️
P.S. You’re going to want to take notes on this one. 🤫
Hugo: Framed as “10 Things I Hate About AI Evals” (with a bonus 11th), the discussion is a masterclass in why building robust AI requires a shift from a traditional software engineering mindset to one grounded in data analysis and scientific rigor.
Below is a short and sweet summary of our conversation, with all the essential insights filtered and extracted. Enjoy!
Check out the full episode here:
👉 Hamel and Shreya Shankar (UC Berkeley) are teaching the final cohort for the year of their AI Evals course. It started on October 6th, but it’s not too late to join in. Since the first class is async, you can easily catch up on the recording. They’re giving readers of Decoding AI 35% off. Use this link.👈
Introduction: Why AI Software Demands a New Approach
The conversation begins by establishing a fundamental premise: AI-powered software is different from traditional software. Its stochastic nature means outputs are unpredictable (not to mention inputs!), rendering standard unit tests insufficient.
The outputs are unpredictable. And so how do you write tests for that? How do you measure if your application is behaving? And you can’t just apply your standard software engineering techniques... You have to bring some data literacy to bear... you need to analyze and look at data and reason about the behavior of a system in a data-driven way rather than asserting tests.
This data-driven approach is a skill set familiar to data scientists and ML engineers, but it requires adaptation for modern AI systems. Techniques like error analysis are more crucial than ever because they provide direct leverage; unlike traditional ML, where fixing an error might require complex retraining, fixing an LLM app issue can often be as simple as improving a prompt or retrieval strategy.
Getting Started: The Evals Flywheel
Before diving into the “hates,” Hamel outlines a systematic process for getting started with evals. It’s not about immediately writing automated tests but about building a foundation of understanding.
Start with Error Analysis: The first step is always to look at your data. Manually review your application’s traces to find where it’s failing. This qualitative, data-driven process informs what problems are most important to solve and what evals you should prioritize.
Choose Your Evals: Based on your findings, decide on the type of eval to write (a minimal sketch of both types follows this list).
Code-based Evals: Use for objective, assertable logic (e.g., “Does the output contain JSON?”).
LLM-as-Judge: Use for fuzzy, subjective assessments that require judgment (e.g., “Is the tone of this response helpful?”).
Validate Your Judge (Meta-Evaluation): If you use an LLM to judge another AI, you must first validate its reliability. This involves measuring the LLM judge’s performance against a set of human-labeled data to ensure you can trust its outputs.
Automate and Monitor: Once you have a set of trusted, automated evals, you can integrate them into your CI/CD pipeline and use them for production monitoring.
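To make the first two eval types above concrete, here is a minimal sketch, assuming a hypothetical customer-support app and the OpenAI Python client (any chat-completions client would work; the prompt, model name, and field names are illustrative, not Hamel’s exact setup):

```python
import json

from openai import OpenAI  # assumption: using the OpenAI Python client; swap in your own model client

# --- Code-based eval: objective, assertable logic ---
def contains_valid_json(output: str) -> bool:
    """Pass if the model output parses as JSON ("Does the output contain JSON?")."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# --- LLM-as-judge: fuzzy, subjective assessment ---
JUDGE_PROMPT = """You are reviewing a customer-support reply.

Question: {question}
Reply: {reply}

Is the tone of this reply helpful? Answer with exactly one word: PASS or FAIL."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_helpful_tone(question: str, reply: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge model says the reply's tone is helpful."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```

Note that the judge above is exactly the kind of component step 3 is about: you cannot trust its PASS/FAIL verdicts until you have validated them against human labels.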
11 Things I Hate About AI Evals
Here are the common failure modes Hamel sees teams fall into, which form the core of the conversation.
1. Using Generic Metrics & Off-the-Shelf Evals
Problem: Teams often start by implementing generic, off-the-shelf metrics like “hallucination score,” “toxicity score,” or “conciseness score” from third-party libraries or vendors. They plug their application’s traces into these systems and get a dashboard full of numbers that seem to measure quality.
Why it matters: These generic scores waste time and create a false sense of security. They are rarely correlated with the specific, high-priority problems your users are actually facing, leading teams to chase illusory targets that don’t improve the product.
Solution: Ground your evaluation in reality by starting with manual error analysis. Manually review a sample of your system’s outputs to identify the most frequent and severe failure modes for your specific use case. Use these qualitative findings to define custom metrics that track real problems. Generic metrics can be used cautiously as a tool for data exploration (for instance, sorting traces by a score to surface interesting examples) but should never be your primary measure of success.
These give you some kind of false security, and wasting your time is super expensive. And it’s very destructive, because you start chasing these illusory targets that don’t actually make your system better.
— Hamel Husain, Timestamp: 00:29:00
2. Outsourcing Data Review to Engineers
Problem: Product owners or stakeholders hand a set of requirements to the engineering team and expect them to build and validate the AI system alone. The domain experts who understand the nuances of the product’s goals are left out of the evaluation loop.
Why it matters: Engineers, however skilled, often lack the deep domain context to accurately judge the quality of an AI’s output. This disconnect leads to systems that are technically functional but fail to meet real-world user needs.
Solution: Embed domain experts directly into the evaluation process. They are essential for reviewing data, defining what “good” looks like, and even writing and refining prompts. To avoid getting bogged down by committee-based decisions during annotation, appoint a single, trusted expert as a “benevolent dictator” for a specific area. This streamlines feedback and ensures a consistent standard of quality.
3. Overeager Automation
Problem: Faced with the daunting task of reviewing data, teams jump straight to automation. They might try to use an LLM to automatically categorize all errors or set up algorithmic hill-climbing to optimize a metric without any human oversight.
Why it matters: Premature automation skips the most crucial step: thinking! It prevents the team from building a deep understanding of the problem space and often results in optimizing for a flawed metric or producing “slop” that doesn’t represent a true quality improvement.
Solution: Start manually. The initial goal of evaluation is to infuse your taste and judgment into the AI system, and that can only be done through hands-on review. Use error analysis to understand your system’s failure modes first. Once you have a deep, qualitative understanding of the problems and have validated your metrics against human judgment, only then should you begin to thoughtfully automate parts of the workflow.
4. Not Looking at the Data at All
Problem: This is the most common and damaging mistake. Teams dive into building, swapping out models, or tweaking frameworks without ever systematically looking at a sample of their system’s inputs and outputs (traces).
Why it matters: Without looking at the data, you are flying blind. Teams consistently miss obvious, low-hanging fruit problems and have no real intuition for why their system is underperforming.
Solution: Make manual data review the non-negotiable first step for any new feature or debugging effort. Before writing a single line of evaluation code, randomly sample 50-100 traces and read through them. Take detailed notes on what went wrong; this simple act of error analysis is the single most effective way to debug an AI system and prioritize your work (a minimal sampling sketch follows the quote below).
Every time we get contacted for a consulting project... the first thing I’ll do is I’ll say, ‘Okay, let’s take a look at your traces.’ And within a few hours, we always find very significant problems that are easily fixed.
— Hamel Husain, Timestamp: 00:38:45
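Here is that sampling sketch, assuming your app logs one JSON trace per line to a hypothetical traces.jsonl file (adapt the paths and fields to your own logging):

```python
import json
import random

TRACES_PATH = "traces.jsonl"  # hypothetical: wherever your app writes its traces
SAMPLE_SIZE = 75              # somewhere in the 50-100 range

with open(TRACES_PATH) as f:
    traces = [json.loads(line) for line in f]

random.seed(42)  # fixed seed so teammates can review the same sample
sample = random.sample(traces, min(SAMPLE_SIZE, len(traces)))

# Produce a simple review sheet: read each trace and jot open-ended notes on what went wrong.
with open("review_sheet.jsonl", "w") as f:
    for trace in sample:
        f.write(json.dumps({**trace, "reviewer_notes": ""}) + "\n")

print(f"Wrote {len(sample)} traces to review_sheet.jsonl - now read them, one by one.")
```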
5. Not Thinking Deeply About Prompts
Problem: Teams treat prompts as an afterthought. They are often written hastily, copy-pasted from examples, or completely obscured by a framework. The prompt, a critical piece of the system’s logic, goes unreviewed and unversioned.
Why it matters: A poorly written or hidden prompt is a primary source of failure and makes debugging nearly impossible. Garbage in, garbage out.
Solution: Treat your prompts with the same rigor as your application code. They should be clear, reviewed by both engineers and domain experts, and stored in version control. If you use a framework, make a habit of inspecting the final, compiled prompt that is actually sent to the model. You might be surprised by the complexity you’ve unknowingly adopted.
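One lightweight way to build that habit, sketched under the assumption that prompts live as plain-text templates in your repo (the file path and template fields below are hypothetical): render the final prompt yourself, log exactly what goes to the model, and tag every call with a prompt version.

```python
import hashlib
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)

# Hypothetical: the template is a reviewed, version-controlled file, not a string buried in a framework.
PROMPT_TEMPLATE = Path("prompts/support_reply.txt").read_text()
PROMPT_VERSION = hashlib.sha256(PROMPT_TEMPLATE.encode()).hexdigest()[:8]

def build_prompt(question: str, context: str) -> str:
    """Render the final prompt and log exactly what will be sent to the model."""
    prompt = PROMPT_TEMPLATE.format(question=question, context=context)
    logging.info("prompt_version=%s\n%s", PROMPT_VERSION, prompt)
    return prompt
```

If you use a framework instead, the equivalent habit is to log the compiled prompt it actually produces and store it alongside the trace.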
6. Relying on Noisy Dashboards
Problem: A team’s evaluation dashboard is cluttered with the generic, off-the-shelf metrics mentioned in the first point. It displays a dozen scores that are noisy, uncorrelated with user outcomes, and ultimately unactionable.
Why it matters: A noisy dashboard creates the illusion of measurement while providing zero signal. It makes it impossible to track meaningful progress and can lead the team to focus on irrelevant fluctuations.
Solution: Your dashboard should be simple and opinionated. It should feature a small number of custom metrics that directly track the critical failure modes you identified during error analysis. The goal is to answer one question: “Are we reducing our most important errors?” not to present a sea of context-free numbers.
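A minimal sketch of what could feed such a dashboard, assuming your annotated traces carry a list of failure-mode tags drawn from your error-analysis taxonomy (the tag names below are hypothetical):

```python
import json
from collections import Counter

FAILURE_MODES = ["missed_handoff", "wrong_citation", "ignored_user_constraint"]  # hypothetical taxonomy

with open("annotated_traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

counts = Counter(mode for t in traces for mode in t.get("failure_modes", []))

print(f"n = {len(traces)} reviewed traces")
for mode in FAILURE_MODES:
    print(f"{mode:>25}: {counts[mode] / len(traces):6.1%}")
```

Three error rates you understand and can act on are worth more than a dozen context-free vendor scores.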
7. Getting Stuck in Annotation Hell
Problem: The process of labeling data becomes a major bottleneck. The team either gets paralyzed by “analysis paralysis,” endlessly debating annotation guidelines by committee, or they use clunky, inefficient tools that make the process a painful chore.
Why it matters: If annotation is difficult, teams will avoid it. This friction pushes them right back into the trap of not looking at their data.
Solution: Aggressively remove friction from the annotation process. Use the “benevolent dictator” model to make decisive calls on edge cases. More importantly, invest in building simple, custom annotation tools tailored to your specific data. For example, if you’re evaluating email drafts, render them to look like actual emails. Make the experience of reviewing data as seamless as possible.
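“Simple, custom tool” can mean very little code. Here is a minimal terminal-based sketch for the email-draft example (file and field names are hypothetical; in practice you would render the draft so it looks like a real email):

```python
import json

# Hypothetical fields: "id", "input" (user request), "draft_email" (model output).
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

labels = []
for i, trace in enumerate(traces, start=1):
    print(f"\n=== Trace {i}/{len(traces)} ===")
    print("User request:", trace["input"])
    print("--- Draft email ---")
    print(trace["draft_email"])
    verdict = input("Pass or fail? [p/f] ").strip().lower()
    notes = input("Notes (what went wrong?): ").strip()
    labels.append({"trace_id": trace.get("id", i), "verdict": verdict, "notes": notes})

with open("labels.jsonl", "w") as f:
    for label in labels:
        f.write(json.dumps(label) + "\n")
```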
8. Endlessly Churning Through Tools
Problem: When faced with poor performance, a team’s first instinct is to swap out components. They’ll try a new vector database, migrate to a different agent framework, or test the latest foundation model, all without a clear, data-informed reason.
Why it matters: This “tool churning” is a form of procrastination that avoids the real work of diagnosing the root cause. It consumes significant time and resources while often failing to address the underlying problem.
Solution: Anchor every technical change to a data-driven hypothesis from your error analysis. Don’t switch your vector DB if your analysis shows the retrieved documents are fine but the LLM is failing to synthesize them correctly. Don’t switch models if your retrieval context is consistently missing the necessary information. Let the data guide your technical decisions.
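One way to ground that kind of hypothesis before touching any component, sketched against a hypothetical hand-labeled file: measure how often the retriever actually surfaces the document a human says contains the answer.

```python
import json

# Hypothetical format: each line records the query, the doc ids the retriever returned,
# and the doc id a human marked as containing the answer.
with open("retrieval_checks.jsonl") as f:
    examples = [json.loads(line) for line in f]

hits = sum(1 for ex in examples if ex["answer_doc_id"] in ex["retrieved_doc_ids"])
print(f"Retrieval hit rate: {hits / len(examples):.1%} over {len(examples)} labeled queries")

# High hit rate -> retrieval is fine and the failure is in synthesis; a new vector DB won't help.
# Low hit rate -> fix retrieval first; a new foundation model won't save you.
```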
9. Using an Ungrounded LLM-as-Judge
Problem: A team sets up an LLM to evaluate the output of their main application (an “LLM-as-judge”). They write a prompt for the judge and then trust its outputs implicitly, without any validation.
Why it matters: An uncalibrated LLM judge is an unreliable narrator. It can be biased, inconsistent, or just plain wrong. Optimizing your system against a faulty judge can easily make your product worse.
Solution: You must validate your LLM judge against human judgment. Create a small “golden set” of examples that have been carefully labeled by a human expert. Measure your LLM judge’s performance (e.g., accuracy, F1 score) against this ground truth. Iterate on the judge’s prompt, few-shot examples, and chain-of-thought reasoning until its judgments align closely with your human expert’s.
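A minimal sketch of that meta-evaluation step, assuming binary pass/fail labels and scikit-learn (the label arrays below are placeholders for your own golden set):

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholders: 1 = pass, 0 = fail, aligned example-by-example.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # expert's golden-set judgments
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]   # the LLM judge's verdicts on the same examples

print("Agreement with the human expert")
print(f"  accuracy:        {accuracy_score(human_labels, judge_labels):.2f}")
print(f"  F1 (fail class): {f1_score(human_labels, judge_labels, pos_label=0):.2f}")
# Iterate on the judge's prompt, few-shot examples, and reasoning until these align with the expert.
```

Scoring F1 on the fail class is one reasonable choice here, since catching failures is usually what you care about most.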
10. Lacking Intuition from Personal Use
Problem: Engineers who are building AI-powered products do not use AI tools extensively in their own daily work. Their interaction with AI is confined to the specific application they are building.
Why it matters: Without being a power user of AI yourself, you lack a deep, intuitive feel for its “jagged frontier”—the unpredictable mix of surprising competence and baffling failure. This intuition is critical for scoping projects, anticipating failure modes, and maintaining a healthy level of skepticism.
Solution: Cultivate a team culture where everyone is encouraged to use AI constantly. Use it for coding, writing emails, summarizing documents, and brainstorming. The hands-on experience of seeing AI succeed and fail in varied, low-stakes contexts is invaluable for building better products and motivating the need for rigorous evaluation.
11. Resisting a Data-Driven Mindset
Problem: This is the most fundamental challenge. Software engineers, trained in a world of deterministic logic and binary pass/fail tests, can be resistant to the statistical, data-centric mindset required for AI. They push back against qualitative analysis and ask for a simple, automated test suite that can give them a green light.
Why it matters: AI systems are stochastic, not deterministic. Applying a traditional software testing mindset is a recipe for frustration and failure. It’s impossible to “test” your way to a high-quality AI product without embracing data analysis.
Solution: Frame this work as “Data Science for AI.” Acknowledge that it’s a different paradigm. The best way to get buy-in is to show, not just tell. Run a short, time-boxed error analysis session and demonstrate how quickly you can uncover critical, actionable insights. Proving that this data-driven approach is the fastest path to improvement is the most effective way to foster adoption.
Key Takeaways: A Checklist for Actionable AI Evals
Hamel’s “hates” all point toward a more grounded, pragmatic, and data-centric approach. Instead of chasing silver bullets, effective AI evaluation is about systematic, investigative work.
Here’s a summary checklist:
Ditch generic metrics for custom evals based on error analysis.
Embed domain experts in your review process; appoint a “benevolent dictator.”
Start with manual data review before automating any part of your evals.
Make looking at traces the first step in any debugging or improvement cycle.
Treat prompts as code: review, version, and write them with care.
Build dashboards that track a few key, custom error rates, not noisy scores.
Streamline annotation with simple tools and clear ownership to remove friction.
Form a data-driven hypothesis before switching tools or models.
Calibrate every LLM-as-judge against a golden set of human labels.
Use AI tools daily to build personal intuition about their strengths and weaknesses.
Champion a “data science for AI” mindset; show that looking at data is the fastest path to improvement.
Check out this final clip to find out what Hamel LOVES most about evals and why we need to see the “revenge of the data scientist.”
In short: they’re genuinely fun, and the problems and fixes they uncover help people improve and maintain their products.
Conclusion
While the conversation focused on what can go wrong, the underlying message was optimistic. When done right, evaluation isn’t a chore; it’s a powerful and engaging process of discovery. As Hamel put it, we may be seeing the “revenge of the data scientist,” where the skills of data analysis, critical thinking, and statistical reasoning become the true differentiators in building successful AI products. The path to better AI isn’t through more complex tools, but through a deeper, more disciplined look at the data itself.
Until next time ✌️
👉 Hamel and Shreya Shankar (UC Berkeley) are teaching the final cohort for the year of their AI Evals course. It started on October 6th, but it’s not too late to join in. Since the first class is async, you can easily catch up on the recording. They’re giving readers of Decoding AI 35% off. Use this link.👈
What’s your take on today’s topic? Do you agree, disagree, or is there something I missed?
If you enjoyed this article, the ultimate compliment is to share our work.
Images
If not otherwise stated, all images are created by the author.
Copyrights
This article was originally published on Vanishing Gradients: original article.
Thanks for another great collab, Paul! I hope it resonates with your audience 🤗