AI Evals & Observability

Our LLM Judge Passed Everything. It Was Wrong.
Align your evaluator with human judgment, or don't trust it at all.
9 hrs ago • Paul Iusztin

How to Design Evaluators That Catch What Actually Breaks
The practical guide to code-based checks, LLM judges, and rubrics for real-world AI apps
Mar 3 • Paolo Perrone

Generate Synthetic Datasets for AI Evals
5 strategies from cold start to 450 diverse inputs in minutes
Feb 24 • Paul Iusztin

No Evals Dataset? Here's How to Build One from Scratch
Build evaluators that signal the problems users actually care about. A step-by-step guide.
Feb 17 • Paul Iusztin

Integrating AI Evals Into Your AI App
The holistic guide: From optimization to production monitoring
Feb 10 • Paul Iusztin

Behind the Scenes of AI Observability in Production
What actually works after 6 months of trial and error
Feb 3 • Alejandro Aboy

Stop Launching AI Apps Without This Framework
A practical guide to building an eval-driven loop for your LLM app using synthetic data, before you have users.
Oct 30, 2025 • Hugo Bowne-Anderson

Escaping POC Purgatory: Evaluation-Driven Development for AI Systems
A new software development life cycle for LLMs
Oct 16, 2025 • Hugo Bowne-Anderson and Stefan Krawczyk

The 5-Star Lie: You Are Doing AI Evals Wrong
Why binary evals are better than Likert scales
Sep 20, 2025 • Hamel Husain

The Mirage of Generic AI Metrics
Why off-the-shelf evals sabotage your AI product
Sep 13, 2025 • Hamel Husain