6 Comments
Paul Iusztin:

Another great piece from you, Paolo!

Alejandro Aboy:

"How to Evaluate the Evaluator" is my favorite version of "watching the watchmen" - looking forward to that one!

Paul Iusztin:

Yes! I think it’s one of the most underrated steps in doing AI Evals. What is your approach to it?

Alejandro Aboy:

I've been working on some skills to cross-check trace scores and "calibrate" whether the evals are producing false positives, being too strict, or over-evaluating. Then I pay attention to the dashboards for a couple of days and deep dive again if needed. It's manual, but you become more aware of the reasons behind the scores. For now it's just /opik-calibrate, which has some API wrappers to fetch the evaluators and patch them again after discussing with Claude Code.
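(A minimal sketch of the calibration idea described above: compare the automated judge's verdicts against a small set of manually labeled traces to estimate whether the evaluator is too lenient or too strict. The Trace class and calibration_report helper are hypothetical illustrations, not the actual /opik-calibrate code or the Opik API.)

```python
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    judge_pass: bool   # verdict from the automated evaluator (LLM judge)
    human_pass: bool   # verdict from manual review of the same trace

def calibration_report(traces: list[Trace]) -> dict[str, float]:
    """Summarize agreement between the automated judge and manual labels."""
    n = len(traces)
    false_positives = sum(t.judge_pass and not t.human_pass for t in traces)
    false_negatives = sum(not t.judge_pass and t.human_pass for t in traces)
    agreement = sum(t.judge_pass == t.human_pass for t in traces)
    return {
        "agreement_rate": agreement / n,
        "false_positive_rate": false_positives / n,   # judge too lenient
        "false_negative_rate": false_negatives / n,   # judge too strict
    }

if __name__ == "__main__":
    sample = [
        Trace("t1", judge_pass=True, human_pass=True),
        Trace("t2", judge_pass=True, human_pass=False),
        Trace("t3", judge_pass=False, human_pass=True),
        Trace("t4", judge_pass=True, human_pass=True),
    ]
    print(calibration_report(sample))
```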

Paul Iusztin:

Nice! Yes, I also think manual review is best until you are 100% sure about an automation. Otherwise, you will stop trusting the AI evals layer.

ToxSec:

Fantastic "bringing it all together" section here. Really nice read.