6 Comments
Paul Iusztin:

Another great piece from you, Paolo!

Alejandro Aboy:

"How to Evaluate the Evaluator" is my favorite version of "watching the watchmen" - looking forward to that one!

Paul Iusztin:

Yes! I think it’s one of the most underrated steps in doing AI Evals. What is your approach to it?

Alejandro Aboy:

I've been working on some skills to cross-check trace scores and "calibrate" whether the evals are producing false positives, being too strict, or over-evaluating. Then I pay attention to the dashboards for a couple of days and deep dive again if needed. It's manual, but you become more aware of the reasons behind the scores. For now it's just /opik-calibrate, which has some API wrappers to fetch the evaluators and patch them again after discussing with Claude Code.
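(A minimal sketch of the calibration idea described above: compare the automated judge's verdicts against a small set of manually labeled traces to estimate whether the evaluator is too lenient or too strict. The Trace class and calibration_report helper are hypothetical illustrations, not the actual /opik-calibrate code or the Opik API.)

```python
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    judge_pass: bool   # verdict from the automated evaluator (LLM judge)
    human_pass: bool   # verdict from manual review of the same trace

def calibration_report(traces: list[Trace]) -> dict[str, float]:
    """Summarize agreement between the automated judge and manual labels."""
    n = len(traces)
    false_positives = sum(t.judge_pass and not t.human_pass for t in traces)
    false_negatives = sum(not t.judge_pass and t.human_pass for t in traces)
    agreement = sum(t.judge_pass == t.human_pass for t in traces)
    return {
        "agreement_rate": agreement / n,
        "false_positive_rate": false_positives / n,   # judge too lenient
        "false_negative_rate": false_negatives / n,   # judge too strict
    }

if __name__ == "__main__":
    sample = [
        Trace("t1", judge_pass=True, human_pass=True),
        Trace("t2", judge_pass=True, human_pass=False),
        Trace("t3", judge_pass=False, human_pass=True),
        Trace("t4", judge_pass=True, human_pass=True),
    ]
    print(calibration_report(sample))
```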

Paul Iusztin:

Nice! Yes, I also think manual review is best until you are 100% sure about an automation. Otherwise, you will stop trusting the AI evals layer.

ToxSec:

Fantastic "bringing it all together" section here. Really nice read.