6 Comments
ToxSec:

Looks like a great set of course materials. I think LLM-judge evals, and the future of that science, are fascinating.

Paul Iusztin:

Thanks! Yes, it is 🤩

Mykola Kondratuk:

Honestly, still evolving. Right now: a Pytest-based regression suite that runs on every PR against a fixed golden dataset, a lightweight LLM judge for semantic correctness tied to specific acceptance criteria we wrote ourselves, and production monitoring via structured logging to catch distribution drift. The judge took the longest to get right - we went through three versions before the scores stopped feeling arbitrary. What made the difference was forcing ourselves to write failure examples first, not success cases.
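(A minimal sketch of the kind of PR-gated setup described above. The golden dataset, `agent_answer`, and the acceptance criteria are all invented placeholders; a real pipeline would load cases from a file and call an actual LLM judge rather than this deterministic stand-in.)

```python
# Hypothetical PR-gated regression suite: each golden case carries hand-written
# acceptance criteria, plus known failure patterns written *before* success cases.
GOLDEN_DATASET = [
    {
        "input": "What is our refund window?",
        "must_mention": ["30 days"],        # acceptance criteria
        "must_not_mention": ["no refunds"], # failure examples, written first
    },
]

def agent_answer(prompt: str) -> str:
    """Stand-in for the system under test (would call the real agent)."""
    return "Refunds are accepted within 30 days of purchase."

def judge(answer: str, case: dict) -> bool:
    """Deterministic stand-in for a lightweight LLM judge: pass only if every
    acceptance criterion appears and no known failure pattern does."""
    ok = all(s in answer for s in case["must_mention"])
    bad = any(s in answer for s in case["must_not_mention"])
    return ok and not bad

def test_golden_regression():
    # Runs on every PR; a single failing golden case blocks the merge.
    for case in GOLDEN_DATASET:
        assert judge(agent_answer(case["input"]), case), case["input"]
```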

Mykola Kondratuk:

The three-layer framing is the thing I keep trying to explain to teams - development evals, regression before merge, production monitoring. They sound like the same thing until you actually build them and realize each layer catches completely different failure modes.

The part about trusting your LLM judges is where most teams quietly give up. Generic helpfulness metrics are easy to instrument and meaningless in practice. The hard work is writing evaluators grounded in what your product actually needs to do - and that requires somebody to first articulate what good looks like, which turns out to be a PM problem as much as an engineering one.

Looking forward to the production monitoring installment. That is where the interesting edge cases live.

Paul Iusztin:

Yes, you got it right! It's tricky because all three layers look similar on the surface, which makes them hard to distinguish.

What does your AI evals system look like?

Mykola Kondratuk:

Fairly scrappy right now. I run evals at the agent task level - each agent has a set of expected outputs for known inputs, and I check regression before deploying config changes. No LLM judge yet, just deterministic checks and manual spot review.

The gap I keep running into is exactly what you describe in the series - my test cases were hand-crafted early on and do not reflect the distribution of real production inputs. The "grow your eval set organically from production data" approach you lay out is the thing I want to implement next.