Discussion about this post

ToxSec

Looks like a great set of course materials. I think LLM-judge evals, and the future of that science, are fascinating.

Mykola Kondratuk

The three-layer framing is the thing I keep trying to explain to teams: development evals, regression checks before merge, and production monitoring. They sound like the same thing until you actually build them and realize each layer catches completely different failure modes.

The part about trusting your LLM judges is where most teams quietly give up. Generic helpfulness metrics are easy to instrument and meaningless in practice. The hard work is writing evaluators grounded in what your product actually needs to do - and that requires somebody to first articulate what good looks like, which turns out to be a PM problem as much as an engineering one.

Looking forward to the production monitoring installment. That is where the interesting edge cases live.
