Discussion about this post

ToxSec

“AI Evaluators must assess the quality of LLM calls operating in a non-deterministic environment, often with unstructured data. Instead of writing unit and integration test cases, AI evals cases are operated as eval datasets, reflecting the AI-centric approach.”

glad you hammered that point. i see too many approaches still assuming deterministic-style controls or scald and wondering why they fail. great post!

Yashi Gupta

"check if your “refund accuracy” metric improved compared to the baseline, tweak again, and repeat" -- how to check this? this post covered when to do evals, not how to do evals. do you have a post on that too?
