8 Comments
ToxSec

“AI Evaluators must assess the quality of LLM calls operating in a non-deterministic environment, often with unstructured data. Instead of writing unit and integration test cases, AI evals cases are operated as eval datasets, reflecting the AI-centric approach.”

glad you hammered that point. i see too many teams still assuming deterministic-style controls at scale and wondering why they fail. great post!

Paul Iusztin

Thanks, man! Haha, yes. And this article is just the beginning. AI evals is such a large and complex problem.

ToxSec

absolutely, they totally are. 🔥 keeps us busy though!

Yashi Gupta

"check if your “refund accuracy” metric improved compared to the baseline, tweak again, and repeat" -- how to check this? this post covered when to do evals, not how to do evals. do you have a post on that too?

Paul Iusztin

Yes, we cover everything about the "How" in the upcoming articles from the series. The next step is to build the evals dataset: https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis
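In the meantime, here is a minimal sketch of what that baseline check could look like. Everything here is hypothetical (the scorer, the sample data, and the threshold are illustrative, not from the series): you rerun the same eval dataset after each tweak and compare the new score to the stored baseline.

```python
# Hypothetical sketch: compare an eval metric against a stored baseline.
# Exact-match scoring stands in for whatever scorer you actually use
# (an LLM judge, fuzzy matching, etc.).

def refund_accuracy(outputs: list[str], expected: list[str]) -> float:
    """Fraction of eval cases where the app's answer matches the expectation."""
    correct = sum(1 for out, exp in zip(outputs, expected) if out == exp)
    return correct / len(expected)

def check_against_baseline(
    outputs: list[str],
    expected: list[str],
    baseline: float,
    min_delta: float = 0.0,
) -> tuple[float, bool]:
    """Return (score, improved) so a CI gate can fail a regressing change."""
    score = refund_accuracy(outputs, expected)
    return score, score >= baseline + min_delta

# Usage: rerun the same eval dataset after each tweak and compare.
score, improved = check_against_baseline(
    outputs=["full refund", "no refund", "partial refund"],
    expected=["full refund", "no refund", "full refund"],
    baseline=0.60,
)
```

The key design point is that the eval dataset stays fixed across runs, so score changes are attributable to your tweak rather than to the data.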

Evangelos Evangelou

Really nice article. At last, someone puts AI evals into practice and makes them understandable to software engineers! I have a question regarding the evaluation iterations/cycles in optimization and regression scenarios.

Do classic tools like MLflow still have their place in experiment tracking? Or is this functionality covered entirely by the new AI-centric tools such as Opik & Langsmith?

This question mainly concerns small-scale LLM fine-tuning with a few hundred examples, rather than prompt engineering cases.

Paul Iusztin

That should be our new motto: "Making AI accessible to Software Engineers!" 😂

Regarding your question, when it comes to fine-tuning, yes, I still think classic tools like MLflow, W&B or Comet make a lot of sense. Whether you train an LLM or another model, the process is very similar from an ops perspective. Only when you need to monitor and evaluate an AI app as a whole (not just the model itself) do tools such as Opik or Langsmith come into play.
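To make that split concrete, here is a library-free sketch of the two levels of tracking being contrasted. All names, parameters, and numbers are made up for illustration; real MLflow/Opik records carry much more detail:

```python
# Hypothetical illustration of the two tracking levels discussed above.
# All field names and values are invented for this sketch.

# 1) Training-run level: what experiment trackers in the MLflow/W&B/Comet
#    style record for a fine-tuning job (hyperparameters + run metrics).
fine_tune_run = {
    "params": {"base_model": "some-llm", "epochs": 3, "lr": 2e-5},
    "metrics": {"train_loss": 0.42, "eval_loss": 0.51},
}

# 2) App level: what AI-centric tools in the Opik/Langsmith style record
#    per request (the full trace through the app, plus eval scores).
app_trace = {
    "input": "Customer asks for a refund on order #123",
    "output": "Refund approved per policy",
    "spans": ["retrieve_policy", "llm_call", "format_reply"],
    "eval_scores": {"refund_accuracy": 1.0, "hallucination": 0.0},
}

def pick_tracker(record: dict) -> str:
    """Route a record to the right tool based on what it describes."""
    return "experiment-tracker" if "params" in record else "app-tracer"
```

The point of the routing function is just the design distinction: run-level hyperparameters and losses belong in a classic experiment tracker, while per-request traces and eval scores belong in an app-level observability tool.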

Bhavishya Pandit

This really resonates.

Most teams start with gut feel, and honestly, that’s fine in the beginning. But once things scale, that approach breaks down quickly.

What I liked here is the idea of evals being part of the system, not just something you run occasionally. That shift alone changes how you build.

One thing I’ve noticed is that if your evals don’t reflect real user expectations, you can improve metrics without actually improving the product.

At the end of the day, moving from “this feels better” to “this is better” is what actually makes AI reliable.