5 Comments
ToxSec

“AI Evaluators must assess the quality of LLM calls operating in a non-deterministic environment, often with unstructured data. Instead of writing unit and integration test cases, AI evals cases are operated as eval datasets, reflecting the AI-centric approach.”

glad you hammered that point. i see too many approaches still assuming deterministic-style controls at scale and wondering why they fail. great post!

Paul Iusztin

Thanks, man! Haha, yes. And this article is just the beginning. AI evals is such a large and complex problem.

ToxSec

absolutely, they totally are. 🔥 keeps us busy though!

Evangelos Evangelou

Really nice article. At last, someone is trying to put AI evals into practice and make them understandable to software engineers! I have a question regarding the evaluation iterations/cycles in optimization and regression scenarios.

Do classic tools like MLflow still have their place in experiment tracking? Or is this functionality covered entirely by the new AI-centric tools such as Opik & Langsmith?

This question applies mainly to small-scale LLM fine-tuning with a few hundred examples, rather than prompt engineering cases.

Paul Iusztin

That should be our new motto: "Making AI accessible to Software Engineers!" 😂

Regarding your question, when it comes to fine-tuning, yes, I still think classic tools like MLflow, W&B or Comet make a lot of sense. Whether you train an LLM or any other model, the process is very similar from an ops perspective. It's only when you need to monitor and evaluate an AI app as a whole (not just the model itself) that tools such as Opik or Langsmith come into play.
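
To make the split concrete, here is a minimal sketch of the experiment-tracking side only, assuming a fine-tuning job logged to MLflow. The experiment name, base model, hyperparameters, and metric values are hypothetical placeholders, and the app-level evaluation would live separately in a tool like Opik or Langsmith.

```python
# Minimal sketch: tracking a small LLM fine-tuning run with MLflow.
# All names and values below are illustrative placeholders.
import mlflow

mlflow.set_experiment("llm-fine-tuning")  # hypothetical experiment name

with mlflow.start_run(run_name="few-hundred-examples-sft"):
    # Log the training configuration, just like any other model training job.
    mlflow.log_params({
        "base_model": "some-open-llm-7b",  # placeholder
        "num_train_examples": 300,         # placeholder
        "learning_rate": 2e-5,
        "epochs": 3,
    })

    # ... run the actual fine-tuning loop here ...

    # Log metrics produced by the training/validation loop.
    mlflow.log_metric("train_loss", 0.42)  # placeholder value
    mlflow.log_metric("eval_loss", 0.51)   # placeholder value

    # Optionally attach the eval dataset or model card as artifacts, e.g.:
    # mlflow.log_artifact("eval_dataset.jsonl")
```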