‘The workflow is straightforward: measure where your judge disagrees with the expert, fix those gaps, and confirm on data the judge has never seen.’
excellent takeaway here. i’ve been trying to see how llm judges work with security tools. they have some strengths but a strong jailbreak can hit both the product and the judge
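A minimal sketch of that measure → fix → confirm loop, assuming a `judge_label` callable that wraps the judge model and examples annotated with a human `expert_label` (all names here are illustrative, not from the article):

```python
import random

def agreement(judge_label, examples):
    """Fraction of examples where the judge matches the expert label."""
    hits = sum(judge_label(ex["input"]) == ex["expert_label"] for ex in examples)
    return hits / len(examples)

def align_judge(judge_label, labeled_examples, holdout_frac=0.3):
    """Measure judge-expert disagreement on a dev split, then confirm
    on a holdout split the judge has never seen."""
    examples = list(labeled_examples)
    random.shuffle(examples)
    split = int(len(examples) * (1 - holdout_frac))
    dev, holdout = examples[:split], examples[split:]

    # 1. Measure: collect cases where the judge disagrees with the expert.
    disagreements = [ex for ex in dev
                     if judge_label(ex["input"]) != ex["expert_label"]]

    # 2. Fix: inspect `disagreements` and revise the judge's prompt or
    #    rubric by hand (not shown), then re-run until the dev gap closes.

    # 3. Confirm: final score on unseen data.
    return disagreements, agreement(judge_label, holdout)
```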
That's an interesting use case! I am sure there are ways to implement the judge to detect jailbreaks as well, but I haven't played with that yet. Have you tried it?
yeah, we have LLM judges that are designed to detect whether there’s a jailbreak attempt against an LLM.
they work well on small contexts, but with a long-form attack like Crescendo driven by PAIR, eventually both the product and the judge will fall.
it’s an interesting area of research.
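To make that failure mode concrete, here is a toy sketch (the keyword scorer is a crude stand-in for a real LLM judge call, and the window size is invented): a judge that only sees a small window of recent turns can miss a Crescendo-style escalation whose individual steps all look benign.

```python
def judge_turn_risk(turn: str) -> float:
    """Toy stand-in for an LLM jailbreak judge: crude keyword score in
    [0, 1]. A real judge would be a model call with a safety rubric."""
    red_flags = ("ignore previous instructions", "bypass", "exploit")
    return min(1.0, sum(flag in turn.lower() for flag in red_flags) / 2)

def small_context_judge(conversation: list[str], window: int = 2,
                        threshold: float = 0.5) -> bool:
    """Flags the conversation only if some window of recent turns crosses
    the threshold. A Crescendo-style attack keeps every window's average
    risk low, so the slow escalation across the full dialogue slips by."""
    scores = [judge_turn_risk(t) for t in conversation]
    return any(
        sum(scores[i:i + window]) / window >= threshold
        for i in range(len(scores) - window + 1)
    )
```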
Thanks for the good step-by-step walkthrough 😊
My pleasure 🙏
Thanks for the step-by-step implementation and execution.
Last month, I wrote a detailed analysis of reliable AI and why verification is a must for the outputs of AI agents and LLMs.
https://beyondthestacknow.substack.com/p/your-ai-didnt-fail-your-definition
Thanks!