7 Comments
ToxSec

‘The workflow is straightforward: measure where your judge disagrees with the expert, fix those gaps, and confirm on data the judge has never seen.’

excellent takeaway here. i’ve been trying to see how llm judges work with security tools. they have some strengths but a strong jailbreak can hit both the product and the judge
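The quoted workflow (measure judge/expert disagreement, fix the gaps, confirm on unseen data) can be sketched in a few lines. This is only an illustrative sketch, not the article's implementation; the label lists and names (`agreement`, `dev_judge`, `test_judge`) are assumptions for the example.

```python
# Minimal sketch of the quoted workflow, assuming the judge's and the
# expert's verdicts are available as parallel lists of pass/fail labels.
# All names and data here are illustrative.

def agreement(judge_labels, expert_labels):
    """Fraction of examples where the judge matches the expert."""
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)

# Step 1: measure disagreement on a labelled dev set.
dev_judge = ["pass", "fail", "pass", "pass"]
dev_expert = ["pass", "fail", "fail", "pass"]
print(f"dev agreement: {agreement(dev_judge, dev_expert):.2f}")  # 0.75

# Step 2: after revising the judge's prompt or criteria, confirm on
# data the judge has never seen (a held-out test set).
test_judge = ["fail", "pass", "pass"]
test_expert = ["fail", "pass", "pass"]
print(f"test agreement: {agreement(test_judge, test_expert):.2f}")  # 1.00
```

The key point is that the confirmation in step 2 must use examples the judge was never tuned on, otherwise the agreement score just reflects overfitting to the dev set.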

Paul Iusztin

That's an interesting use case! I am sure there are ways to implement the judge to detect jailbreaks as well, but I haven't played with that yet. Have you tried it?

ToxSec

yeah, we have llm judges designed to detect jailbreaks against llms.

they work fine with small contexts, but if you use a long-form attack like Crescendo combined with PAIR, the product and the judge will eventually both fall.

it’s an interesting area of research.

Meenakshi NavamaniAvadaiappan

Thanks for the step-by-step walkthrough 😊

Paul Iusztin

My pleasure 🙏

Pradeep Gupta

Thanks for the step-by-step implementation and execution.

Last month, I published a detailed analysis of reliable AI, on why verification is a must for the outputs of AI agents and LLMs.

https://beyondthestacknow.substack.com/p/your-ai-didnt-fail-your-definition