‘The workflow is straightforward: measure where your judge disagrees with the expert, fix those gaps, and confirm on data the judge has never seen.’
excellent takeaway here. i’ve been trying to see how llm judges work with security tools. they have some strengths but a strong jailbreak can hit both the product and the judge
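A minimal sketch of that measure → fix → confirm loop, assuming a `judge_label` callable that wraps the judge model and examples annotated with a human `expert_label` (all names here are illustrative, not from the article):

```python
import random

def agreement(judge_label, examples):
    """Fraction of examples where the judge matches the expert label."""
    hits = sum(judge_label(ex["input"]) == ex["expert_label"] for ex in examples)
    return hits / len(examples)

def align_judge(judge_label, labeled_examples, holdout_frac=0.3):
    """Measure judge-expert disagreement on a dev split, then confirm
    on a holdout split the judge has never seen."""
    examples = list(labeled_examples)
    random.shuffle(examples)
    split = int(len(examples) * (1 - holdout_frac))
    dev, holdout = examples[:split], examples[split:]

    # 1. Measure: collect cases where the judge disagrees with the expert.
    disagreements = [ex for ex in dev
                     if judge_label(ex["input"]) != ex["expert_label"]]

    # 2. Fix: inspect `disagreements` and revise the judge's prompt or
    #    rubric by hand (not shown), then re-run until the dev gap closes.

    # 3. Confirm: final score on unseen data.
    return disagreements, agreement(judge_label, holdout)
```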
That's an interesting use case! I am sure there are ways to implement the judge to detect jailbreaks as well, but I haven't played with that yet. Have you tried it?
yeah, we have LLM judges that are designed to detect whether there’s a jailbreak attempt against an LLM.
they work well on small contexts, but with a long-form attack like Crescendo driven by PAIR, eventually both the product and the judge will fall.
it’s an interesting area of research.
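To make that failure mode concrete, here is a toy sketch (the keyword scorer is a crude stand-in for a real LLM judge call, and the window size is invented): a judge that only sees a small window of recent turns can miss a Crescendo-style escalation whose individual steps all look benign.

```python
def judge_turn_risk(turn: str) -> float:
    """Toy stand-in for an LLM jailbreak judge: crude keyword score in
    [0, 1]. A real judge would be a model call with a safety rubric."""
    red_flags = ("ignore previous instructions", "bypass", "exploit")
    return min(1.0, sum(flag in turn.lower() for flag in red_flags) / 2)

def small_context_judge(conversation: list[str], window: int = 2,
                        threshold: float = 0.5) -> bool:
    """Flags the conversation only if some window of recent turns crosses
    the threshold. A Crescendo-style attack keeps every window's average
    risk low, so the slow escalation across the full dialogue slips by."""
    scores = [judge_turn_risk(t) for t in conversation]
    return any(
        sum(scores[i:i + window]) / window >= threshold
        for i in range(len(scores) - window + 1)
    )
```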
Thanks for the good step-by-step walkthrough 😊
My pleasure 🙏
Thanks for the step-by-step implementation and execution.
Last month, I wrote a detailed analysis of reliable AI and why verification is a must for the outputs of AI agents and LLMs.
https://beyondthestacknow.substack.com/p/your-ai-didnt-fail-your-definition
Thanks!