‘For tasks with deterministic correct answers, you can use your system’s schema or rules to generate both the input and the ground truth’
makes sense. this was a good read ty :)
Yes! That's the only use case when you should do that. thanks man!
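The quoted idea about deterministic correct answers can be sketched in code. This is a toy illustration, not the article's implementation: it uses a simple date-formatting rule to generate both the input and the ground truth, so every eval case is correct by construction. All function names here are my own.

```python
import datetime

def generate_date_eval_cases(n: int = 3) -> list[dict]:
    """Rule-based eval generation: the same rule produces the input
    (a human-readable date) and the ground truth (its ISO form),
    so no manual labeling is needed."""
    cases = []
    base = datetime.date(2024, 1, 15)
    for i in range(n):
        d = base + datetime.timedelta(days=30 * i)
        cases.append({
            "input": d.strftime("%B %d, %Y"),  # e.g. "January 15, 2024"
            "expected": d.isoformat(),          # e.g. "2024-01-15"
        })
    return cases

cases = generate_date_eval_cases()
```

Because the ground truth is derived mechanically from the same rule as the input, these cases never need a domain expert's review, which is exactly why this pattern only applies to deterministic tasks.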
Shouldn’t the synthetic questions generated for a RAG system be reviewed by a domain expert? Sometimes the synthetic questions don’t make sense to a real user who knows the knowledge base. This happens even in the simple case where you generate a question from a single chunk.
But what about a more complex situation, where answering a question might require context from two chunks that sit in different parts of a document, or even in different documents? As an engineer, you don’t know whether a question generated from two random chunks is a valid question to ask. What’s your opinion on this?
Yes, you are totally right. For the domain expert part, we have a full article: https://www.decodingai.com/p/build-an-ai-evals-dataset-with-error-analysis
As for the chunks part: in this article I kept it super simple, but you can scale it. Create a set of chunks from multiple documents, sample from that set, and generate queries based on multiple chunks at once. Does that answer your question?
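The multi-chunk sampling described above can be sketched roughly like this. Everything here is an illustrative assumption on my part (function names, the prompt wording, the toy documents), not code from the article:

```python
import random

def build_chunk_pool(documents: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten {doc_id: [chunks]} into one pool of (doc_id, chunk) pairs."""
    return [(doc_id, chunk)
            for doc_id, chunks in documents.items()
            for chunk in chunks]

def sample_multi_chunk_context(pool, n_chunks=2, seed=None):
    """Pick n_chunks chunks at random; they may come from different documents."""
    rng = random.Random(seed)
    return rng.sample(pool, n_chunks)

def make_question_prompt(sampled) -> str:
    """Ask an LLM to write a question that requires ALL sampled chunks."""
    context = "\n\n".join(f"[{doc_id}] {chunk}" for doc_id, chunk in sampled)
    return ("Write one question that can only be answered by combining "
            "ALL of the following chunks:\n\n" + context)

docs = {
    "doc_a": ["Chunk A1 text.", "Chunk A2 text."],
    "doc_b": ["Chunk B1 text."],
}
pool = build_chunk_pool(docs)
sampled = sample_multi_chunk_context(pool, n_chunks=2, seed=42)
prompt = make_question_prompt(sampled)
```

Because the sample can span documents, this is also where the commenter's concern applies: some random chunk pairs won't yield sensible questions, so a reviewer (or a filtering step) is still worth keeping in the loop.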
Yes. Thank you! I have another question. 😁 How do you maintain the eval dataset when the documents change: new ones are added and existing ones are changed? Do you involve the domain expert at regular intervals?
Yes! The domain expert should constantly be present and in the loop. Evals are a cycle, not a one-time thing.
Thanks for the good read 😊