Evaluation of RAG using Azure Prompt Flow
Up to this point, we have discussed the development of resilient RAG applications. But how can we determine whether these applications are functioning as expected and whether the context they retrieve is relevant? Manual validation, comparing the responses generated by LLMs against ground truth, is possible, but it is labor-intensive, costly, and difficult to execute on a large scale. Consequently, it is essential to explore methodologies that enable automated, large-scale evaluation. Recent research has explored the idea of using an “LLM as a judge” to assess output, a strategy that Azure Prompt Flow incorporates within its offerings.
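To make the “LLM as a judge” idea concrete, here is a minimal sketch of a groundedness judge built directly on the Azure OpenAI SDK. This is not Prompt Flow’s built-in evaluator: the metaprompt wording, the deployment name, and the environment variable names are illustrative assumptions.

```python
# Minimal "LLM as a judge" sketch (not Prompt Flow's built-in evaluator).
# Assumes an Azure OpenAI chat deployment; the deployment name "gpt-4o" and
# the environment variable names below are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

JUDGE_PROMPT = """You are an evaluator. Given a CONTEXT and an ANSWER, rate how
well the ANSWER is grounded in the CONTEXT on a scale of 1 (completely
fabricated) to 5 (fully supported). Reply with the number only.

CONTEXT:
{context}

ANSWER:
{answer}
"""

def judge_groundedness(context: str, answer: str, deployment: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-5 groundedness score."""
    response = client.chat.completions.create(
        model=deployment,  # for Azure OpenAI, this is the deployment name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Example: score a single (context, answer) pair
# print(judge_groundedness("Contoso's return window is 30 days.",
#                          "You can return items within 30 days."))
```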
Azure Prompt Flow provides built-in, structured metaprompt templates with comprehensive guardrails to evaluate your output against ground truth. The following four metrics can help you evaluate your RAG solution in Prompt Flow:
- Groundedness: Measures the alignment of the model’s answers with the input source, ensuring the model’s generated response is not fabricated. The model must always extract information from the provided “context” when responding to the user’s query.
- Relevance: Measures the degree to which the model’s generated response is closely connected to the context and user query.
- Retrieval score: Measures the extent to which the model’s retrieved documents are pertinent and directly related to the given questions.
- Custom metrics: While the three metrics above are the most important for evaluating RAG applications, Prompt Flow also lets you define custom metrics. You can bring your own LLM as a judge and define your own metrics by modifying the existing metaprompts, use open-source models such as Llama, or build metrics from code as Python functions (see the sketch after this list). The built-in evaluations are no-code or low-code friendly; for a more pro-code approach, the azureml-metrics SDK offers metrics such as ROUGE, BLEU, F1-score, Precision, and Accuracy.
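As a hedged illustration of a code-based custom metric, the sketch below computes a token-overlap F1 score between a generated answer and the ground truth. The `@tool` decorator import path assumes the promptflow SDK is installed, and the metric itself is just one possible choice for a Python node in an evaluation flow.

```python
# Sketch of a custom, code-based metric that could back a Python node in an
# evaluation flow. The @tool import path assumes the promptflow SDK; the
# token-overlap F1 logic is illustrative, not a Prompt Flow built-in.
from collections import Counter

from promptflow import tool


@tool
def token_f1(answer: str, ground_truth: str) -> float:
    """Token-overlap F1 between the generated answer and the ground truth."""
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not answer_tokens or not truth_tokens:
        return 0.0
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```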
The field is advancing quickly, so we recommend regularly checking Azure ML Prompt Flow’s latest updates on evaluation metrics. Start with the “Manual Evaluation” feature in Prompt Flow to gain a basic understanding of LLM performance. It’s important to use a mix of metrics so that the evaluation captures both the semantic and the syntactic quality of responses, rather than relying on a single metric to compare responses against the ground truth.
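As a rough illustration of mixing metric families, the sketch below pairs a semantic score (cosine similarity of Azure OpenAI embeddings) with a syntactic one (a character-level ratio from Python’s difflib). Neither measure is a Prompt Flow built-in, and the embedding deployment name and environment variable names are assumptions.

```python
# Sketch of combining a semantic metric with a syntactic metric for one example.
# The embedding deployment name and environment variable names are placeholders.
import difflib
import math
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def semantic_similarity(answer: str, ground_truth: str,
                        deployment: str = "text-embedding-3-small") -> float:
    """Cosine similarity between embeddings of the answer and the ground truth."""
    data = client.embeddings.create(model=deployment, input=[answer, ground_truth]).data
    a, b = data[0].embedding, data[1].embedding
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def syntactic_similarity(answer: str, ground_truth: str) -> float:
    """Character-level similarity as a cheap syntactic check."""
    return difflib.SequenceMatcher(None, answer, ground_truth).ratio()

answer = "You can return items within 30 days."
ground_truth = "Items can be returned within 30 days."
print({"semantic": semantic_similarity(answer, ground_truth),
       "syntactic": syntactic_similarity(answer, ground_truth)})
```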