Evaluating RAG with Ragas
Retrieval-Augmented Generation (RAG) systems combine a retriever and a generator. Measuring quality on the final answer alone is often insufficient: failures may come from irrelevant or incomplete retrieval, from generation that drifts away from sources, or from both.
Ragas (Retrieval-Augmented Generation Assessment) is a Python framework that scores RAG behavior with metrics for faithfulness to context, answer relevance, retrieval quality, and more—depending on which metrics are enabled and which columns are available in the evaluation set.
This page focuses on using the Ragas SDK directly in a notebook or batch job. Integration with other evaluation platforms is optional and not covered here.
TOC

- What to record for each evaluation example
- Ragas metrics overview
- Core RAG metrics
- Optional RAG metrics
- Choosing a minimal RAG set
- Calling the Ragas SDK
- Prerequisites
- Runnable notebook
- Troubleshooting
- Interpreting results
- Further reading

What to record for each evaluation example
A typical single-turn row includes:

- user_input: the question posed to the RAG system.
- retrieved_contexts: the passages returned by the retriever for that question.
- response: the answer produced by the generator.
- reference: a ground-truth answer, when available (required by reference-based metrics).
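As a concrete illustration, a single evaluation row can be represented as a plain dictionary using the modern Ragas column names; the question, contexts, and answers below are invented placeholder data:

```python
# One single-turn evaluation row, using the modern Ragas column names.
# The strings are placeholder data for illustration only.
row = {
    "user_input": "What is the capital of France?",   # question sent to the RAG system
    "retrieved_contexts": [                           # passages returned by the retriever
        "Paris is the capital and largest city of France.",
    ],
    "response": "The capital of France is Paris.",    # generated answer
    "reference": "Paris",                             # ground-truth answer (needed by some metrics)
}

print(sorted(row.keys()))
```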
Ragas metrics overview
Ragas exposes a large catalog of metrics. Only a subset is needed for a typical RAG evaluation pass. Each metric expects specific dataset columns (for example question, contexts, answer, ground_truth) and may require an LLM, embeddings, both, or neither. Names and import paths evolve across releases; confirm the installed-version guidance in the Ragas metrics documentation.
The lists below summarize intent and common use; they are not a substitute for upstream API details.
Core RAG metrics
These metrics are most directly aligned with retrieval-augmented generation quality and are usually the first set to track:

- Faithfulness: checks that claims in the generated answer are supported by the retrieved contexts.
- Answer relevancy: checks that the answer actually addresses the question.
- Context precision: checks that relevant retrieved contexts are ranked above irrelevant ones.
- Context recall: checks that the retrieved contexts cover the information needed by the reference answer.
Unless otherwise noted, classes in this section are from ragas.metrics.collections.
Field and import requirements can vary by Ragas version and metric variant. Confirm against the installed version in the Ragas metrics documentation.
Optional RAG metrics
These metrics are useful in specific evaluation setups, especially when reference answers are available or when robustness checks are needed:

- Answer correctness: grades the generated answer against a reference answer.
- Semantic similarity: measures embedding-based similarity between the answer and the reference.
- Noise sensitivity: measures how much irrelevant retrieved context degrades the answer.
For metrics that are weakly related to RAG core evaluation (for example generic text-overlap metrics, rubric-based custom metrics, agent/tool metrics, SQL metrics, or multimodal metrics), refer to the Ragas metrics documentation.
Choosing a minimal RAG set
A practical default for many RAG benchmarks is: faithfulness, answer relevancy, context precision, and context recall (recall and some precision variants need ground_truth or equivalent). Add answer correctness or semantic similarity when a reference answer is available. Match metrics to the columns present in the dataset and to cost constraints (LLM-heavy metrics are slower and more expensive).
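One way to keep the metric set matched to the columns actually present is a small selection helper. select_metrics below is a hypothetical function, not part of Ragas, and its metric-to-column mapping is an assumption that follows the defaults described above (reference-based metrics require a ground-truth column):

```python
# Hypothetical mapping from metric name to the dataset columns it needs.
# Reference-dependent metrics are only selected when "reference" is present.
REQUIRED_COLUMNS = {
    "faithfulness": {"user_input", "retrieved_contexts", "response"},
    "answer_relevancy": {"user_input", "response"},
    "context_precision": {"user_input", "retrieved_contexts", "reference"},
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
    "answer_correctness": {"user_input", "response", "reference"},
}

def select_metrics(columns):
    """Return the metric names whose required columns are all available."""
    available = set(columns)
    return [name for name, needed in REQUIRED_COLUMNS.items() if needed <= available]

# Without a reference column, only reference-free metrics remain.
print(select_metrics(["user_input", "retrieved_contexts", "response"]))
# -> ['faithfulness', 'answer_relevancy']
```

Gating metric selection on columns this way also keeps cost predictable: LLM-heavy reference metrics are simply never scheduled when the dataset cannot support them.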
Calling the Ragas SDK
For modern Ragas usage, instantiate metrics from ragas.metrics.collections and score each row using ascore() (or score() in synchronous scripts).
- Prepare OpenAI-compatible clients (AsyncOpenAI) for LLM and embeddings, then wire llm_factory and OpenAIEmbeddings (see the sample notebook for environment-variable configuration).
- Instantiate metrics with explicit dependencies (llm, embeddings where required).
- Iterate through rows and call metric.ascore(...) with metric-specific arguments.
When selecting metrics, the following differences affect how the scoring call is prepared:

- Dependencies: some metrics need an LLM, some need embeddings, some need both, and some need neither.
- Required fields: each metric expects specific arguments (for example user_input, retrieved_contexts, response, reference), and reference-based metrics cannot run without a ground-truth column.
- Execution model: ascore(...) is asynchronous; synchronous scripts use score(...) or wrap the call with asyncio.run(...).
In practice, this means the main work is to align dataset fields and metric selection, then score rows with the chosen metric instances.
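The per-row loop can be sketched with a stand-in metric object. DummyFaithfulness below is a stub with the same calling shape as the ascore() step described above, so the example runs without any API access; a real ragas.metrics.collections metric would call a judge LLM instead:

```python
import asyncio

class DummyFaithfulness:
    """Stub standing in for a Ragas metric: same async ascore() calling shape,
    but it returns a constant instead of calling a judge LLM."""
    async def ascore(self, user_input, response, retrieved_contexts):
        return 1.0 if retrieved_contexts else 0.0

async def score_rows(metric, rows):
    """Score each dataset row with one metric instance, collecting per-row scores."""
    scores = []
    for row in rows:
        # Pass only the fields this metric expects, keyed by column name.
        score = await metric.ascore(
            user_input=row["user_input"],
            response=row["response"],
            retrieved_contexts=row["retrieved_contexts"],
        )
        scores.append(score)
    return scores

rows = [
    {"user_input": "q1", "response": "a1", "retrieved_contexts": ["ctx"]},
    {"user_input": "q2", "response": "a2", "retrieved_contexts": []},
]
print(asyncio.run(score_rows(DummyFaithfulness(), rows)))  # -> [1.0, 0.0]
```

In a notebook, which already runs an event loop, the inner call would be `await score_rows(...)` rather than `asyncio.run(...)`.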
Prerequisites
- Python 3.10+ recommended.
- Network access to an LLM API (and to an embeddings API for metrics that need embeddings). The sample notebook assumes an OpenAI-compatible setup and supports configuring credentials and an optional base URL for compatible gateways.
- Awareness that evaluation issues many model calls; cost and latency scale with rows and metrics.
- Version pinning: Ragas APIs and metric classes change between releases. For reproducible benchmarks, pin ragas (and related packages) in the environment or notebook; see the commented install line in the sample notebook.
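Credential and endpoint configuration can be kept in environment variables. The sketch below only uses standard-library calls; the variable names OPENAI_API_KEY and OPENAI_BASE_URL follow common OpenAI-compatible conventions and are assumptions here, so confirm the exact names against the sample notebook:

```python
import os

# Placeholder credential so the sketch runs standalone; in practice the
# variable is set outside the process (shell, .env, secret manager).
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")

api_key = os.environ["OPENAI_API_KEY"]
base_url = os.environ.get("OPENAI_BASE_URL")  # None -> use the provider default

# Build the keyword arguments for an OpenAI-compatible client; the optional
# base URL points the client at a compatible gateway when one is configured.
client_kwargs = {"api_key": api_key}
if base_url:
    client_kwargs["base_url"] = base_url

# These kwargs would then be passed to AsyncOpenAI(**client_kwargs) for the
# LLM client, and to a second client if embeddings use a separate endpoint.
print(sorted(client_kwargs.keys()))
```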
Runnable notebook
Download and open the notebook in JupyterLab or another Jupyter environment:
The notebook opens with a short SDK recap focused on modern metrics (ragas.metrics.collections) and explicit LLM/embedding setup. The canonical explanation is the Calling the Ragas SDK section on this page.
The notebook:
- Installs dependencies (with an optional commented version pin for reproducibility).
- Creates a small datasets.Dataset with user_input, retrieved_contexts, response, and reference.
- Runs baseline evaluation with faithfulness and answer relevancy using modern metric classes.
- Adds optional retrieval-focused metrics (context precision and context recall) using modern metric classes.
- Shows aggregate and per-row results, followed by a short troubleshooting section.
Troubleshooting
- Credentials or endpoint configuration: configure LLM API credentials (and an optional base URL for compatible gateways). If embeddings use a separate endpoint, configure embeddings credentials as well, then pass separate AsyncOpenAI clients into llm_factory and OpenAIEmbeddings.
- Dataset validation errors: verify required arguments for selected metrics and ensure dataset keys align with modern examples (user_input, retrieved_contexts, response, reference).
- Notebook async execution: the sample notebook uses await metric.ascore(...). For synchronous scripts, use metric.score(...) or wrap async code with asyncio.run(...).
- Version-related warnings: metric classes and signatures can change across Ragas versions. Pin package versions for reproducible runs and confirm behavior against the installed version documentation.
Interpreting results
- Compare scores only under the same dataset and evaluation configuration (judge LLM, embeddings, and prompts); otherwise shifts may reflect configuration changes rather than RAG quality.
- For retrieval-oriented evaluation, use the same embedding model as the production RAG retriever whenever possible to reduce metric drift caused by mismatched embedding spaces.
- Use aggregate scores for trend tracking or quality gates, and per-row scores for diagnosis (for example missing context, hallucination, or irrelevant retrieval). Treat metric values as directional signals, not absolute truth.
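The aggregate-versus-per-row split can be sketched in plain Python; the scores and the 0.5 cutoff below are arbitrary illustrative values, not Ragas recommendations:

```python
# Per-row faithfulness scores for a hypothetical 5-row evaluation run.
scores = [0.9, 0.95, 0.2, 0.85, 0.4]

# Aggregate score: useful for trend tracking or a quality gate.
aggregate = sum(scores) / len(scores)

# Per-row view: flag low-scoring rows for manual diagnosis
# (e.g. missing context, hallucination, irrelevant retrieval).
THRESHOLD = 0.5  # arbitrary cutoff for illustration
flagged = [i for i, s in enumerate(scores) if s < THRESHOLD]

print(round(aggregate, 2))  # -> 0.66
print(flagged)              # rows needing inspection -> [2, 4]
```

A gate would compare the aggregate against a target, while the flagged indices point back to the specific dataset rows worth reading alongside their retrieved contexts.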
Further reading
- Ragas documentation: https://docs.ragas.io/
- Ragas GitHub repository: https://github.com/vibrantlabsai/ragas