Evaluating RAG Quality: Precision, Recall, and Faithfulness

Without evaluation, RAG systems cannot improve reliably. This article introduces practical metrics and evaluation strategies for measuring retrieval accuracy, answer grounding, and regression over time.

level: advanced
topics: rag
tags: rag, evaluation, metrics, llmops, production

Why “It Looks Good” Is Not Enough

After building a RAG system, engineers often test it like this:

# Manual testing
query = "How do I reset my password?"
answer = rag_system.query(query)
print(answer)

# Developer: "Looks good to me!"

This is not evaluation. This is a demo.

Production RAG systems need:

  1. Quantitative metrics: Measure performance objectively
  2. Regression detection: Know when changes break things
  3. Component-level insight: Identify where failures occur
  4. Continuous monitoring: Track quality over time

This article covers:

  • Metrics for retrieval quality
  • Metrics for generation quality
  • Building evaluation pipelines
  • Monitoring RAG in production

Evaluation Framework Overview

RAG Has Three Failure Modes

# Mode 1: Retrieval failure
query = "How do I authenticate?"
retrieved_docs = []  # Nothing retrieved
# Model cannot answer without context

# Mode 2: Ranking failure
retrieved_docs = [irrelevant_1, irrelevant_2, relevant_doc, ...]
# Relevant doc ranked too low, not included in context

# Mode 3: Generation failure
retrieved_docs = [relevant_doc_1, relevant_doc_2]
# Model generates answer not grounded in retrieved docs

Evaluation must cover all three modes.
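To make the three modes actionable, it helps to label each failing query. The helper below is an illustrative sketch, not a standard API: classify_failure, its 0.8 faithfulness threshold, and the context_limit parameter are all assumptions (faithfulness itself is defined in Part 2).

```python
def classify_failure(
    retrieved_ids: list[str],
    relevant_ids: set[str],
    faithfulness_score: float,
    context_limit: int = 3,
) -> str:
    """Label which of the three failure modes (if any) a query hit."""
    if not retrieved_ids:
        return "retrieval_failure"       # Mode 1: nothing retrieved
    in_context = set(retrieved_ids[:context_limit])
    if relevant_ids and not (in_context & relevant_ids):
        if set(retrieved_ids) & relevant_ids:
            return "ranking_failure"     # Mode 2: relevant doc ranked too low
        return "retrieval_failure"       # Relevant docs never retrieved at all
    if faithfulness_score < 0.8:
        return "generation_failure"      # Mode 3: answer not grounded in context
    return "ok"

print(classify_failure(['d1', 'd2', 'd3', 'd4'], {'d4'}, faithfulness_score=0.9))
# → ranking_failure (relevant doc retrieved, but outside the top-3 context window)
```

Tagging each test-suite failure this way tells you which component to fix before you start tuning.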


Part 1: Retrieval Metrics

Metric 1: Precision

Definition: What fraction of retrieved documents are relevant?

def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """
    Precision@K: Fraction of retrieved docs that are relevant.
    """
    retrieved_set = set(retrieved)
    relevant_retrieved = retrieved_set & relevant

    return len(relevant_retrieved) / len(retrieved) if retrieved else 0

# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}

precision = precision_at_k(retrieved, relevant)
# 2 relevant out of 5 retrieved = 0.4

What it tells you:

  • High precision: Few irrelevant results
  • Low precision: Too much noise

When to optimize:

  • Limited context window
  • High cost per retrieved document

Metric 2: Recall

Definition: What fraction of relevant documents were retrieved?

def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """
    Recall@K: Fraction of relevant docs that were retrieved.
    """
    retrieved_set = set(retrieved)
    relevant_retrieved = retrieved_set & relevant

    return len(relevant_retrieved) / len(relevant) if relevant else 0

# Example
retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}

recall = recall_at_k(retrieved, relevant)
# 2 relevant out of 3 total relevant = 0.67

What it tells you:

  • High recall: Found most relevant documents
  • Low recall: Missing important information

When to optimize:

  • Completeness critical
  • Can tolerate some noise
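Precision and recall pull in opposite directions as you retrieve more documents. A quick sweep over k makes the trade-off visible; the two functions from above are restated here so the snippet runs on its own.

```python
def precision_at_k(retrieved: list[str], relevant: set[str]) -> float:
    hits = len(set(retrieved) & relevant)
    return hits / len(retrieved) if retrieved else 0.0

def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant) if relevant else 0.0

retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
relevant = {'doc2', 'doc5', 'doc7'}

# Truncate the ranked list at each k and score both metrics
for k in range(1, 6):
    top_k = retrieved[:k]
    print(f"k={k}  precision={precision_at_k(top_k, relevant):.2f}  "
          f"recall={recall_at_k(top_k, relevant):.2f}")
# Recall can only grow with k; precision generally dilutes as noise enters
```

Choosing top_k for your retriever is exactly this trade-off: pick the k where recall is acceptable without flooding the context window.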

Metric 3: Mean Reciprocal Rank (MRR)

Definition: Average of reciprocal ranks of first relevant result.

def mean_reciprocal_rank(results: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """
    MRR: How high is the first relevant result ranked?
    """
    reciprocal_ranks = []

    for retrieved, relevant in zip(results, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                reciprocal_ranks.append(1 / rank)
                break
        else:
            reciprocal_ranks.append(0)  # No relevant doc found

    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Example
results = [
    ['doc1', 'doc2', 'doc3'],  # First relevant at position 2
    ['doc4', 'doc5', 'doc6'],  # First relevant at position 1
    ['doc7', 'doc8', 'doc9']   # No relevant doc
]
relevant_sets = [
    {'doc2', 'doc5'},
    {'doc4'},
    {'doc10'}
]

mrr = mean_reciprocal_rank(results, relevant_sets)
# (1/2 + 1/1 + 0) / 3 = 0.5

What it tells you:

  • High MRR: Relevant docs ranked high
  • Low MRR: Relevant docs ranked low or missing

When to optimize:

  • User experience depends on top results
  • Top-k context window limit

Metric 4: Normalized Discounted Cumulative Gain (NDCG)

Definition: Measures ranking quality with graded relevance.

import numpy as np

def dcg_at_k(relevance_scores: list[float], k: int) -> float:
    """
    DCG: Weighted sum of relevance scores.
    Higher-ranked docs have more weight.
    """
    relevance_scores = np.array(relevance_scores[:k])
    if relevance_scores.size == 0:
        return 0.0

    # Discount by log of position
    discounts = np.log2(np.arange(2, relevance_scores.size + 2))
    return np.sum(relevance_scores / discounts)

def ndcg_at_k(relevance_scores: list[float], k: int) -> float:
    """
    NDCG: DCG normalized by ideal DCG.
    Score between 0 and 1.
    """
    dcg = dcg_at_k(relevance_scores, k)

    # Ideal: Sort by relevance (best possible ranking)
    ideal_scores = sorted(relevance_scores, reverse=True)
    idcg = dcg_at_k(ideal_scores, k)

    return dcg / idcg if idcg > 0 else 0.0

# Example
# Query: "API authentication"
# Retrieved docs with relevance scores (3=perfect, 2=good, 1=partial, 0=irrelevant)
retrieved_relevance = [1, 3, 0, 2, 0]  # Relevance of each retrieved doc, in rank order
# The most relevant doc (score 3) is ranked 2nd instead of 1st

ndcg = ndcg_at_k(retrieved_relevance, k=5)
# ≈ 0.79: below 1.0 because the most relevant doc is ranked 2nd instead of 1st

What it tells you:

  • NDCG=1.0: Perfect ranking
  • Lower NDCG: Relevant docs ranked suboptimally

When to optimize:

  • Multiple levels of relevance
  • Ranking order matters

Part 2: Generation Metrics

Metric 1: Faithfulness (Answer Grounding)

Definition: Is the generated answer supported by retrieved documents?

class FaithfulnessEvaluator:
    """
    Measure if answer is grounded in retrieved context.
    """

    def evaluate(self, answer: str, retrieved_docs: list[str]) -> dict:
        # Extract claims from answer
        claims = self.extract_claims(answer)

        # Check each claim against retrieved docs
        results = {
            'total_claims': len(claims),
            'supported_claims': 0,
            'unsupported_claims': []
        }

        for claim in claims:
            if self.is_supported(claim, retrieved_docs):
                results['supported_claims'] += 1
            else:
                results['unsupported_claims'].append(claim)

        results['faithfulness_score'] = (
            results['supported_claims'] / results['total_claims']
            if results['total_claims'] > 0 else 0
        )

        return results

    def extract_claims(self, answer: str) -> list[str]:
        """
        Use LLM to break answer into atomic claims.
        """
        prompt = f"""
        Break this answer into individual factual claims.
        Each claim should be a single, verifiable statement.

        Answer: {answer}

        Claims (one per line):
        """
        claims_text = llm.generate(prompt)
        return [c.strip() for c in claims_text.split('\n') if c.strip()]

    def is_supported(self, claim: str, docs: list[str]) -> bool:
        """
        Check if claim is supported by documents.
        """
        context = '\n\n'.join(docs)

        verification_prompt = f"""
        Context:
        {context}

        Claim: {claim}

        Is this claim directly supported by the context?
        Answer only: YES or NO
        """

        result = llm.generate(verification_prompt).strip().upper()
        return result.startswith('YES')  # Tolerate trailing punctuation

# Example usage
answer = "The API rate limit is 1000 requests per hour, and it costs $0.01 per request."
retrieved_docs = ["API rate limit: 1000 req/hour"]

evaluator = FaithfulnessEvaluator()
faithfulness = evaluator.evaluate(answer, retrieved_docs)

# Result:
# {
#   'total_claims': 2,
#   'supported_claims': 1,  # Rate limit claim supported
#   'unsupported_claims': ['costs $0.01 per request'],  # Hallucinated
#   'faithfulness_score': 0.5
# }

What it tells you:

  • Low faithfulness: Model hallucinating
  • High faithfulness: Answers grounded in context

Metric 2: Relevance (Answer Completeness)

Definition: Does the answer address the question?

class RelevanceEvaluator:
    """
    Measure if answer actually addresses the question.
    """

    def evaluate(self, query: str, answer: str) -> float:
        prompt = f"""
        Query: {query}
        Answer: {answer}

        Does this answer fully address the query?
        Rate from 1-5:
        5 = Completely addresses query
        4 = Addresses query with minor gaps
        3 = Partially addresses query
        2 = Tangentially related
        1 = Does not address query

        Rating (1-5):
        """

        rating = llm.generate(prompt).strip()
        try:
            # Take the leading character in case the model appends explanation
            return int(rating[0]) / 5.0  # Normalize to 0-1
        except (ValueError, IndexError):
            return 0.0

# Example
query = "How do I authenticate with the API?"
answer = "The API uses OAuth 2.0. Contact support for credentials."

relevance = RelevanceEvaluator().evaluate(query, answer)
# Might score 3/5: Mentions auth method but lacks implementation details

Metric 3: Context Precision

Definition: What fraction of retrieved context is actually useful for answering?

def context_precision(query: str, answer: str, retrieved_docs: list[str]) -> float:
    """
    How many retrieved docs were actually needed?
    """
    useful_docs = 0

    for doc in retrieved_docs:
        # Check if this doc contributed to the answer
        prompt = f"""
        Query: {query}
        Answer: {answer}
        Document: {doc}

        Was this document useful for generating the answer?
        Answer: YES or NO
        """

        result = llm.generate(prompt).strip().upper()
        if result.startswith('YES'):
            useful_docs += 1

    return useful_docs / len(retrieved_docs) if retrieved_docs else 0

# High context precision = retrieval is efficient
# Low context precision = retrieving too much noise

Metric 4: Context Recall

Definition: Did retrieval capture all information needed to answer?

def context_recall(query: str, ground_truth: str, retrieved_docs: list[str]) -> float:
    """
    Can the ground truth answer be generated from retrieved docs?
    """
    context = '\n\n'.join(retrieved_docs)

    prompt = f"""
    Query: {query}
    Ground truth answer: {ground_truth}

    Context:
    {context}

    What fraction of the ground truth answer can be supported by this context?
    Answer with a number between 0.0 (none) and 1.0 (all).

    Fraction:
    """

    result = llm.generate(prompt).strip()
    try:
        return float(result)
    except ValueError:
        return 0.0

# High context recall = retrieval captured necessary info
# Low context recall = missing key documents

Part 3: Building an Evaluation Pipeline

Creating Test Cases

from pydantic import BaseModel

class RAGTestCase(BaseModel):
    """
    Single evaluation test case.
    """
    query: str
    ground_truth_answer: str
    relevant_doc_ids: set[str]
    metadata: dict = {}

def create_test_suite() -> list[RAGTestCase]:
    """
    Curate evaluation test cases.
    """
    return [
        RAGTestCase(
            query="How do I reset my password?",
            ground_truth_answer="Navigate to Settings > Security > Reset Password...",
            relevant_doc_ids={'doc_123', 'doc_456'}
        ),
        RAGTestCase(
            query="What is the API rate limit?",
            ground_truth_answer="1000 requests per hour for free tier",
            relevant_doc_ids={'doc_789'}
        ),
        # Add 50-100 diverse test cases
    ]
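A test suite only pays off if it is versioned and reused across runs. One lightweight option is to store cases as JSONL next to the code; the sketch below uses a simplified dataclass stand-in (TestCaseRecord, save_test_suite, and load_test_suite are hypothetical names, not part of any framework).

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TestCaseRecord:
    """Simplified, hypothetical stand-in for the RAGTestCase model above."""
    query: str
    ground_truth_answer: str
    relevant_doc_ids: list[str]

def save_test_suite(cases: list[TestCaseRecord], path: str) -> None:
    # One JSON object per line, so diffs stay readable in version control
    with open(path, 'w') as f:
        for case in cases:
            f.write(json.dumps(asdict(case)) + '\n')

def load_test_suite(path: str) -> list[TestCaseRecord]:
    with open(path) as f:
        return [TestCaseRecord(**json.loads(line)) for line in f if line.strip()]
```

Keeping the suite in the repository means every change to chunking, embeddings, or prompts runs against the same cases.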

Evaluation Runner

from statistics import mean

class RAGEvaluator:
    """
    Run a comprehensive evaluation on a RAG system.
    """

    def __init__(self, rag_system, test_cases: list[RAGTestCase]):
        self.rag_system = rag_system
        self.test_cases = test_cases

    def evaluate_all(self) -> dict:
        """
        Run full evaluation suite.
        """
        results = {
            'retrieval_metrics': self.evaluate_retrieval(),
            'generation_metrics': self.evaluate_generation(),
            'end_to_end_metrics': self.evaluate_end_to_end()
        }

        return results

    def evaluate_retrieval(self) -> dict:
        """
        Evaluate retrieval quality.
        """
        precisions = []
        recalls = []
        mrrs = []

        for case in self.test_cases:
            retrieved = self.rag_system.retrieve(case.query, top_k=5)
            retrieved_ids = [r['id'] for r in retrieved]

            # Calculate metrics
            precision = precision_at_k(retrieved_ids, case.relevant_doc_ids)
            recall = recall_at_k(retrieved_ids, case.relevant_doc_ids)

            precisions.append(precision)
            recalls.append(recall)

            # MRR
            for rank, doc_id in enumerate(retrieved_ids, start=1):
                if doc_id in case.relevant_doc_ids:
                    mrrs.append(1 / rank)
                    break
            else:
                mrrs.append(0)

        return {
            'precision@5': mean(precisions),
            'recall@5': mean(recalls),
            'mrr': mean(mrrs)
        }

    def evaluate_generation(self) -> dict:
        """
        Evaluate generation quality.
        """
        faithfulness_scores = []
        relevance_scores = []

        for case in self.test_cases:
            # Get RAG system output
            result = self.rag_system.query(case.query)
            answer = result['answer']
            retrieved_docs = [r['content'] for r in result['retrieved_docs']]

            # Faithfulness
            faith_eval = FaithfulnessEvaluator()
            faith_result = faith_eval.evaluate(answer, retrieved_docs)
            faithfulness_scores.append(faith_result['faithfulness_score'])

            # Relevance
            rel_eval = RelevanceEvaluator()
            relevance_scores.append(rel_eval.evaluate(case.query, answer))

        return {
            'faithfulness': mean(faithfulness_scores),
            'relevance': mean(relevance_scores)
        }

    def evaluate_end_to_end(self) -> dict:
        """
        End-to-end evaluation comparing to ground truth.
        """
        exact_matches = 0
        semantic_similarities = []

        for case in self.test_cases:
            result = self.rag_system.query(case.query)
            answer = result['answer']

            # Exact match (rare but possible)
            if answer.strip().lower() == case.ground_truth_answer.strip().lower():
                exact_matches += 1

            # Semantic similarity
            similarity = compute_semantic_similarity(
                answer,
                case.ground_truth_answer
            )
            semantic_similarities.append(similarity)

        return {
            'exact_match': exact_matches / len(self.test_cases),
            'semantic_similarity': mean(semantic_similarities)
        }

# Run evaluation
evaluator = RAGEvaluator(rag_system, test_cases)
results = evaluator.evaluate_all()

print(f"Retrieval Precision: {results['retrieval_metrics']['precision@5']:.3f}")
print(f"Faithfulness: {results['generation_metrics']['faithfulness']:.3f}")

Part 4: Automated Evaluation with LLMs

Using LLM-as-Judge

import json

class LLMJudge:
    """
    Use an LLM to evaluate RAG quality.
    Faster and cheaper than human evaluation, though less reliable.
    """

    def evaluate_answer_quality(
        self,
        query: str,
        answer: str,
        retrieved_docs: list[str],
        ground_truth: str | None = None
    ) -> dict:
        """
        Multi-aspect LLM-based evaluation.
        """
        context = '\n\n'.join(retrieved_docs)

        eval_prompt = f"""
        Evaluate this RAG system output.

        Query: {query}
        Retrieved Context:
        {context}

        Generated Answer: {answer}
        {f'Ground Truth: {ground_truth}' if ground_truth else ''}

        Evaluate on these dimensions (score 1-5 for each):

        1. Faithfulness: Is the answer supported by the retrieved context?
        2. Relevance: Does the answer address the query?
        3. Completeness: Does the answer fully address the query?
        4. Conciseness: Is the answer appropriately concise?

        Provide scores and brief justification for each.

        Output format (JSON):
        {{
            "faithfulness": {{"score": 1-5, "justification": "..."}},
            "relevance": {{"score": 1-5, "justification": "..."}},
            "completeness": {{"score": 1-5, "justification": "..."}},
            "conciseness": {{"score": 1-5, "justification": "..."}}
        }}
        """

        result = llm.generate(eval_prompt)
        return json.loads(result)

# Example usage
judge = LLMJudge()
scores = judge.evaluate_answer_quality(
    query="How do I authenticate?",
    answer="Use OAuth 2.0 with client credentials.",
    retrieved_docs=["API uses OAuth 2.0..."],
    ground_truth="Configure OAuth 2.0 with client ID and secret..."
)

print(f"Faithfulness: {scores['faithfulness']['score']}/5")
print(f"Relevance: {scores['relevance']['score']}/5")

Pairwise Comparison

def compare_rag_versions(query: str, answer_a: str, answer_b: str) -> str:
    """
    Compare two RAG system versions.
    Often more reliable than absolute scoring.
    """
    prompt = f"""
    Query: {query}

    Answer A: {answer_a}
    Answer B: {answer_b}

    Which answer is better?
    Consider:
    - Accuracy
    - Completeness
    - Clarity

    Output: "A", "B", or "TIE"
    """

    return llm.generate(prompt).strip()

# A/B testing RAG improvements
wins = {'A': 0, 'B': 0, 'TIE': 0}

for case in test_cases:
    answer_a = rag_v1.query(case.query)['answer']
    answer_b = rag_v2.query(case.query)['answer']

    winner = compare_rag_versions(case.query, answer_a, answer_b)
    wins[winner] = wins.get(winner, 0) + 1  # Tolerate unexpected judge output

print(f"Version A: {wins['A']} wins")
print(f"Version B: {wins['B']} wins")
print(f"Ties: {wins['TIE']}")

Part 5: Production Monitoring

Real-Time Metrics

from datetime import datetime

class RAGMonitor:
    """
    Track RAG performance in production.
    """

    def __init__(self):
        self.metrics_buffer = []

    def log_query(
        self,
        query: str,
        answer: str,
        retrieved_docs: list[dict],
        latency_ms: float,
        user_feedback: str | None = None
    ):
        """
        Log every RAG query for monitoring.
        """
        metric = {
            'timestamp': datetime.now(),
            'query_length': len(query.split()),
            'answer_length': len(answer.split()),
            'num_docs_retrieved': len(retrieved_docs),
            'latency_ms': latency_ms,
            'user_feedback': user_feedback,  # thumbs up/down
            'retrieval_scores': [d['score'] for d in retrieved_docs]
        }

        self.metrics_buffer.append(metric)

        # Alert on anomalies
        self.check_anomalies(metric)

    def check_anomalies(self, metric: dict):
        """
        Detect potential issues.
        """
        # High latency
        if metric['latency_ms'] > 5000:
            alert('High latency', metric)

        # No documents retrieved
        if metric['num_docs_retrieved'] == 0:
            alert('Retrieval failure', metric)

        # Low retrieval scores
        if metric['retrieval_scores'] and max(metric['retrieval_scores']) < 0.5:
            alert('Low relevance scores', metric)

    def generate_report(self, time_window: str = '24h') -> dict:
        """
        Aggregate metrics over time window.
        """
        recent_metrics = filter_by_time(self.metrics_buffer, time_window)

        if not recent_metrics:
            return {'total_queries': 0}

        return {
            'total_queries': len(recent_metrics),
            'avg_latency_ms': mean([m['latency_ms'] for m in recent_metrics]),
            'avg_docs_retrieved': mean([m['num_docs_retrieved'] for m in recent_metrics]),
            'positive_feedback_rate': sum(
                1 for m in recent_metrics
                if m['user_feedback'] == 'positive'
            ) / len(recent_metrics),
            'zero_results_rate': sum(
                1 for m in recent_metrics
                if m['num_docs_retrieved'] == 0
            ) / len(recent_metrics)
        }
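Wiring the monitor into the request path is mostly a matter of timing the call and forwarding the result. A minimal sketch, assuming the rag_system and RAGMonitor interfaces used throughout this article:

```python
import time

def answer_with_monitoring(rag_system, monitor, query: str) -> str:
    """Time the RAG call and log it before returning the answer."""
    start = time.perf_counter()
    result = rag_system.query(query)
    latency_ms = (time.perf_counter() - start) * 1000

    monitor.log_query(
        query=query,
        answer=result['answer'],
        retrieved_docs=result['retrieved_docs'],
        latency_ms=latency_ms,
    )
    return result['answer']
```

Logging on every request (rather than sampling) keeps the zero-results and latency alerts in check_anomalies trustworthy.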

Regression Detection

class RegressionDetector:
    """
    Detect when RAG quality degrades.
    """

    def __init__(self, test_cases: list[RAGTestCase]):
        self.test_cases = test_cases
        self.baseline_metrics = None

    def establish_baseline(self, rag_system):
        """
        Run evaluation and save as baseline.
        """
        evaluator = RAGEvaluator(rag_system, self.test_cases)
        self.baseline_metrics = evaluator.evaluate_all()

    def detect_regression(self, rag_system, threshold: float = 0.05):
        """
        Check if current performance dropped significantly.
        """
        evaluator = RAGEvaluator(rag_system, self.test_cases)
        current_metrics = evaluator.evaluate_all()

        regressions = []

        # Compare retrieval metrics
        for metric in ['precision@5', 'recall@5', 'mrr']:
            baseline = self.baseline_metrics['retrieval_metrics'][metric]
            current = current_metrics['retrieval_metrics'][metric]
            delta = baseline - current

            if delta > threshold:
                regressions.append({
                    'metric': metric,
                    'baseline': baseline,
                    'current': current,
                    'delta': delta
                })

        # Compare generation metrics
        for metric in ['faithfulness', 'relevance']:
            baseline = self.baseline_metrics['generation_metrics'][metric]
            current = current_metrics['generation_metrics'][metric]
            delta = baseline - current

            if delta > threshold:
                regressions.append({
                    'metric': metric,
                    'baseline': baseline,
                    'current': current,
                    'delta': delta
                })

        return regressions

# Usage in CI/CD
detector = RegressionDetector(test_cases)
detector.establish_baseline(rag_system_v1)

# Before deploying v2
regressions = detector.detect_regression(rag_system_v2)

if regressions:
    print("REGRESSION DETECTED:")
    for reg in regressions:
        print(f"{reg['metric']}: {reg['baseline']:.3f} → {reg['current']:.3f}")
    raise RuntimeError("Cannot deploy: performance regression")

Conclusion

Evaluation is not optional for production RAG.

Key practices:

  1. Measure retrieval separately: Precision, recall, MRR, NDCG
  2. Measure generation quality: Faithfulness, relevance, completeness
  3. Automate evaluation: Use LLM-as-judge for scalability
  4. Monitor continuously: Track metrics in production
  5. Detect regressions: Prevent quality degradation

Build evaluation infrastructure early. Without it, you are optimizing blindly.

The goal is not perfect scores. The goal is measurable improvement over time.
