Claude Code for LlamaIndex RAG Pipeline Debugging

Debugging Retrieval-Augmented Generation (RAG) pipelines built with LlamaIndex can be challenging, especially when dealing with complex document processing, embedding generation, and query understanding. Claude Code provides a powerful toolkit that makes debugging these pipelines significantly more manageable. This guide explores how to use Claude Code's skills and features to effectively debug LlamaIndex RAG pipelines, including practical code examples, diagnostic patterns, and strategies for fixing the most common failure modes.

Understanding RAG Pipeline Components

Before diving into debugging, it's essential to understand the key components of a LlamaIndex RAG pipeline:

  1. Document Loading - Reading and parsing various file formats
  2. Text Chunking - Splitting documents into manageable pieces
  3. Embedding Generation - Converting text chunks into vector representations
  4. Vector Storage - Storing embeddings in a database
  5. Query Processing - Transforming user queries for retrieval
  6. Retrieval - Finding relevant context from the vector store
  7. Response Synthesis - Generating answers using the retrieved context

Each component presents potential points of failure that require systematic debugging. The challenge is that failures are often silent: the pipeline runs without errors but produces irrelevant answers or empty results. A disciplined, stage-by-stage approach is the only reliable way to isolate root causes.

Where Failures Typically Occur

Pipeline Stage | Common Failure Mode | Visible Symptom
--- | --- | ---
Document Loading | Encoding errors, empty docs | Fewer docs than expected
Text Chunking | Chunks too large or too small | Poor retrieval relevance
Embedding Generation | Dimension mismatch, API errors | Empty result sets
Vector Storage | Index not persisted | Results reset on restart
Query Processing | Query too vague or too specific | Zero or wrong source nodes
Retrieval | Similarity threshold too high | Empty source nodes
Response Synthesis | Context window overflow | Truncated or hallucinated answers

Understanding which stage is failing saves hours of debugging. The approach below tests each stage in isolation before combining them.

Setting Up Claude Code for RAG Debugging

Start by ensuring Claude Code is properly configured with the necessary skills. The most relevant skills for RAG debugging include:

Place the python skill in ~/.claude/skills/python.md
Place the xlsx skill in ~/.claude/skills/xlsx.md
Place the docx skill in ~/.claude/skills/docx.md
Place the pdf skill in ~/.claude/skills/pdf.md

These skills enable Claude to read and analyze various document types, which is crucial when debugging the document loading phase. The python skill in particular lets Claude Code read your pipeline source files directly and suggest targeted fixes rather than generic advice.

You should also install LlamaIndex and the Arize Phoenix observability integration alongside your project dependencies:

pip install llama-index llama-index-callbacks-arize-phoenix arize-phoenix

Phoenix provides a full observability UI for LlamaIndex traces, which is invaluable when you need to see exactly what happened during a multi-step query.

Practical Debugging Techniques

  1. Inspecting Document Loading

One of the most common issues in RAG pipelines is improper document loading. Use Claude Code to examine loaded documents:

from llama_index.core import SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Inspect document metadata and content
for doc in documents[:3]:
    print(f"ID: {doc.doc_id}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content preview: {doc.text[:200]}...")
    print(f"Content length: {len(doc.text)} chars")
    print("---")

# Verify total count
print(f"\nTotal documents loaded: {len(documents)}")

A healthy document loading step should show consistent metadata across files and non-trivial content lengths. Watch for these red flags: empty or near-empty text fields, fewer documents than files on disk, missing or inconsistent metadata, and garbled characters that point to encoding problems.

To diagnose encoding issues specifically:

import chardet
from pathlib import Path

def detect_file_encoding(path: str) -> dict:
    with open(path, "rb") as f:
        raw = f.read()
    result = chardet.detect(raw)
    return {"file": path, "encoding": result["encoding"], "confidence": result["confidence"]}

# Run on all files in your data directory
for p in Path("./data").glob("*"):
    if p.is_file():
        info = detect_file_encoding(str(p))
        print(info)

If files are detected as anything other than UTF-8 with high confidence, pass the encoding explicitly when constructing the reader or convert files upstream.
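
If you need to normalize encodings up front, a minimal conversion sketch might look like the following; it assumes overwriting the files in place is acceptable, which you may not want for originals you cannot regenerate.

import chardet
from pathlib import Path

def convert_to_utf8(path: Path) -> None:
    """Re-encode a text file to UTF-8 using its detected source encoding."""
    raw = path.read_bytes()
    detected = chardet.detect(raw)
    source_encoding = detected["encoding"] or "utf-8"
    text = raw.decode(source_encoding, errors="replace")
    path.write_text(text, encoding="utf-8")

# Hypothetical usage: normalize every file in ./data before indexing
for p in Path("./data").glob("*"):
    if p.is_file():
        convert_to_utf8(p)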

Claude Code can help you identify issues such as encoding mismatches, empty documents from failed PDF or DOCX parsing, metadata that is dropped during loading, and files silently skipped because of unsupported extensions.

  2. Analyzing Text Chunking

Poor chunking can significantly impact retrieval quality. Debug chunk sizes and overlaps:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

# Test chunking on sample text
sample_text = "Your long document text here..."
chunks = text_splitter.split_text(sample_text)

print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:5]):
    print(f"Chunk {i}: {len(chunk)} chars")
    print(f"  Preview: {chunk[:80]}...")

Claude Code can help you determine optimal chunk sizes based on your specific use case and document structure.

A critical insight most guides omit: chunk size should be tuned against your embedding model's token limit and the typical length of answers your users need. Here is a practical comparison:

Use Case | Recommended Chunk Size | Overlap | Rationale
--- | --- | --- | ---
Legal or technical docs | 256–512 tokens | 20% | Dense facts; smaller chunks stay specific
News articles or blog posts | 512–1024 tokens | 10% | Narrative context matters
Code files | 256 tokens | 0% | Logical units (functions) should stay intact
Conversational transcripts | 128–256 tokens | 25% | Preserve conversational turns
Scientific papers | 512 tokens | 15% | Paragraph-level semantics

To measure chunk quality systematically, count how often your retrieved chunks contain the answer to a known question:

def evaluate_chunking(documents, queries_and_answers, chunk_size=512, chunk_overlap=50):
    """
    Simple chunk quality evaluator.
    queries_and_answers: list of (query_str, expected_keyword) tuples
    """
    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter

    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
    retriever = index.as_retriever(similarity_top_k=5)

    hits = 0
    for query, keyword in queries_and_answers:
        nodes = retriever.retrieve(query)
        combined = " ".join(n.text for n in nodes)
        if keyword.lower() in combined.lower():
            hits += 1

    recall = hits / len(queries_and_answers)
    print(f"chunk_size={chunk_size}, overlap={chunk_overlap} => recall={recall:.2%}")
    return recall

Run this function with several chunk size configurations and pick the one that maximizes recall on your test query set.
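
As a sketch of that sweep (the documents variable and the question-keyword pairs below are placeholders for your own data):

# Hypothetical test set: (query, keyword expected somewhere in the retrieved chunks)
qa_pairs = [
    ("What is the refund policy?", "refund"),
    ("Which regions are supported?", "region"),
]

results = {}
for size, overlap in [(256, 25), (512, 50), (1024, 100)]:
    results[(size, overlap)] = evaluate_chunking(
        documents, qa_pairs, chunk_size=size, chunk_overlap=overlap
    )

best_size, best_overlap = max(results, key=results.get)
print(f"Best configuration: chunk_size={best_size}, overlap={best_overlap}")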

  3. Validating Embeddings

Embedding generation issues can silently degrade retrieval quality. Debug embeddings with:

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
import numpy as np

# Configure embedding
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.embed_model = embed_model

# Test embedding generation
test_texts = ["What is AI?", "Machine learning is great", "Unrelated text about cooking"]
embeddings = embed_model.get_text_embedding_batch(test_texts)

print(f"Embedding dimension: {len(embeddings[0])}")

# Compute pairwise cosine similarities
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"AI vs ML similarity: {cosine_sim(embeddings[0], embeddings[1]):.4f}")
print(f"AI vs cooking similarity: {cosine_sim(embeddings[0], embeddings[2]):.4f}")

For a well-functioning embedding model you should see the related texts score noticeably higher than the unrelated pair. If all similarities cluster near 1.0 or near 0.0, suspect a misconfigured model or an API returning zeros.

A common source of silent failure is embedding dimension mismatch: you indexed documents with one model, then switched to another. Detect this before it causes problems:

def verify_index_embedding_consistency(index, embed_model):
    """Check that stored vectors match the current embed model's output dimension."""
    sample_embedding = embed_model.get_text_embedding("test")
    expected_dim = len(sample_embedding)

    docstore = index.docstore
    sample_nodes = list(docstore.docs.values())[:3]

    for node in sample_nodes:
        if hasattr(node, "embedding") and node.embedding:
            stored_dim = len(node.embedding)
            if stored_dim != expected_dim:
                print(f"MISMATCH: stored={stored_dim}, current model={expected_dim}")
                return False

    print(f"Embeddings consistent: {expected_dim} dims")
    return True
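
A quick way to run this check (assuming the index was persisted to ./storage, as shown in the best practices section later in this guide) is right after loading it:

from llama_index.core import Settings, StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Compare persisted vectors against whatever embedding model is currently configured
verify_index_embedding_consistency(index, Settings.embed_model)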

  4. Query Engine Debugging

The query engine often requires careful debugging to ensure proper retrieval:

from llama_index.core import VectorStoreIndex

# Create index
index = VectorStoreIndex.from_documents(documents)

# Test query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is the main topic?")

print(f"Response: {response.response}")
print(f"Source nodes: {len(response.source_nodes)}")
for i, node in enumerate(response.source_nodes):
    print(f"Node {i} score: {node.score:.4f}")
    print(f"Node {i} content: {node.text[:100]}...")

When source_nodes is empty, the problem is usually one of three things: the similarity threshold is too high, the query embedding is far from all document embeddings, or the vector store simply contains no documents. Debug each possibility in sequence:

# Step 1: Confirm the index contains nodes
print(f"Index node count: {len(index.docstore.docs)}")

# Step 2: Use a loose query and a higher top_k to confirm retrieval works at all
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("any text")
print(f"Retrieved with loose query: {len(nodes)} nodes")
for n in nodes:
    print(f"  Score: {n.score:.4f} | Text: {n.text[:60]}...")

# Step 3: Check that query and document embeddings share a space
query_embedding = index._embed_model.get_query_embedding("What is the main topic?")
print(f"Query embedding norm: {np.linalg.norm(query_embedding):.4f}")

A near-zero norm on the query embedding indicates an API failure or model misconfiguration.

  5. Using Claude Code's Analysis Skills

Beyond inline print statements, Claude Code's document and spreadsheet skills support deeper analysis of query logs and source files.

A practical workflow when debugging retrieval quality with Claude Code:

  1. Export your query logs to CSV (query, retrieved nodes, user rating); a minimal logging sketch follows this list
  2. Use the xlsx skill: "Find queries where retrieved node scores are all below 0.6 and user rating was negative"
  3. Claude Code will surface the patterns; often the culprit is a vocabulary mismatch between how documents are written and how users ask questions
  4. Use that insight to add synonyms to your metadata or adjust your query rewriting step
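
For step 1, a minimal logging sketch is shown below; the file name, the rating source, and the query_engine variable are assumptions to adapt to your own feedback loop.

import csv

def log_query(path, query, response, user_rating=None):
    """Append one query, its source node scores, a response preview, and an optional rating."""
    scores = [round(n.score, 4) if n.score is not None else None for n in response.source_nodes]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([query, scores, response.response[:200], user_rating])

# Hypothetical usage after each query
response = query_engine.query("How do I reset my password?")
log_query("query_log.csv", "How do I reset my password?", response, user_rating=-1)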

  6. Pipeline Performance Monitoring

Implement comprehensive logging to track pipeline performance:

import logging
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Enable verbose logging
logging.basicConfig(level=logging.DEBUG)

# Use LlamaIndex's tracing
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

Settings.callback_manager = callback_manager
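
After running a query with this handler attached, you can pull timing and event data back out of it. The calls below reflect the LlamaDebugHandler API as commonly shown in LlamaIndex examples; treat the exact method names as an assumption and check them against your installed version.

from llama_index.core.callbacks import CBEventType

# Assumes a query_engine built earlier in this guide
response = query_engine.query("What is the main topic?")

# Timing summaries per event type
print(llama_debug.get_event_time_info(CBEventType.LLM))
print(llama_debug.get_event_time_info(CBEventType.RETRIEVE))

# Start/end event pairs for each retrieval step
retrieve_events = llama_debug.get_event_pairs(CBEventType.RETRIEVE)
print(f"Retrieval steps recorded: {len(retrieve_events)}")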

For production-grade observability, connect Phoenix:

import phoenix as px
from llama_index.core import set_global_handler

# Launch Phoenix UI
px.launch_app()

# Wire into LlamaIndex
set_global_handler("arize_phoenix")

# Now every query will appear in the Phoenix UI at http://localhost:6006
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic?")

Phoenix shows you each span in the pipeline (document loading, embedding, retrieval, synthesis) with latencies and intermediate values. It is the fastest way to find which stage is adding unexpected latency.

  7. End-to-End Regression Testing

A debugging session is only valuable if you can prevent the same issue from recurring. Build a lightweight regression harness:

import json
from pathlib import Path

# golden_set.json: list of {query, expected_keywords, min_hits}
GOLDEN_SET = Path("./tests/golden_set.json")

def run_regression(query_engine, golden_path=GOLDEN_SET):
    results = []
    golden = json.loads(golden_path.read_text())

    for item in golden:
        response = query_engine.query(item["query"])
        combined = response.response + " " + " ".join(n.text for n in response.source_nodes)

        keyword_hits = sum(
            1 for kw in item["expected_keywords"]
            if kw.lower() in combined.lower()
        )
        passed = keyword_hits >= item.get("min_hits", 1)
        results.append({"query": item["query"], "passed": passed, "hits": keyword_hits})

    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"Regression pass rate: {pass_rate:.1%} ({sum(r['passed'] for r in results)}/{len(results)})")
    return results

Run this after every pipeline change. A drop in pass rate pinpoints which queries broke and guides your next debugging session.
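
The golden set itself is just a JSON file. A sketch of creating one (the questions and keywords are placeholders for your own domain):

import json
from pathlib import Path

golden = [
    {"query": "What is the refund window?", "expected_keywords": ["refund", "30 days"], "min_hits": 1},
    {"query": "Which plans include SSO?", "expected_keywords": ["SSO", "enterprise"], "min_hits": 1},
]

Path("./tests").mkdir(parents=True, exist_ok=True)
Path("./tests/golden_set.json").write_text(json.dumps(golden, indent=2))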

Common RAG Issues and Solutions

Issue 1: Retrieval Returns No Results

Symptoms: Queries return empty source nodes

Debugging Approach:

  1. Verify the vector store contains embeddings with len(index.docstore.docs)
  2. Check embedding dimension consistency between indexed documents and current model
  3. Test with exact text matches from known documents; if exact matches fail, the vector store is corrupted or empty
  4. Raise similarity_top_k to 20 and lower similarity_cutoff to 0.0 to confirm documents are there at all

Fix: Usually either re-index from scratch (if dimension mismatch) or reduce similarity_cutoff from the default (if threshold is too aggressive).
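
As a sketch of the second fix, one way to relax the cutoff is through a SimilarityPostprocessor on the query engine; the 0.2 value here is an arbitrary starting point for diagnosis, not a recommendation:

from llama_index.core.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.2)],
)
response = query_engine.query("What is the main topic?")
print(f"Source nodes returned: {len(response.source_nodes)}")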

Issue 2: Poor Response Quality

Symptoms: Responses are irrelevant or incomplete

Debugging Approach:

  1. Examine retrieved context for relevance: print the full text of each source node
  2. Adjust similarity threshold: index.as_retriever(similarity_top_k=5, similarity_cutoff=0.75)
  3. Review chunk size and overlap settings using the evaluation function above
  4. Inspect whether the LLM prompt template is including context correctly

Fix: Most quality issues trace back to chunks that are either too small (no context) or too large (diluted signal). Run the chunk evaluation function with three different sizes and compare recall.

Issue 3: Slow Query Performance

Symptoms: Queries take excessive time

Debugging Approach:

  1. Check vector database indexing: FAISS in-memory is fast; Chroma or Weaviate with network calls adds latency
  2. Review embedding batch sizes: batching 10 documents at a time instead of one at a time is roughly a 10x speed difference
  3. Analyze query complexity: multi-step query decomposition adds LLM calls

Fix: Profile each stage using the Phoenix callback handler. Typically either the embedding API call or the LLM synthesis step dominates. Switch to a local embedding model (e.g., BAAI/bge-small-en) for latency-sensitive applications.

Optimization | Latency Reduction | Trade-off
--- | --- | ---
Local embedding model | 3–10x | Slightly lower quality
Increase embedding batch size | 2–5x | Higher memory use
Cache query embeddings | 10–100x on repeat queries | Stale results if docs change
Reduce similarity_top_k from 10 to 3 | 2–3x | May miss relevant context
Use async query engine | Near-linear scaling | More complex error handling
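
For the first row of that table, swapping in a local embedding model is a small change. This sketch assumes the llama-index-embeddings-huggingface package is installed:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local model: no network round trip per embedding call
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Re-index so documents and queries share the new embedding space
index = VectorStoreIndex.from_documents(documents)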

Issue 4: Hallucinations and Fabricated Facts

Symptoms: LLM generates confident-sounding answers not supported by retrieved context

Debugging Approach:

  1. Print retrieved source nodes alongside the response and verify the claim appears in the context
  2. Check if similarity_top_k is too low, meaning the LLM is filling gaps with parametric memory
  3. Examine the system prompt for language that encourages the LLM to answer even when uncertain

Fix: Add an explicit instruction in the synthesizer prompt: "Only answer using the provided context. If the context does not contain the answer, say you don't know." Also increase similarity_top_k to provide more context options.
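
A sketch of wiring that instruction into the synthesis prompt; text_qa_template is passed through to the response synthesizer, and the exact wording is up to you:

from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Only answer using the provided context. If the context does not contain "
    "the answer, say you don't know.\n"
    "Question: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(
    similarity_top_k=8,
    text_qa_template=qa_prompt,
)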

Best Practices for RAG Debugging

  1. Incremental Testing - Test each pipeline component independently before integration. Never try to debug retrieval quality when you haven't yet confirmed documents are loading correctly.
  2. Version Control - Track configuration changes that affect pipeline behavior. Chunking parameters, similarity thresholds, and embedding models should be stored in config files committed to git, not hardcoded.
  3. Comprehensive Logging - Implement detailed logging at each stage. At minimum, log document count after loading, node count after chunking, embedding dimension on first generation, and source node scores on every query.
  4. Test Datasets - Maintain representative test documents and queries. A golden set of 20–50 question-answer pairs is the minimum needed to detect regressions. Include edge cases: very short questions, multi-hop reasoning, and questions the pipeline should decline to answer.
  5. Metric Tracking - Monitor retrieval precision, recall, and response quality over time. Even simple metrics like average source node score and response length tell you when the pipeline is degrading before users complain.
  6. Separate Indexing from Querying - Always persist your index to disk and load it separately. Re-indexing on every application start masks indexing bugs and adds unnecessary latency.
# Save index
index.storage_context.persist(persist_dir="./storage")

# Load index (separate process)
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

Advanced Debugging with Claude Code

For complex RAG issues, Claude Code can analyze your entire codebase at once, tracing how settings such as chunk size, embedding model, and similarity thresholds flow from configuration into the pipeline.

Claude Code can also help you implement advanced features like:

Hybrid search combines keyword and vector retrieval, improving recall for queries with rare terms that embeddings handle poorly:

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=5)

hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,
    mode="reciprocal_rerank",
)
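
To query through the fused retriever, one option is to wrap it in a RetrieverQueryEngine (a sketch; the query string is illustrative):

from llama_index.core.query_engine import RetrieverQueryEngine

hybrid_query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
response = hybrid_query_engine.query("What changed in the latest release?")
print(response.response)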

Re-ranking improves result quality: a cross-encoder re-ranker scores each retrieved chunk against the query more accurately than cosine similarity:

from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(model="BAAI/bge-reranker-base", top_n=3)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

Multi-step query decomposition breaks complex questions into sub-questions, retrieves context for each, and then synthesizes a final answer:

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

tools = [
    QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(),
        name="docs",
        description="Product documentation",
    )
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("Compare the performance of feature A and feature B in the latest release")

Debugging Checklist

Use this checklist before escalating a RAG pipeline issue:

  1. Documents load with the expected count and non-empty text
  2. Chunk sizes and overlaps have been checked against the guidance above
  3. The embedding dimension stored in the index matches the current model
  4. Retrieval still returns nodes when similarity_top_k is raised and the cutoff removed
  5. Source node scores and texts have been inspected for at least one failing query
  6. The golden-set regression suite still passes at its previous rate

Conclusion

Debugging LlamaIndex RAG pipelines requires a systematic approach and the right tools. Claude Code provides essential capabilities for analyzing documents, examining pipeline components, and identifying issues at each stage. By following the techniques outlined in this guide (stage-by-stage isolation, quantitative chunk evaluation, embedding consistency checks, and regression testing), you can effectively diagnose and resolve common RAG pipeline problems, leading to more reliable and accurate retrieval-augmented generation systems.

The most important mindset shift is treating debugging as measurement, not guesswork. Instrument every stage, track metrics over time, and build a golden test set before you need it. Claude Code accelerates this process by letting you analyze query logs, inspect documents, and generate targeted test cases without switching context.

Remember that successful RAG debugging is iterative: continuously monitor, analyze, and refine your pipeline based on real-world performance and user feedback.


Built by Michael Lip — solo dev, Da Nang.