End-to-End RAG System
Question Answering with Retrieval-Augmented Generation
Overview
Built a comprehensive Retrieval-Augmented Generation (RAG) system for question answering over a curated knowledge base of Carnegie Mellon University and Pittsburgh-related events. The system addresses the limitations of large language models by providing them with relevant external context at inference time, reducing hallucinations and improving factual accuracy.
Key Contributions
- Knowledge Base Construction: Built a corpus of over 144,000 text chunks by crawling and parsing 29 seed URLs, covering both web pages and PDF documents. Crawling ran in parallel, and documents were split into 1,000-character chunks with a 200-character overlap to preserve context across chunk boundaries.
- Multi-Strategy Retrieval: Designed and evaluated three retrieval approaches: sparse (BM25), dense (FAISS with SentenceTransformers), and hybrid fusion using a weighted score combination. The hybrid retriever performed best with an interpolation weight of α=0.25.
- Reader Architectures: Implemented two reading strategies: a concatenation-based reader that fuses retrieved context directly, and a BART-based decoder that summarizes retrieved passages, balancing information density against context length.
- Model Fine-tuning: Applied RAFT-inspired fine-tuning on LLaMA-3.2-1B using 464 manually validated QA pairs with distractor documents, improving the model's ability to distinguish relevant from irrelevant context.
- Comprehensive Evaluation: Achieved the best performance with LLaMA-3.2-3B + sparse retrieval (EM: 0.1589, F1: 0.3202), demonstrating that sparse retrieval excels in factual precision while larger models improve semantic fidelity. Conducted statistical significance testing and category-based analysis across factual, temporal, causal, and descriptive questions.
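The chunking scheme described under Knowledge Base Construction (1,000-character windows with a 200-character overlap) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `chunk_text` and the choice to slice on raw character offsets (rather than sentence or token boundaries) are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the sizes come from the write-up, the rest is assumed.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # each window starts `step` characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the default sizes, consecutive chunks share their last and first 200 characters, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.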
System Architecture
The RAG pipeline integrates three core components: (1) a retriever that selects relevant documents using BM25, dense embeddings, or hybrid fusion; (2) a reader that either concatenates or summarizes retrieved content; and (3) a generator (LLaMA variants) that produces precise answers conditioned on the retrieved context. The modular design enables systematic evaluation of different retrieval and generation strategies.
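As an illustration of the hybrid fusion step in the retriever, the sketch below interpolates sparse and dense scores with a weight α. Several details are assumptions rather than facts from the project: the min-max normalization, the convention that α weights the dense score and (1 − α) the sparse score, and the names `min_max` and `hybrid_scores`.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and embedding scores are comparable (assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    sparse: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.25,
) -> list[tuple[str, float]]:
    """Rank documents by alpha * dense + (1 - alpha) * sparse.

    Which retriever alpha weights is an assumption; swap the terms if the
    project used the opposite convention.
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        # A document missing from one retriever's results contributes 0 for that retriever.
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Under this convention, α=0.25 leans the fused ranking toward the sparse scores while still letting dense similarity break ties and surface paraphrased matches.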
Tools & Technologies
- Retrieval: BM25 (sparse), FAISS with SentenceTransformers (dense), Reciprocal Rank Fusion (hybrid)
- Embeddings: all-MiniLM-L6-v2 for semantic similarity
- Models: LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-8B, BART-base
- Web Scraping: BeautifulSoup, PyPDF2, ThreadPoolExecutor for parallel crawling
- Evaluation: Exact Match, F1, ROUGE-L, BLEU, BERTScore
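For reference, Exact Match and token-level F1 are commonly computed in the SQuAD style shown below. This is a sketch under the assumption that the project used similar answer normalization (lowercasing, stripping punctuation and English articles); the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because F1 gives partial credit for overlapping tokens, a verbose but correct answer can score well on F1 while scoring 0 on Exact Match, which is consistent with F1 being the higher of the two reported numbers.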
Key Findings
- Sparse retrieval (BM25) achieved the highest factual precision (EM: 0.1589) on entity-centric questions
- Dense retrieval provided richer contextual understanding but introduced semantic drift
- Hybrid fusion with α=0.25 offered the best balance between precision and recall
- Larger generators (LLaMA-3.2-3B) improved fluency and semantic alignment
- Fine-tuning on domain-specific QA pairs improved answer grounding
- The system exhibited minimal hallucination due to strong grounding in the curated knowledge base
Impact
Delivered a fully functional, modular RAG system demonstrating strong performance on domain-specific question answering. The project showcases expertise in information retrieval, large language models, system design, and rigorous experimental evaluation. The findings provide valuable insights into retrieval strategy selection based on question types and computational constraints.
Course: 11-711 Advanced Natural Language Processing, Carnegie Mellon University
Team: Prachi Goyal, Medha Hira, Raj Maheshwari