End-to-End RAG System | Prachi Goyal

Overview

Built a comprehensive Retrieval-Augmented Generation (RAG) system for question answering over a curated knowledge base of Carnegie Mellon University and Pittsburgh-related events. The system addresses the limitations of large language models by providing them with relevant external context at inference time, reducing hallucinations and improving factual accuracy.

Key Contributions

Knowledge Base Construction: Built a corpus of over 144,000 text chunks by crawling and parsing 29 seed URLs, covering both web pages and PDF documents. Crawling ran in parallel, and text was split into 1000-character chunks with a 200-character overlap so that context carries across chunk boundaries.
Multi-Strategy Retrieval: Designed and evaluated three retrieval approaches: sparse (BM25), dense (FAISS with SentenceTransformers), and hybrid fusion using a weighted score combination. The hybrid retriever performed best with an interpolation weight of α=0.25.
Reader Architectures: Implemented two reading strategies, a concatenation-based reader that passes retrieved passages to the generator directly and a BART-based reader that summarizes them, trading information density against context length.
Model Fine-tuning: Applied RAFT-inspired fine-tuning on LLaMA-3.2-1B using 464 manually validated QA pairs with distractor documents, improving the model's ability to distinguish relevant from irrelevant context.
Comprehensive Evaluation: The best configuration was LLaMA-3.2-3B with sparse retrieval (EM: 0.1589, F1: 0.3202), indicating that sparse retrieval excels in factual precision while larger models improve semantic fidelity. Ran statistical significance tests and a category-based analysis across factual, temporal, causal, and descriptive questions.
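The chunking scheme named above (1000-character chunks with a 200-character overlap) can be sketched roughly as follows. This is an illustrative sketch only: the function name `chunk_text` and the plain sliding-window logic are assumptions, not the project's actual code, which may also respect sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: consecutive chunks share `overlap` characters so
    that a sentence cut at one boundary still appears whole in a neighbor.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # otherwise the last chunk would be a redundant tail of this one.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the default settings the window advances 800 characters at a time, so a 2,500-character document yields three chunks, the last one shorter than the others.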

System Architecture

The RAG pipeline integrates three core components: (1) a retriever that selects relevant documents using BM25, dense embeddings, or hybrid fusion; (2) a reader that either concatenates or summarizes retrieved content; and (3) a generator (LLaMA variants) that produces precise answers conditioned on the retrieved context. The modular design enables systematic evaluation of different retrieval and generation strategies.
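The hybrid fusion step in the retriever can be illustrated with a small sketch of weighted score interpolation. Several details here are assumptions rather than facts from the project: that α weights the dense score (with 1 − α on the sparse score), that scores are min-max normalized per query before mixing, and the function names themselves.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so the two are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied: no ranking signal from this retriever.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    sparse: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.25,
) -> list[tuple[str, float]]:
    """Fuse sparse and dense scores as alpha * dense + (1 - alpha) * sparse.

    Assumption: alpha is the weight on the dense score. A document missing
    from one retriever's candidate list contributes 0 from that side.
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Under this reading, α=0.25 leans the ranking toward BM25 while still letting a strong dense match lift a document that BM25 scored only moderately.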

Tools & Technologies

Retrieval: BM25 (sparse), FAISS with SentenceTransformers (dense), Reciprocal Rank Fusion (hybrid)
Embeddings: all-MiniLM-L6-v2 for semantic similarity
Models: LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-8B, BART-base
Web Scraping: BeautifulSoup, PyPDF2, ThreadPoolExecutor for parallel crawling
Evaluation: Exact Match, F1, ROUGE-L, BLEU, BERTScore
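For reference, Exact Match and token-level F1 are commonly computed as in the sketch below. The normalization (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention and is an assumption here; the project's own evaluation script may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because F1 gives partial credit for overlapping tokens, a verbose but correct answer can score well on F1 while still failing Exact Match, which is one reason the two metrics are reported together.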

Key Findings

Sparse retrieval (BM25) achieved the highest factual precision (EM: 0.1589) for entity-centric questions
Dense retrieval provided richer contextual understanding but introduced semantic drift
Hybrid fusion with α=0.25 offered the best balance between precision and recall
Larger generators (LLaMA-3.2-3B) improved fluency and semantic alignment
Fine-tuning on domain-specific QA pairs improved answer grounding
The system exhibited minimal hallucination due to strong grounding in the curated knowledge base

Impact

Delivered a fully functional, modular RAG system demonstrating strong performance on domain-specific question answering. The project showcases expertise in information retrieval, large language models, system design, and rigorous experimental evaluation. The findings provide valuable insights into retrieval strategy selection based on question types and computational constraints.

Course: 11-711 Advanced Natural Language Processing, Carnegie Mellon University

Team: Prachi Goyal, Medha Hira, Raj Maheshwari

GitHub Repository
Project Report