Simple Retrieval-Augmented Generation (RAG) System in Python

Despite their impressive performance on a variety of tasks, large language models (LLMs) still have significant drawbacks, especially in domain-specific or knowledge-intensive scenarios. A major difficulty is their tendency to produce “hallucinations”: believable but factually erroneous answers, particularly for questions that require current knowledge or fall outside their training data.

Retrieval-Augmented Generation (RAG), a powerful approach that augments LLMs with relevant external data, has emerged to address these issues. Using semantic similarity, RAG retrieves relevant document chunks from an external knowledge base, letting the model ground its responses in up-to-date, accurate information. By drawing on outside knowledge sources, RAG reduces factually incorrect output and improves the overall dependability of LLM responses. This has driven RAG’s wide adoption, making it a crucial technique for building chatbots and enhancing the usefulness of LLMs in real-world contexts.

What is RAG?

Consider a simple chatbot system without RAG.

While such chatbots can respond to common questions based on their training data, they often lack access to up-to-date or domain-specific knowledge.
For example, asking a chatbot like ChatGPT, “What is my mother’s name?”, won’t yield an answer—because it doesn’t have access to external or personal data.

Retrieval-Augmented Generation (RAG) is a powerful hybrid architecture designed to overcome this limitation. RAG models can not only generate fluent text but also ground their responses in real-world, current information. The retrieval module in a RAG system uses dense vector representations to search and identify relevant documents from large datasets, such as Wikipedia or private databases. These retrieved documents are then passed to the generation module, often a transformer-based language model, to generate responses that are both contextually relevant and knowledge-grounded.

Core Elements of RAG Systems

Overview of RAG Models

Retrieval-Augmented Generation (RAG) is an advanced hybrid model architecture that enhances natural language generation (NLG) by integrating external retrieval mechanisms. Unlike traditional large language models (LLMs) such as GPT-3 or BERT—which rely solely on their pre-trained internal knowledge—RAG models can access and utilize up-to-date information from external sources.

One key limitation of conventional LLMs is their tendency to hallucinate—that is, to produce fluent but factually incorrect responses. Additionally, updating these models requires extensive retraining, which is impractical for tasks needing current or domain-specific data, such as open-domain question answering and fact verification.

RAG models address these challenges by incorporating two main components (a minimal sketch of how they fit together follows the list):

  1. Retriever – Identifies and fetches relevant documents from a corpus using techniques like Dense Passage Retrieval (DPR) or traditional methods such as BM25.
  2. Generator – Synthesizes the retrieved documents into a coherent, contextually accurate response using a generative model (typically a transformer-based LLM).
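
Conceptually, the two components compose into a single pipeline: retrieve the top-k documents for a query, then condition the generator on them. Below is a minimal sketch of that flow; retrieve() and generate() are hypothetical stand-ins for a real retriever and a real LLM call.

# Minimal sketch of the retrieve-then-generate flow. retrieve() and generate()
# are hypothetical stand-ins, not a real retriever or LLM.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy scoring: rank documents by word overlap with the query.
    q_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q_words & set(d.lower().split())))
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    # A real system would send this prompt to an LLM; here we only build it.
    return f"Answer to {query!r}, grounded in: {' | '.join(context)}"

corpus = [
    "RAG retrieves relevant documents before answering.",
    "LLMs generate fluent text from a prompt.",
    "BM25 is a lexical ranking function.",
]
print(generate("What does RAG retrieve?", retrieve("What does RAG retrieve?", corpus)))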

Retriever Mechanisms in RAG Systems

The retriever component in RAG systems plays a crucial role in identifying and fetching relevant documents from an external knowledge corpus. Its effectiveness directly influences the factual accuracy and relevance of the model’s generated responses.

Retrieval mechanisms vary in complexity, ranging from traditional sparse techniques to dense and hybrid approaches (a sketch combining two of them follows the list):

  • BM25 (Best Matching 25) – A traditional ranking function based on term frequency and inverse document frequency. It is fast and effective for exact keyword matching but struggles with semantic variation.
  • Dense Passage Retrieval (DPR) – Uses bi-encoders to map both questions and documents into a shared embedding space. It enables retrieval based on semantic meaning rather than exact keyword match, which improves relevance in many use cases.
  • Hybrid Retrieval – Combines BM25 and dense retrieval to leverage both lexical and semantic similarities for improved coverage and performance.
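
As an illustration of the hybrid idea, the sketch below mixes BM25 scores with dense cosine similarities. It assumes the rank_bm25 package (not listed in the tutorial’s prerequisites) and reuses OllamaEmbeddings from the stack used later; the mixing weight alpha is an arbitrary illustrative choice.

import numpy as np
from rank_bm25 import BM25Okapi
from langchain_community.embeddings import OllamaEmbeddings

corpus = [
    "RAG combines retrieval with generation.",
    "BM25 ranks documents by keyword overlap.",
    "Dense retrieval matches by semantic meaning.",
]
query = "how does semantic search work"

# Lexical scores from BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in corpus])
lexical = np.array(bm25.get_scores(query.lower().split()))

# Dense scores: cosine similarity between query and document embeddings.
emb = OllamaEmbeddings(model="llama3")
doc_vecs = np.array(emb.embed_documents(corpus))
q_vec = np.array(emb.embed_query(query))
dense = (doc_vecs @ q_vec) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))

# Hybrid: min-max normalize each signal, then mix with a tunable weight alpha.
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # illustrative; tune per task
hybrid = alpha * minmax(lexical) + (1 - alpha) * minmax(dense)
print("Best match:", corpus[int(hybrid.argmax())])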

Generator Mechanisms in RAG Systems

In Retrieval-Augmented Generation (RAG) systems, the generator mechanism is responsible for producing the final response by combining the retrieved information with the user’s query. Once the retriever has identified the most relevant documents, the generator synthesizes this information into a coherent and contextually accurate output.

The Large Language Model (LLM) acts as the core of the generator, ensuring the generated text is fluent, meaningful, and aligned with the retrieved data. By grounding its responses in external knowledge, the generator reduces hallucinations and delivers answers that are both contextually relevant and factually supported.
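
As a concrete illustration, here is a minimal sketch of that grounding step using the same Ollama stack as the tutorial below. The prompt wording is illustrative, not a specific library template.

from langchain_community.llms import Ollama

llm = Ollama(model="llama3")

def generate_grounded(query: str, retrieved_docs: list[str]) -> str:
    # Concatenate retrieved chunks into the prompt so the answer stays
    # grounded in the supplied context.
    context = "\n\n".join(retrieved_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm.invoke(prompt)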

Build a Simple RAG System in Python Using Ollama

In this tutorial, we’ll build a minimal Retrieval-Augmented Generation (RAG) system using Python and Ollama, a local Large Language Model (LLM) runtime. No API keys or cloud services needed.

Prerequisites

  • Python 3.9+
  • Ollama installed and running
  • Required Python packages: langchain, chromadb, langchain-community
Set up a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux
Install Required Packages
pip install langchain chromadb langchain-community
Install and Start Ollama
ollama pull llama3
ollama run llama3

The pull command downloads the model, and run opens an interactive session to verify it works (type /bye to exit). The Ollama background service must be running when you execute the Python code below.
Load the Document
from langchain_community.document_loaders import TextLoader

loader = TextLoader("example.txt", encoding="utf-8")
documents = loader.load()
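
Optionally, split the document into smaller chunks before embedding so retrieval returns focused passages rather than the whole file. This step is not in the original walkthrough, and the chunk sizes are illustrative defaults.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split into overlapping ~500-character chunks so each embedding covers a
# focused passage.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(documents)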
Generate Embeddings Using Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embedding = OllamaEmbeddings(model="llama3")
vectorstore = Chroma.from_documents(documents, embedding)
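
A chat model like llama3 can produce embeddings, but a dedicated embedding model is usually faster and better suited for retrieval. If you prefer that route, nomic-embed-text is one such model available through Ollama (pull it first with ollama pull nomic-embed-text):

# Alternative: a dedicated embedding model instead of the chat model.
embedding = OllamaEmbeddings(model="nomic-embed-text")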
Create a Retriever
retriever = vectorstore.as_retriever()
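
By default the retriever returns the top few matches per query; you can control how many chunks come back. The value below is an illustrative override, not from the original tutorial.

# Return the 3 most similar chunks per query instead of the default.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})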
Load the Local LLM
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
Create the RAG Chain
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
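
At this point you can already run a one-off question through the chain to sanity-check the pipeline (the question text is illustrative):

response = qa_chain.run("What is this document about?")
print(response)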

To run the complete example, save the full script shown below to a file named demo.py and run the following command:

python demo.py

You can now ask the chatbot questions, and it will generate responses based on the retrieved knowledge from the dataset.

Below is a full Python implementation showing how to load a text file, generate embeddings, perform retrieval, and query using a Retrieval-Augmented Generation (RAG) pipeline with Ollama.

from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

# Step 1: Load text file (must be UTF-8 encoded)
loader = TextLoader("example.txt", encoding="utf-8")
documents = loader.load()

# Step 2: Create embeddings with Ollama
embedding = OllamaEmbeddings(model="llama3.2")
vectorstore = Chroma.from_documents(documents, embedding)

# Step 3: Convert to retriever
retriever = vectorstore.as_retriever()

# Step 4: Load LLM
llm = Ollama(model="llama3.2")

# Step 5: Create Retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Step 6: Ask a question
while True:
    query = input("Ask a question (or type 'exit' to quit): ")
    if query.lower() in ["exit", "quit"]:
        print("Goodbye!")
        break

    response = qa_chain.run(query)
    print("A:", response)

Retrieval-Augmented Generation Models

RAG models are not limited to text; they have been extended to other modalities to enhance their capabilities in diverse applications:

  • Text-Based RAG Models: These are the most mature and widely adopted. They rely entirely on textual data for both retrieval and generation. Common use cases include open-domain question answering, summarization, and chatbots.
  • Audio-Based RAG Models: These models adapt RAG principles to the audio domain, supporting tasks like speech recognition, audio summarization, and voice-driven conversational agents.
  • Video-Based RAG Models: These models integrate visual and textual components to tackle video-centric tasks, including video understanding, captioning, and scene-specific information retrieval.
  • Multimodal RAG Models: These models combine multiple modalities—such as text, audio, video, and images—to offer a comprehensive understanding and generation capability across various data types. This makes them particularly effective for complex real-world tasks where information is not limited to a single format.

Applications of RAG Models

RAG models are increasingly being adopted in fields where accuracy, relevance, and context are essential. Some prominent applications include:

  • Open-Domain Question Answering: RAG systems can generate accurate answers by retrieving and referencing relevant content across a wide knowledge base.
  • Customer Support Chatbots: By retrieving specific product information or documentation, RAG-powered chatbots provide more helpful, accurate responses.
  • Medical Diagnosis Systems: RAG can enhance clinical decision-making by retrieving recent research findings, medical records, or patient history to assist in diagnostics.
  • Legal Advisory Tools: These systems retrieve case law, legal texts, or policy documents to help generate contextually appropriate and accurate legal advice.
  • Personalized Recommendation Engines: By retrieving past user interactions or preferences, RAG models can generate more relevant and personalized content or suggestions.

Conclusion

By incorporating outside knowledge sources, Retrieval-Augmented Generation (RAG) has been essential in improving the accuracy and dependability of Large Language Models (LLMs). RAG has demonstrated its adaptability from its earliest versions to more sophisticated implementations, particularly in specialized fields such as low-resource language, legal, and medical applications.

Despite these developments, there are still significant obstacles to overcome. Managing complicated domain-specific contexts, integrating unstructured or ambiguous data, and meeting the high processing needs of complex retrieval procedures remain challenges for RAG systems. These restrictions limit RAG’s wider usefulness in dynamic, real-world situations.

To address these challenges, future research must focus on improving retrieval techniques, enhancing context management, and ensuring scalability. Bridging these gaps could pave the way for the next generation of RAG models—more robust, efficient, and adaptable across diverse domains. Such advancements would significantly expand the potential of retrieval-augmented AI systems and their real-world impact.
