3 April
Ollama Series EP.4 — In EP.3 we learned to choose models and create Modelfiles; now we take it further by building AI that actually answers questions from your organization's documents. Not just general knowledge, but company manuals, policies, financial reports, or any documents you feed it. All powered by a technique called RAG (Retrieval-Augmented Generation) — and most importantly, all data stays on your machine, never leaving the organization.
In short — What is RAG and why use it?
- RAG = Retrieval-Augmented Generation — a technique that makes AI "search" documents before answering
- Regular AI answers from "general knowledge" it was trained on — it doesn't know your organization's specifics
- RAG enables AI to answer from your organization's actual documents — manuals, policies, reports, contracts, etc.
- Combined with Ollama = data never leaves your machine
- Key tools: Ollama + LangChain (or LlamaIndex) + ChromaDB
- This article includes complete code examples you can follow immediately
What is RAG? — Why Regular AI Can't Answer from Your Documents
Imagine asking ChatGPT: "According to our company's leave policy, how many days can probationary employees take?" — ChatGPT can't answer because it has never seen your company's policy. It only knows general information it was trained on.
RAG (Retrieval-Augmented Generation) solves this by adding a "search" step before AI answers:
| Step | Regular AI | AI + RAG |
|---|---|---|
| 1. Receive question | Receive question | Receive question |
| 2. Search for info | ❌ No search, answers from memory | ✅ Searches relevant documents |
| 3. Generate answer | Answers from training data (may hallucinate) | Answers from actual documents + cites sources |
| Accuracy | Low (for organization-specific data) | High (answers from actual documents) |
| Hallucination | High — may fabricate information | Low — has document references |
The RAG technique was first proposed by Patrick Lewis and a team from Facebook AI Research (FAIR) in 2020, in a paper titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401). Today, RAG is a standard technique used by virtually every major AI product: ChatGPT, Claude, and Gemini all use RAG in various forms for knowledge retrieval.
How Does RAG Work? — 5 Steps
Now that you understand the concept, let's look at how RAG works technically:
| Step | What It Does | Tool |
|---|---|---|
| 1. Load | Load documents (PDF, Word, TXT, CSV, HTML) | LangChain Document Loaders |
| 2. Split | Split documents into small chunks (500-1,000 characters) | RecursiveCharacterTextSplitter |
| 3. Embed | Convert text to Vectors (numbers) for "semantic" search | Ollama Embeddings (nomic-embed-text) |
| 4. Store | Store Vectors in a Vector Database | ChromaDB / FAISS / Qdrant |
| 5. Retrieve + Generate | When a question comes → find relevant Chunks → send to LLM to generate answer | Ollama LLM (qwen2.5 / llama3.1) |
What is Embedding? — The Heart of RAG
Embedding is the process of converting text into "coordinates in high-dimensional space" (Vectors) so computers can understand the "meaning" of text. For example, "sick leave" and "time off due to illness" would be converted into Vectors that are close together, even though they use different words.
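To make "vectors that are close together" concrete, here is a minimal sketch (not part of the article's pipeline code) that measures closeness with cosine similarity, the metric vector databases typically use. The three toy 3-dimensional vectors are made up purely for illustration; real embeddings from nomic-embed-text have 768 dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: ~1.0 = same direction ("same meaning"), near 0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three phrases
sick_leave    = [0.9, 0.1, 0.2]  # "sick leave"
illness_off   = [0.8, 0.2, 0.3]  # "time off due to illness"
vacation_plan = [0.1, 0.9, 0.1]  # "annual vacation booking"

print(cosine_similarity(sick_leave, illness_off))    # high: similar meaning
print(cosine_similarity(sick_leave, vacation_plan))  # low: different meaning
```

Even though "sick leave" and "time off due to illness" share no words, their vectors point in nearly the same direction, which is exactly what makes semantic search possible.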
Ollama has built-in Embedding Models. The most popular is nomic-embed-text — small (274 MB) but high quality, supports 8,192 Tokens per Chunk, and most importantly, runs entirely locally without sending any data outside.
# Download Embedding Model
ollama pull nomic-embed-text
# Test Embedding
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "Company leave policy"
}'
# Returns a 768-dimensional number array, e.g. [0.123, -0.456, 0.789, ...]
What is a Vector Database?
A Vector Database is a database designed to store and search vectors. Instead of keyword matching (as in an ordinary PostgreSQL query), it searches by "semantic similarity". Popular choices with Ollama:
| Vector DB | Type | Highlights | Best For |
|---|---|---|---|
| ChromaDB | Open-source, embeddable | Easiest to install (pip install chromadb) | Getting started, prototypes, small orgs |
| FAISS | Library by Meta | Very fast, handles millions of Vectors | Large datasets, need speed |
| Qdrant | Open-source, Docker | Has REST API, complex filtering | Production, medium-large orgs |
| pgvector | PostgreSQL Extension | Works with existing PostgreSQL | Orgs already using PostgreSQL |
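Under the hood, every Vector DB in the table does a version of the same thing: rank stored vectors by similarity to the query vector and return the top k. Here is a brute-force sketch in plain Python (the chunks and 3-dimensional vectors are made up for illustration; real stores hold 768-dimensional embeddings and use smarter indexing):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A tiny in-memory "vector store": (text chunk, embedding) pairs
store = [
    ("Employees get 30 days of sick leave per year",    [0.9, 0.1, 0.1]),
    ("The office closes at 18:00 on Fridays",           [0.1, 0.9, 0.2]),
    ("A medical certificate is required after 3 days",  [0.7, 0.3, 0.3]),
]

def retrieve(query_vector, k=2):
    # Rank every chunk by similarity to the query and return the top k --
    # conceptually what ChromaDB / FAISS / Qdrant do, minus the indexing
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

query = [0.85, 0.15, 0.15]  # pretend embedding of "how much sick leave do I get?"
print(retrieve(query))      # the two sick-leave chunks, not the office-hours one
```

The difference between the products in the table is mostly how they index vectors so this ranking stays fast at millions of entries.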
Hands-on — Build RAG with Ollama + LangChain + ChromaDB
Let's build a real RAG pipeline — this example feeds a PDF document to the AI and asks questions about it:
Step 1: Install Dependencies
# Install Python packages
pip install langchain langchain-community langchain-ollama
pip install chromadb
pip install pypdf
# Download Models in Ollama
ollama pull qwen2.5
ollama pull nomic-embed-text
Step 2: Write RAG Pipeline
# file: rag_demo.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# 1. Load - Load PDF document
loader = PyPDFLoader("company-policy.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# 2. Split - Split into Chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Size of each Chunk (characters)
chunk_overlap=200, # Overlap 200 chars (to preserve context)
separators=["\n\n", "\n", ".", " "]
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} Chunks")
# 3. Embed + Store - Convert to Vector and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db" # Persist to disk
)
print("Vector Database created")
# 4. Create RAG Chain
llm = ChatOllama(model="qwen2.5", temperature=0.2)
prompt_template = PromptTemplate(
template="""Based on the following information, answer the question:
Information:
{context}
Question: {question}
Answer (answer only from the provided information. If not found, say so):""",
input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(
search_kwargs={"k": 4} # Retrieve 4 most relevant Chunks
),
chain_type_kwargs={"prompt": prompt_template},
return_source_documents=True
)
# 5. Ask a question!
result = qa_chain.invoke({
"query": "How many sick leave days can probationary employees take?"
})
print("\n=== Answer ===")
print(result["result"])
print("\n=== Sources ===")
for doc in result["source_documents"]:
print(f"- Page {doc.metadata.get('page', '?')}: {doc.page_content[:100]}...")
Step 3: Run
python rag_demo.py
# Example output:
# Loaded 45 pages
# Split into 128 Chunks
# Vector Database created
#
# === Answer ===
# According to company policy, Section 3, Article 3.2
# Probationary employees are entitled to no more than 15 working days of sick leave per year
# Must notify supervisor on the day of leave
# Sick leave of 3+ consecutive days requires a medical certificate
#
# === Sources ===
# - Page 12: Section 3 Leave 3.2 Sick Leave Probationary employees...
Where does data stay?
- Original documents → on your machine
- Embeddings (vectors) → stored in ChromaDB on your machine (folder ./chroma_db)
- LLM (qwen2.5) → runs on Ollama on your machine
- Nothing leaves your machine throughout the entire process — suitable even for sensitive, enterprise-grade data
Supports Multiple Document Formats
LangChain has Document Loaders for many file types — not just PDF:
| File Type | Loader | Requires |
|---|---|---|
| PDF (.pdf) | PyPDFLoader | pip install pypdf |
| Word (.docx) | Docx2txtLoader | pip install docx2txt |
| Excel (.xlsx) | UnstructuredExcelLoader | pip install unstructured openpyxl |
| CSV (.csv) | CSVLoader | Included with LangChain |
| Text (.txt) | TextLoader | Included with LangChain |
| HTML (.html) | BSHTMLLoader | pip install beautifulsoup4 |
| Entire folder | DirectoryLoader | Included with LangChain |
Shortcut — Use Open WebUI for RAG Without Coding
For those who don't want to write Python — Open WebUI has built-in RAG features. Just drag and drop files into the chat:
- Install Open WebUI as shown in EP.2
- Open a new Chat and select a Model (e.g., qwen2.5)
- Click the 📎 (attach file) icon and select a PDF, TXT, or DOCX file
- Open WebUI will automatically process the document into Chunks + Embeddings
- Ask questions — AI will answer from the attached document with source references
This method is perfect for non-developer users — such as accounting teams who want to ask AI about TFRS standards or HR teams who need to ask about company policies
RAG for ERP — Real Use Cases
For organizations already running ERP systems, RAG opens many possibilities:
| Use Case | Documents Fed | Example Questions |
|---|---|---|
| Company Policy Q&A | Company policy, employee handbook | "How many maternity leave days?", "How is overtime calculated?" |
| Help with Accounting | TFRS standards, Chart of Accounts | "How to record fixed assets per IFRS?", "What's in chart of accounts category 5?" |
| Report Summarization | Financial reports, budgets | "Summarize over-budget expenses this quarter", "What's the trend of manufacturing costs?" |
| ERP User Manual | ERP manuals, SOPs | "How to create a purchase order in the system?", "Monthly closing steps?" |
| Contract Review | Employment contracts, purchase agreements | "What are contract termination conditions?", "What's the penalty for late delivery?" |
| Preserve Institutional Knowledge | Meeting notes, Best Practices | "How did we resolve the stock mismatch issue?", "What was the latest ISO meeting conclusion?" |
Key Techniques — Making RAG Accurate
- Optimal Chunk Size: Use 500-1,000 characters — too small and AI loses context, too large and AI gets too much noise. Set overlap to 100-200 characters to avoid losing data at chunk boundaries.
- Choose k (number of retrieved Chunks) wisely: Start with k=3-5 — too few may miss important data, too many sends noise that confuses the AI.
- Use good prompts: Tell AI clearly to "answer only from provided information; if not found, say so" — reduces Hallucination significantly.
- Clean documents first: Remove repeated headers/footers, blank pages, watermarks — clean data gives better results.
- Test with known answers: Start by asking questions you already know the answer to, to verify RAG retrieves correctly.
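The chunk-size and overlap advice above can be illustrated with a minimal character-based splitter. Unlike RecursiveCharacterTextSplitter, this toy version ignores separators; it only shows how overlap repeats the tail of each chunk at the start of the next, so text cut at a boundary still appears whole somewhere:

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    # Slide a window of chunk_size characters, stepping forward by
    # (chunk_size - overlap) so each chunk repeats the last `overlap`
    # characters of the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(2500))  # stand-in for a 2,500-char document
chunks = split_with_overlap(doc, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With chunk_size=1000 and overlap=200, the last 200 characters of chunk 1 are exactly the first 200 characters of chunk 2 — that repetition is the "context preservation" the bullet above refers to.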
RAG Limitations You Must Know:
- Not Fine-tuning: RAG doesn't "retrain" the AI — it just provides additional information at answer time. Bad documents = bad answers.
- Depends on document quality: Poorly OCR'd scans, complex Excel tables may give poor results.
- Context Window is limited: Even with retrieved Chunks, LLMs have limited Context Windows (4K-128K Tokens). Sending too much data causes errors.
- Thai language may need better models: Some embedding models don't handle Thai well — use nomic-embed-text or mxbai-embed-large for better results.
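To guard against the context-window limitation above, a rough pre-flight check helps before sending retrieved chunks to the LLM. The "~4 characters per token" heuristic below is a common approximation for English text, not an exact tokenizer count (Thai tokenizes quite differently), and the function names are mine, not part of any library:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    # Only a sanity check, not a real tokenizer.
    return len(text) // 4

def fits_context(chunks, question, context_window=8192, answer_reserve=1024):
    # Reserve room for the model's answer; fail early instead of
    # silently truncating the prompt at generation time.
    used = sum(estimate_tokens(c) for c in chunks) + estimate_tokens(question)
    return used + answer_reserve <= context_window

print(fits_context(["x" * 1000] * 4, "How many sick leave days?"))  # True: k=4 fits easily
print(fits_context(["x" * 40000], "Same question"))                 # False: one huge chunk overflows
```

If this check fails, lower k or shrink chunk_size rather than hoping the model copes.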
Saeree ERP + RAG:
Saeree ERP is developing an AI Assistant using RAG techniques to let users query ERP data with natural language — such as "What were last month's sales?" or "How many purchase orders are pending approval?" Interested? Consult our team for free
Ollama Series — Read More
Ollama Series — 6 Episodes, Complete Local AI Guide:
- EP.1: What Is Ollama? — Run AI on Your Own Machine
- EP.2: Install Ollama on Every OS — macOS / Windows / Linux
- EP.3: Using Ollama for Real — Choosing Models, Writing Prompts, and Creating Modelfiles
- EP.4: Ollama + RAG — Build AI That Answers from Your Documents (this article)
- EP.5: Ollama API — Connect AI to Your Apps and Enterprise Systems
- EP.6: Secure Self-Hosted AI — Security & Best Practices
"RAG transforms AI from 'knows everything but not about you' into 'an expert on your organization's documents' — and with Ollama, everything stays within your own walls."
- Saeree ERP Team



