3 April
Ollama Series EP.4 — In EP.3 we learned to choose models and create Modelfiles; now we take it further by building AI that actually answers questions from your organization's documents. Not just general knowledge, but company manuals, policies, financial reports, or any documents you feed it. All powered by a technique called RAG (Retrieval-Augmented Generation) — and most importantly, all data stays on your machine, never leaving the organization.
In short — What is RAG and why use it?
- RAG = Retrieval-Augmented Generation — a technique that makes AI "search" documents before answering
- Regular AI answers from "general knowledge" it was trained on — it doesn't know your organization's specifics
- RAG enables AI to answer from your organization's actual documents — manuals, policies, reports, contracts, etc.
- Combined with Ollama = data never leaves your machine
- Key tools: Ollama + LangChain (or LlamaIndex) + ChromaDB
- This article includes complete code examples you can follow immediately
What is RAG? — Why Regular AI Can't Answer from Your Documents
Imagine asking ChatGPT: "According to our company's leave policy, how many days can probationary employees take?" — ChatGPT can't answer because it has never seen your company's policy. It only knows general information it was trained on.
RAG (Retrieval-Augmented Generation) solves this by adding a "search" step before AI answers:
| Step | Regular AI | AI + RAG |
|---|---|---|
| 1. Receive question | Receive question | Receive question |
| 2. Search for info | ❌ No search, answers from memory | ✅ Searches relevant documents |
| 3. Generate answer | Answers from training data (may hallucinate) | Answers from actual documents + cites sources |
| Accuracy | Low (for organization-specific data) | High (answers from actual documents) |
| Hallucination | High — may fabricate information | Low — has document references |
The RAG technique was first proposed by Patrick Lewis and a team from Facebook AI Research (FAIR) in 2020, in a paper titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401). Today, RAG is a standard technique used by virtually every major AI product: ChatGPT, Claude, and Gemini all use RAG in various forms for knowledge retrieval.
How Does RAG Work? — 5 Steps
Now that you understand the concept, let's look at how RAG works technically:
| Step | What It Does | Tool |
|---|---|---|
| 1. Load | Load documents (PDF, Word, TXT, CSV, HTML) | LangChain Document Loaders |
| 2. Split | Split documents into small chunks (500-1,000 characters) | RecursiveCharacterTextSplitter |
| 3. Embed | Convert text to Vectors (numbers) for "semantic" search | Ollama Embeddings (nomic-embed-text) |
| 4. Store | Store Vectors in a Vector Database | ChromaDB / FAISS / Qdrant |
| 5. Retrieve + Generate | When a question comes → find relevant Chunks → send to LLM to generate answer | Ollama LLM (qwen2.5 / llama3.1) |
What is Embedding? — The Heart of RAG
Embedding is the process of converting text into "coordinates in high-dimensional space" (Vectors) so computers can understand the "meaning" of text. For example, "sick leave" and "time off due to illness" would be converted into Vectors that are close together, even though they use different words.
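To make "vectors that are close together" concrete, here is a minimal sketch (not part of the article's pipeline code) that measures closeness with cosine similarity, the metric vector databases typically use. The three toy 3-dimensional vectors are made up purely for illustration; real embeddings from nomic-embed-text have 768 dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: ~1.0 = same direction ("same meaning"), near 0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three phrases
sick_leave    = [0.9, 0.1, 0.2]  # "sick leave"
illness_off   = [0.8, 0.2, 0.3]  # "time off due to illness"
vacation_plan = [0.1, 0.9, 0.1]  # "annual vacation booking"

print(cosine_similarity(sick_leave, illness_off))    # high: similar meaning
print(cosine_similarity(sick_leave, vacation_plan))  # low: different meaning
```

Even though "sick leave" and "time off due to illness" share no words, their vectors point in nearly the same direction, which is exactly what makes semantic search possible.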
Ollama has built-in Embedding Models. The most popular is nomic-embed-text — small (274 MB) but high quality, supports 8,192 Tokens per Chunk, and most importantly, runs entirely locally without sending any data outside.
# Download Embedding Model
ollama pull nomic-embed-text
# Test Embedding
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "Company leave policy"
}'
# Returns a 768-dimensional number array, e.g. [0.123, -0.456, 0.789, ...]
What is a Vector Database?
A Vector Database is a database designed to store and search vectors. Instead of keyword matching (as in an ordinary PostgreSQL query), it searches by "semantic similarity". Popular choices with Ollama:
| Vector DB | Type | Highlights | Best For |
|---|---|---|---|
| ChromaDB | Open-source, embeddable | Easiest to install (pip install chromadb) | Getting started, prototypes, small orgs |
| FAISS | Library by Meta | Very fast, handles millions of Vectors | Large datasets, need speed |
| Qdrant | Open-source, Docker | Has REST API, complex filtering | Production, medium-large orgs |
| pgvector | PostgreSQL Extension | Works with existing PostgreSQL | Orgs already using PostgreSQL |
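Under the hood, every Vector DB in the table does a version of the same thing: rank stored vectors by similarity to the query vector and return the top k. Here is a brute-force sketch in plain Python (the chunks and 3-dimensional vectors are made up for illustration; real stores hold 768-dimensional embeddings and use smarter indexing):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A tiny in-memory "vector store": (text chunk, embedding) pairs
store = [
    ("Employees get 30 days of sick leave per year",    [0.9, 0.1, 0.1]),
    ("The office closes at 18:00 on Fridays",           [0.1, 0.9, 0.2]),
    ("A medical certificate is required after 3 days",  [0.7, 0.3, 0.3]),
]

def retrieve(query_vector, k=2):
    # Rank every chunk by similarity to the query and return the top k --
    # conceptually what ChromaDB / FAISS / Qdrant do, minus the indexing
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

query = [0.85, 0.15, 0.15]  # pretend embedding of "how much sick leave do I get?"
print(retrieve(query))      # the two sick-leave chunks, not the office-hours one
```

The difference between the products in the table is mostly how they index vectors so this ranking stays fast at millions of entries.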
Hands-on — Build RAG with Ollama + LangChain + ChromaDB
Let's build a real RAG pipeline — this example feeds a PDF document to the AI and asks questions about it:
Step 1: Install Dependencies
# Install Python packages
pip install langchain langchain-community langchain-ollama
pip install chromadb
pip install pypdf
# Download Models in Ollama
ollama pull qwen2.5
ollama pull nomic-embed-text
Step 2: Write RAG Pipeline
# file: rag_demo.py
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# 1. Load - Load PDF document
loader = PyPDFLoader("company-policy.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# 2. Split - Split into Chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Size of each Chunk (characters)
chunk_overlap=200, # Overlap 200 chars (to preserve context)
separators=["\n\n", "\n", ".", " "]
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} Chunks")
# 3. Embed + Store - Convert to Vector and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db" # Persist to disk
)
print("Vector Database created")
# 4. Create RAG Chain
llm = ChatOllama(model="qwen2.5", temperature=0.2)
prompt_template = PromptTemplate(
template="""Based on the following information, answer the question:
Information:
{context}
Question: {question}
Answer (answer only from the provided information. If not found, say so):""",
input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(
search_kwargs={"k": 4} # Retrieve 4 most relevant Chunks
),
chain_type_kwargs={"prompt": prompt_template},
return_source_documents=True
)
# 5. Ask a question!
result = qa_chain.invoke({
"query": "How many sick leave days can probationary employees take?"
})
print("\n=== Answer ===")
print(result["result"])
print("\n=== Sources ===")
for doc in result["source_documents"]:
print(f"- Page {doc.metadata.get('page', '?')}: {doc.page_content[:100]}...")
Step 3: Run
python rag_demo.py
# Example output:
# Loaded 45 pages
# Split into 128 Chunks
# Vector Database created
#
# === Answer ===
# According to company policy, Section 3, Article 3.2
# Probationary employees are entitled to no more than 15 working days of sick leave per year
# Must notify supervisor on the day of leave
# Sick leave of 3+ consecutive days requires a medical certificate
#
# === Sources ===
# - Page 12: Section 3 Leave 3.2 Sick Leave Probationary employees...
Where does data stay?
- Original documents → on your machine
- Embeddings (vectors) → stored in ChromaDB on your machine (folder ./chroma_db)
- LLM (qwen2.5) → runs on Ollama on your machine
- Nothing leaves your machine throughout the entire process — suitable even for sensitive, enterprise-grade data
Supports Multiple Document Formats
LangChain has Document Loaders for many file types — not just PDF:
| File Type | Loader | Requires |
|---|---|---|
| PDF (.pdf) | PyPDFLoader | pip install pypdf |
| Word (.docx) | Docx2txtLoader | pip install docx2txt |
| Excel (.xlsx) | UnstructuredExcelLoader | pip install unstructured openpyxl |
| CSV (.csv) | CSVLoader | Included with LangChain |
| Text (.txt) | TextLoader | Included with LangChain |
| HTML (.html) | BSHTMLLoader | pip install beautifulsoup4 |
| Entire folder | DirectoryLoader | Included with LangChain |
Shortcut — Use Open WebUI for RAG Without Coding
For those who don't want to write Python — Open WebUI has built-in RAG features. Just drag and drop files into the chat:
- Install Open WebUI as shown in EP.2
- Open a new Chat and select a Model (e.g., qwen2.5)
- Click the 📎 (attach file) icon and select a PDF, TXT, or DOCX file
- Open WebUI will automatically process the document into Chunks + Embeddings
- Ask questions — AI will answer from the attached document with source references
This method is perfect for non-developer users — such as accounting teams who want to ask AI about TFRS standards or HR teams who need to ask about company policies
RAG for ERP — Real Use Cases
For organizations already running ERP systems, RAG opens many possibilities:
| Use Case | Documents Fed | Example Questions |
|---|---|---|
| Company Policy Q&A | Company policy, employee handbook | "How many maternity leave days?", "How is overtime calculated?" |
| Help with Accounting | TFRS standards, Chart of Accounts | "How to record fixed assets per IFRS?", "What's in chart of accounts category 5?" |
| Report Summarization | Financial reports, budgets | "Summarize over-budget expenses this quarter", "What's the trend of manufacturing costs?" |
| ERP User Manual | ERP manuals, SOPs | "How to create a purchase order in the system?", "Monthly closing steps?" |
| Contract Review | Employment contracts, purchase agreements | "What are contract termination conditions?", "What's the penalty for late delivery?" |
| Preserve Institutional Knowledge | Meeting notes, Best Practices | "How did we resolve the stock mismatch issue?", "What was the latest ISO meeting conclusion?" |
Key Techniques — Making RAG Accurate
- Optimal Chunk Size: Use 500-1,000 characters — too small and AI loses context, too large and AI gets too much noise. Set overlap to 100-200 characters to avoid losing data at chunk boundaries.
- Choose k (number of retrieved Chunks) wisely: Start with k=3-5 — too few may miss important data, too many sends noise that confuses the AI.
- Use good prompts: Tell AI clearly to "answer only from provided information; if not found, say so" — reduces Hallucination significantly.
- Clean documents first: Remove repeated headers/footers, blank pages, watermarks — clean data gives better results.
- Test with known answers: Start by asking questions you already know the answer to, to verify RAG retrieves correctly.
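The chunk-size and overlap advice above can be illustrated with a minimal character-based splitter. Unlike RecursiveCharacterTextSplitter, this toy version ignores separators; it only shows how overlap repeats the tail of each chunk at the start of the next, so text cut at a boundary still appears whole somewhere:

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    # Slide a window of chunk_size characters, stepping forward by
    # (chunk_size - overlap) so each chunk repeats the last `overlap`
    # characters of the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(2500))  # stand-in for a 2,500-char document
chunks = split_with_overlap(doc, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With chunk_size=1000 and overlap=200, the last 200 characters of chunk 1 are exactly the first 200 characters of chunk 2 — that repetition is the "context preservation" the bullet above refers to.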
RAG Limitations You Must Know:
- Not Fine-tuning: RAG doesn't "retrain" the AI — it just provides additional information at answer time. Bad documents = bad answers.
- Depends on document quality: Poorly OCR'd scans, complex Excel tables may give poor results.
- Context Window is limited: Even with retrieved Chunks, LLMs have limited Context Windows (4K-128K Tokens). Sending too much data causes errors.
- Thai language may need better models: Some embedding models don't handle Thai well — use nomic-embed-text or mxbai-embed-large for better results.
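To guard against the context-window limitation above, a rough pre-flight check helps before sending retrieved chunks to the LLM. The "~4 characters per token" heuristic below is a common approximation for English text, not an exact tokenizer count (Thai tokenizes quite differently), and the function names are mine, not part of any library:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text.
    # Only a sanity check, not a real tokenizer.
    return len(text) // 4

def fits_context(chunks, question, context_window=8192, answer_reserve=1024):
    # Reserve room for the model's answer; fail early instead of
    # silently truncating the prompt at generation time.
    used = sum(estimate_tokens(c) for c in chunks) + estimate_tokens(question)
    return used + answer_reserve <= context_window

print(fits_context(["x" * 1000] * 4, "How many sick leave days?"))  # True: k=4 fits easily
print(fits_context(["x" * 40000], "Same question"))                 # False: one huge chunk overflows
```

If this check fails, lower k or shrink chunk_size rather than hoping the model copes.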
Saeree ERP + RAG:
Saeree ERP is developing an AI Assistant using RAG techniques to let users query ERP data with natural language — such as "What were last month's sales?" or "How many purchase orders are pending approval?" Interested? Consult our team for free
Ollama Series — Read More
Ollama Series — 6 Episodes, Complete Local AI Guide:
- EP.1: What Is Ollama? — Run AI on Your Own Machine
- EP.2: Install Ollama on Every OS — macOS / Windows / Linux
- EP.3: Using Ollama for Real — Choosing Models, Writing Prompts, and Creating Modelfiles
- EP.4: Ollama + RAG — Build AI That Answers from Your Documents (this article)
- EP.5: Ollama API — Connect AI to Your Apps and Enterprise Systems
- EP.6: Secure Self-Hosted AI — Security & Best Practices
"RAG transforms AI from 'knows everything but not about you' into 'an expert on your organization's documents' — and with Ollama, everything stays within your own walls."
- Saeree ERP Team



