Domain 3: Implement RAG using OCI Generative AI Service (20%)
Domain 3 of the 1Z0-1127-25 OCI 2025 Generative AI Professional exam tests your ability to build Retrieval-Augmented Generation pipelines using OCI Generative AI, LangChain, and Oracle Database 23ai. This domain represents approximately 10 questions on the exam. It is the most hands-on domain -- expect questions that require understanding of specific classes, parameters, SQL syntax, and end-to-end pipeline architecture rather than abstract concepts.
1. RAG Architecture: End-to-End Workflow
What RAG Is and Why It Matters
Retrieval-Augmented Generation (RAG) combines a retrieval step with a generation step. Instead of relying solely on the LLM's training data (which may be outdated or incomplete, leading to hallucinated answers), RAG first retrieves relevant documents from an external knowledge base and injects them into the prompt as context. The LLM then generates a response grounded in those retrieved facts.
| Problem | How RAG Solves It |
|---|---|
| Hallucination | Grounds responses in retrieved factual documents |
| Stale knowledge | Knowledge base can be updated without retraining the model |
| Private data | Incorporates enterprise-specific data the LLM was never trained on |
| Cost | Far cheaper than fine-tuning a model on proprietary data |
| Auditability | Retrieved source documents can be cited and traced |
Exam trap: RAG vs. fine-tuning. RAG retrieves external context at inference time -- it does not modify model weights. Fine-tuning (covered in Domain 2) changes model weights through additional training. The exam will test whether you understand when to use each approach. RAG is preferred when you need up-to-date or private data without the cost and complexity of retraining. Fine-tuning is preferred when you need to change the model's style, tone, or specialized capability.
The RAG Pipeline (Five Stages)
```
Documents --> Load --> Split/Chunk --> Embed --> Store in Vector DB
                                                          |
User Query --> Embed Query --> Similarity Search <--------+
                                      |
                                      v
                           Retrieved Context --> Prompt + LLM --> Response
```
| Stage | Component | OCI/LangChain Tool |
|---|---|---|
| 1. Document Loading | Ingest raw data (PDF, text, web, CSV) | LangChain Document Loaders |
| 2. Text Splitting | Break documents into chunks | RecursiveCharacterTextSplitter, TokenTextSplitter |
| 3. Embedding | Convert chunks to vector representations | OCIGenAIEmbeddings with Cohere Embed models |
| 4. Vector Storage | Store and index embeddings | Oracle 23ai (OracleVS), FAISS, ChromaDB |
| 5. Retrieval + Generation | Search for relevant chunks, generate response | similarity_search(), ChatOCIGenAI, LCEL chains |
2. LangChain Integration with OCI Generative AI
Installation
```shell
pip install langchain-oci langchain-oracledb langchain-community oracledb
```
The langchain-oci package is the current recommended package (replaces the deprecated langchain-community OCI integration). Source: OCI LangChain Docs
Core OCI LangChain Classes
| Class | Package | Purpose |
|---|---|---|
| `ChatOCIGenAI` | `langchain_oci` | Chat/conversational model interface |
| `OCIGenAI` | `langchain_oci` | Text completion (non-chat) interface |
| `OCIGenAIEmbeddings` | `langchain_oci` | Embedding generation |
| `OracleVS` | `langchain_oracledb` | Oracle 23ai vector store |
ChatOCIGenAI Configuration
```python
from langchain_oci import ChatOCIGenAI

llm = ChatOCIGenAI(
    model_id="cohere.command-r-plus-08-2024",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1...",
    model_kwargs={"temperature": 0.3, "max_tokens": 1000},
)
```
Key constructor parameters:
| Parameter | Description |
|---|---|
| `model_id` | OCI model identifier (e.g., `cohere.command-r-plus-08-2024`, `meta.llama-3.3-70b-instruct`) |
| `service_endpoint` | Regional inference endpoint URL |
| `compartment_id` | OCI compartment OCID |
| `model_kwargs` | Dict with `temperature`, `max_tokens`, etc. |
| `auth_profile` | OCI config profile name (optional) |
| `is_stream` | Enable streaming responses (optional) |
Authentication methods (same as all OCI services): API Key (default), Session Token, Instance Principal, Resource Principal. Source: LangChain OCI Chat Docs
OCIGenAIEmbeddings Configuration
```python
from langchain_oci import OCIGenAIEmbeddings

embeddings = OCIGenAIEmbeddings(
    model_id="cohere.embed-english-v3.0",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1...",
)

# Single query embedding
query_vector = embeddings.embed_query("What is Oracle RAC?")

# Batch document embeddings
doc_vectors = embeddings.embed_documents(["Document 1 text", "Document 2 text"])
```
Exam trap: The embed_query() and embed_documents() methods use different Cohere input types internally. embed_query() uses input_type="search_query" and embed_documents() uses input_type="search_document". You must use the correct method for each purpose -- using embed_documents() for a query will produce suboptimal search results because the embedding model optimizes differently for each input type.
LangChain Expression Language (LCEL)
LCEL provides a declarative way to compose chains using the pipe (|) operator. This is the modern LangChain pattern replacing the older LLMChain and SequentialChain classes.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_oci import ChatOCIGenAI

prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
)
llm = ChatOCIGenAI(model_id="cohere.command-r-plus-08-2024", ...)
output_parser = StrOutputParser()

chain = prompt | llm | output_parser
result = chain.invoke({"context": "...", "question": "What is RAG?"})
```
The LCEL chain reads left to right: prompt template formats input, passes to LLM, output parser extracts the string response. Source: LangChain LCEL Docs
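Under the hood, the `|` operator works because LangChain runnables implement Python's `__or__`. The composition pattern can be illustrated in plain Python (a toy sketch of the idea, not LangChain's actual classes):

```python
class Runnable:
    """Minimal stand-in for a LangChain-style runnable: wraps a function
    and supports `|` composition, mimicking LCEL's left-to-right piping."""
    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (a | b) yields a new Runnable that applies a, then feeds b
        return Runnable(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.func(x)

# Toy "prompt", "llm", and "parser" stages
prompt = Runnable(lambda d: f"Q: {d['question']}")
llm = Runnable(lambda p: {"text": p.upper()})
parser = Runnable(lambda r: r["text"])

chain = prompt | llm | parser
print(chain.invoke({"question": "what is rag?"}))  # Q: WHAT IS RAG?
```

Each stage is just a function from input to output; piping builds one composed function, which is why LCEL chains read naturally left to right.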
3. Document Loading
LangChain Document Loaders convert external data sources into Document objects (containing page_content and metadata). Know these loaders:
| Loader | Source Type | Import |
|---|---|---|
| `PyPDFLoader` | PDF files | `langchain_community.document_loaders` |
| `TextLoader` | Plain text files | `langchain_community.document_loaders` |
| `CSVLoader` | CSV files (one row = one document) | `langchain_community.document_loaders` |
| `WebBaseLoader` | Web pages (uses requests + BeautifulSoup) | `langchain_community.document_loaders` |
| `DirectoryLoader` | All files in a directory | `langchain_community.document_loaders` |
| `UnstructuredFileLoader` | Multiple formats (PDF, HTML, DOCX, etc.) | `langchain_community.document_loaders` |
Exam trap: DirectoryLoader uses UnstructuredLoader by default, not TextLoader. It can handle mixed file types in a single directory.
4. Text Splitting and Chunking
After loading documents, you must split them into chunks small enough for embedding and retrieval. Chunk quality directly impacts retrieval quality.
RecursiveCharacterTextSplitter (Most Important)
The recommended default splitter for most RAG use cases. It splits text recursively using a hierarchy of separators: ["\n\n", "\n", " ", ""]. It tries the first separator first; if chunks are still too large, it falls back to the next separator.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)
```
TokenTextSplitter
Splits based on token count rather than character count. Important when working with models that have specific token limits.
```python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
```
Chunking Strategy Guidelines
| Parameter | Guidance | Impact |
|---|---|---|
| `chunk_size` | 400-512 tokens for most use cases | Too large: noise dilutes relevant content. Too small: loses context. |
| `chunk_overlap` | 10-20% of chunk size (e.g., 50-100 for 512) | Prevents important information from being split across chunk boundaries |
| Factoid Q&A | 256-512 tokens optimal | Short, focused answers need precise retrieval |
| Analytical Q&A | 1024+ tokens may be needed | Complex answers need broader context |
Exam trap: Increasing chunk size does not always improve retrieval quality. Larger chunks contain more noise and can dilute the relevance signal during similarity search. The right chunk size depends on the query type and document structure.
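The roles of `chunk_size` and `chunk_overlap` become concrete with a simplified character-window chunker (an illustrative sketch, not LangChain's actual recursive algorithm):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by
    chunk_overlap characters, so content near a boundary appears in
    both neighboring chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "Oracle Database 23ai adds AI Vector Search with a native VECTOR type."
chunks = chunk_text(text, chunk_size=30, chunk_overlap=10)
for c in chunks:
    print(repr(c))  # each chunk shares its first 10 chars with the previous one
```

The last 10 characters of each chunk reappear as the first 10 of the next, which is exactly how overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.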
5. Embedding Models in OCI Generative AI
OCI Generative AI provides Cohere embedding models. Know the model variants and their specifications.
OCI Embedding Model Specifications
| Model | Model ID | Dimensions | Input Type |
|---|---|---|---|
| Embed 4 (multimodal) | `cohere.embed-v4.0` | 1024 | Text + images |
| Embed English v3 | `cohere.embed-english-v3.0` | 1024 | Text only |
| Embed English Light v3 | `cohere.embed-english-light-v3.0` | 384 | Text only |
| Embed Multilingual v3 | `cohere.embed-multilingual-v3.0` | 1024 | Text only |
| Embed Multilingual Light v3 | `cohere.embed-multilingual-light-v3.0` | 384 | Text only |
| Embed English Image v3 | `cohere.embed-english-image-v3.0` | 1024 | Text + images |
| Embed Multilingual Image v3 | `cohere.embed-multilingual-image-v3.0` | 1024 | Text + images |
Source: OCI Embedding Models, Cohere Embed English 3
Key Embedding Constraints
| Constraint | Value |
|---|---|
| Max inputs per batch | 96 |
| Max tokens per input (text models) | 512 |
| Max tokens per input (image models) | 128,000 combined |
| Truncation options | START, END, NONE (default returns error) |
| Output format | Array of floating-point vectors |
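The 96-input batch limit means a large document set must be embedded in multiple requests. A minimal pure-Python sketch of the batching step (the helper name is illustrative; the limit comes from the table above):

```python
def batch_inputs(texts: list[str], max_batch: int = 96) -> list[list[str]]:
    """Split a list of texts into batches no larger than the OCI
    embedding limit of 96 inputs per request."""
    return [texts[i:i + max_batch] for i in range(0, len(texts), max_batch)]

texts = [f"chunk {i}" for i in range(200)]
batches = batch_inputs(texts)
print([len(b) for b in batches])  # [96, 96, 8]
```

Each batch would then be passed to a single `embed_documents()` call, keeping every request under the service limit.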
Cohere Input Types
Cohere Embed v3+ models require specifying an input_type. This is handled automatically by LangChain's embed_query() and embed_documents() methods.
| Input Type | When Used | LangChain Method |
|---|---|---|
| `search_document` | Embedding documents for storage in vector DB | `embed_documents()` |
| `search_query` | Embedding user queries for similarity search | `embed_query()` |
| `classification` | Text classification tasks | Manual API call |
| `clustering` | Text clustering tasks | Manual API call |
Exam trap: Full models produce 1024-dimensional vectors. Light models produce 384-dimensional vectors. Light models are faster and cheaper but less accurate. You cannot mix embeddings from different models in the same vector store -- the dimensions and semantic spaces will not align.
6. Vector Storage and Indexing
Vector Store Options
| Vector Store | Type | Best For |
|---|---|---|
| Oracle Database 23ai (`OracleVS`) | Production relational + vector DB | Enterprise RAG with existing Oracle infrastructure, combined relational + vector queries |
| FAISS | In-memory, local | Prototyping, small-to-medium datasets, no database infrastructure needed |
| ChromaDB | Embedded/client-server | Development, lightweight persistent storage |
Oracle 23ai Vector Store Integration
Oracle Database 23ai introduces the VECTOR data type and AI Vector Search. This is the exam's primary vector store.
```python
from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy
import oracledb

conn = oracledb.connect(user="username", password="password", dsn="host:port/service")

vector_store = OracleVS.from_documents(
    documents,
    embeddings,
    client=conn,
    table_name="KNOWLEDGE_BASE",
    distance_strategy=DistanceStrategy.COSINE,
)
```
Source: LangChain Oracle Vector Store
Oracle 23ai SQL Vector Operations
```sql
-- Create table with VECTOR column
CREATE TABLE docs (
  doc_id     INT,
  doc_text   CLOB,
  doc_vector VECTOR
);

-- Exact similarity search (top-K, brute force)
SELECT doc_id, doc_text
FROM docs
ORDER BY VECTOR_DISTANCE(doc_vector, :query_vector, COSINE)
FETCH EXACT FIRST 5 ROWS ONLY;

-- Approximate similarity search (uses vector index if available)
SELECT doc_id, doc_text
FROM docs
ORDER BY VECTOR_DISTANCE(doc_vector, :query_vector, COSINE)
FETCH FIRST 5 ROWS ONLY;
```
Source: Oracle AI Vector Search
Exam trap: The EXACT keyword in FETCH EXACT FIRST k ROWS ONLY forces exact (brute-force) search even when a vector index exists. Omitting EXACT allows the optimizer to use approximate search via vector indexes. On Autonomous Database Serverless (ADB-S), omitting EXACT automatically attempts approximate search if an index is available.
Vector Index Types
| Index Type | Organization | Best For | Key Parameters |
|---|---|---|---|
| HNSW | In-Memory Neighbor Graph | High search quality, fast queries, smaller datasets that fit in memory | neighbors (max edges per node), efConstruction (search width during build) |
| IVF | Neighbor Partitions | Large datasets, disk-based, balanced speed/quality | neighbor_part (number of partitions) |
```sql
-- HNSW index creation
CREATE VECTOR INDEX docs_hnsw_idx ON docs (doc_vector)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95
PARAMETERS (type HNSW, neighbors 40, efConstruction 500);

-- IVF index creation
CREATE VECTOR INDEX docs_ivf_idx ON docs (doc_vector)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 90;
```
Source: Oracle Vector Index Types
Exam trap: If you do not specify a DISTANCE metric in the index definition, the default is COSINE. The distance metric used at index creation must match the metric used during search queries -- otherwise, Oracle falls back to exact search, bypassing the index entirely.
LangChain Index Creation (Python)
```python
from langchain_oracledb.vectorstores import oraclevs

# HNSW with target accuracy
oraclevs.create_index(conn, vector_store, params={
    "idx_name": "hnsw_idx", "idx_type": "HNSW", "accuracy": 95, "parallel": 8
})

# IVF with target accuracy
oraclevs.create_index(conn, vector_store, params={
    "idx_name": "ivf_idx", "idx_type": "IVF", "accuracy": 90, "parallel": 16
})
```
7. Similarity Search and Retrieval
Distance Metrics
| Metric | VECTOR_DISTANCE Name | DistanceStrategy | Behavior |
|---|---|---|---|
| Cosine | `COSINE` | `DistanceStrategy.COSINE` | Measures angle between vectors (0 = identical, 1 = orthogonal). Default in Oracle 23ai. |
| Dot Product | `DOT` | `DistanceStrategy.DOT_PRODUCT` | Higher = more similar. Sensitive to vector magnitude. |
| Euclidean | `EUCLIDEAN` | `DistanceStrategy.EUCLIDEAN_DISTANCE` | Straight-line distance. Lower = more similar. |
| Euclidean Squared | `EUCLIDEAN_SQUARED` | N/A | Faster variant (skips square root). Same ordering as Euclidean. |
Exam trap: Always use the distance metric that matches what the embedding model was trained with. Cohere Embed v3 models are optimized for cosine similarity. Using dot product with a cosine-trained model degrades results.
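The differences between the metrics are easy to verify with small hand-computed vectors. A pure-Python sketch (illustrative, not Oracle's implementation):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def dot_product(a, b):
    """Higher = more similar; sensitive to vector magnitude."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance; lower = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [2.0, 0.0]   # same direction, different magnitude
print(cosine_distance(a, b))     # 0.0 -- angle is identical
print(dot_product(a, b))         # 2.0 -- magnitude inflates the score
print(euclidean(a, b))           # 1.0 -- magnitude also shows up here
```

The example makes the exam point concrete: cosine sees `a` and `b` as identical, while dot product and Euclidean distance both react to the magnitude difference.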
Search Methods in LangChain
```python
# Basic similarity search (top-k)
results = vector_store.similarity_search("What is Oracle RAC?", k=3)

# Similarity search with relevance scores
results = vector_store.similarity_search_with_score("What is Oracle RAC?", k=3)

# Maximum Marginal Relevance (MMR) -- balances relevance and diversity
results = vector_store.max_marginal_relevance_search(
    "What is Oracle RAC?",
    k=3,             # Number of results to return
    fetch_k=20,      # Candidates to consider before MMR reranking
    lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
)
```
Maximum Marginal Relevance (MMR) reduces redundancy in retrieved documents. Standard similarity search may return multiple chunks saying the same thing. MMR fetches a larger candidate set (fetch_k) then iteratively selects results that are both relevant to the query and diverse from each other. The lambda_mult parameter controls the trade-off: 0.0 maximizes diversity, 1.0 maximizes relevance (equivalent to standard similarity search).
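The MMR selection loop itself is short. An illustrative pure-Python implementation (a sketch of the algorithm, not LangChain's internal code):

```python
def mmr(query_sim, doc_sims, k, lambda_mult=0.5):
    """Iteratively pick k docs maximizing
    lambda * relevance(doc, query) - (1 - lambda) * max_sim(doc, picked).

    query_sim: list, similarity of each doc to the query
    doc_sims:  matrix, similarity between each pair of docs
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize docs similar to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but different.
query_sim = [0.9, 0.88, 0.6]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
print(mmr(query_sim, doc_sims, k=2, lambda_mult=0.5))  # [0, 2]: skips the duplicate
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the same call returns `[0, 1]`, i.e. plain similarity ranking, which matches the 0.0 = diversity / 1.0 = relevance rule above.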
Reranking with Cohere Rerank
OCI provides the cohere.rerank.v3-5 model to reorder retrieved documents by relevance. In a RAG pipeline, reranking is applied after initial retrieval (first-stage) to surface the most contextually relevant results before passing them to the LLM.
| Reranking Benefit | Description |
|---|---|
| Improved precision | Reorders results based on deep query-document relevance analysis |
| Reduced token usage | Filters to top-N most relevant documents before sending to LLM |
| Lower latency | Fewer documents in the prompt = faster LLM inference |
| Better than embedding-only retrieval | Embedding similarity is an approximation; reranking cross-encodes query + document pairs |
Source: Cohere Rerank 3.5 on OCI
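The retrieve-then-rerank flow can be sketched as follows; `overlap_score` is a deliberately crude stand-in for the `cohere.rerank.v3-5` cross-encoder call:

```python
def rerank(query: str, docs: list[str], top_n: int, score_fn) -> list[str]:
    """Score each (query, doc) pair, then keep only the top_n
    highest-scoring docs for the LLM prompt."""
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_n]

def overlap_score(query: str, doc: str) -> int:
    """Toy scorer: shared-word count. A real reranker model jointly
    encodes the query and document for much deeper relevance analysis."""
    words = lambda s: set(s.lower().replace(".", "").split())
    return len(words(query) & words(doc))

candidates = [
    "Oracle RAC clusters multiple servers around one database.",
    "FAISS is an in-memory vector library.",
    "RAC stands for Real Application Clusters in Oracle.",
]
top = rerank("what is oracle rac", candidates, top_n=2, score_fn=overlap_score)
print(top)  # the two RAC documents; the FAISS doc is filtered out
```

The structure is what matters here: first-stage retrieval produces `candidates`, the reranker reorders them, and only the `top_n` survivors are stuffed into the prompt, which is where the token and latency savings come from.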
8. Response Generation: Chain Types
After retrieving relevant chunks, you must pass them to the LLM for response generation. LangChain provides several chain types for this step.
RetrievalQA Chain Types
| Chain Type | How It Works | Pros | Cons |
|---|---|---|---|
| stuff | Concatenates all retrieved documents into a single prompt, sends one LLM call | Simple, fast, single API call | Fails if combined context exceeds model's context window |
| map_reduce | Sends each document to the LLM separately, then combines answers in a final LLM call | Handles arbitrarily many documents, parallelizable | Many LLM calls (expensive, slower), may lose cross-document context |
| refine | Processes documents sequentially -- generates an initial answer from the first document, then refines it with each subsequent document | Builds progressively detailed answers | Sequential (not parallelizable), slow for many documents |
Exam trap: stuff is the default and best choice for most RAG use cases where retrieved context fits within the context window. Only use map_reduce or refine when the total retrieved text exceeds the model's token limit. map_reduce is parallelizable; refine is not.
RAG Prompt Template Pattern
```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""")
```
Complete RAG Chain with LCEL
```python
from langchain_oci import ChatOCIGenAI, OCIGenAIEmbeddings
from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize components (conn is the oracledb connection and rag_prompt the
# ChatPromptTemplate defined in the earlier snippets)
embeddings = OCIGenAIEmbeddings(model_id="cohere.embed-english-v3.0", ...)
vector_store = OracleVS(client=conn, table_name="KNOWLEDGE_BASE",
                        embedding_function=embeddings)
llm = ChatOCIGenAI(model_id="cohere.command-r-plus-08-2024", ...)

# RAG function: retrieve top-k chunks, stuff them into the prompt, generate
def ask(question: str) -> str:
    docs = vector_store.similarity_search(question, k=3)
    context = "\n\n".join(doc.page_content for doc in docs)
    chain = rag_prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": question})
```
9. Conversational RAG
Conversational RAG extends single-turn RAG by maintaining chat history across multiple exchanges. This allows follow-up questions that reference earlier turns.
Memory Types
| Memory Class | Behavior | Use Case |
|---|---|---|
| `ConversationBufferMemory` | Stores all messages verbatim | Short conversations, full context needed |
| `ConversationBufferWindowMemory` | Stores last k exchanges only | Longer conversations, bounded memory |
| `ConversationSummaryMemory` | Summarizes conversation history using an LLM | Very long conversations, reduces token usage |
```python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
```
Exam trap: ConversationBufferMemory stores everything and will eventually exceed the context window for long conversations. ConversationBufferWindowMemory(k=5) only keeps the last 5 exchanges. ConversationSummaryMemory uses an additional LLM call to summarize history, which costs tokens but preserves the gist of long conversations.
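The windowing behavior is easy to sketch in plain Python (an illustration of the idea, not LangChain's `ConversationBufferWindowMemory` implementation):

```python
from collections import deque

class BufferWindowMemory:
    """Keeps only the last k exchanges, so prompt size stays bounded
    no matter how long the conversation runs."""
    def __init__(self, k: int):
        self.exchanges = deque(maxlen=k)  # old exchanges fall off automatically

    def save(self, user_msg: str, ai_msg: str):
        self.exchanges.append((user_msg, ai_msg))

    def load(self) -> str:
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.exchanges)

memory = BufferWindowMemory(k=2)
for i in range(1, 5):
    memory.save(f"question {i}", f"answer {i}")
print(memory.load())  # only exchanges 3 and 4 remain
```

A `deque(maxlen=k)` silently discards the oldest exchange on overflow, which is exactly the trade-off the window memory makes: bounded tokens, lost early context.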
Conversational RAG Architecture
In conversational RAG, the user's follow-up question (e.g., "What about its performance?") must be combined with chat history to reformulate a standalone query for retrieval. The typical pattern:
- User submits follow-up question
- Chat history + follow-up question are sent to an LLM to generate a standalone question
- Standalone question is embedded and used for similarity search
- Retrieved documents + chat history + question are sent to the LLM for final response
- Response and question are added to memory
This prevents the retriever from searching for "its performance" without knowing what "it" refers to.
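A minimal sketch of that flow, with stubbed LLM and retriever (the helper names are illustrative; a real pipeline would call ChatOCIGenAI for the reformulation and OracleVS for the search):

```python
def condense_question(chat_history: list[tuple[str, str]], followup: str) -> str:
    """Stub for the 'standalone question' LLM call: here we just attach
    the last user question so the retriever sees the missing referent."""
    if not chat_history:
        return followup
    last_question, _ = chat_history[-1]
    return f"{followup} (in the context of: {last_question})"

def retrieve(standalone_question: str) -> list[str]:
    # Stub retriever; a real pipeline would embed the standalone question
    # and run similarity_search against the vector store.
    return [f"doc about: {standalone_question}"]

def answer(chat_history, followup):
    standalone = condense_question(chat_history, followup)
    docs = retrieve(standalone)                # search uses the standalone form
    response = f"grounded answer using {len(docs)} doc(s)"
    chat_history.append((followup, response))  # update memory for the next turn
    return standalone, response

history = [("What is Oracle RAC?", "RAC clusters multiple servers...")]
standalone, _ = answer(history, "What about its performance?")
print(standalone)  # the reformulated query now mentions Oracle RAC
```

The key design point is that retrieval always operates on the standalone question, never on the raw follow-up, while the final generation prompt can still include the full chat history.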
10. Quick-Reference Summary Table
| Concept | Key Fact for Exam |
|---|---|
| RAG vs. fine-tuning | RAG = retrieval at inference, no weight changes. Fine-tuning = weight changes through training. |
| `langchain-oci` | Current recommended package (replaces `langchain-community` for OCI) |
| `embed_query()` vs. `embed_documents()` | Different Cohere `input_type` values -- do not interchange |
| Cohere Embed v3 dimensions | Full models: 1024. Light models: 384. |
| Max embedding inputs per batch | 96 |
| Max tokens per embedding input | 512 (text-only models) |
| Default distance metric (Oracle 23ai) | Cosine |
| HNSW vs. IVF | HNSW = in-memory graph, faster search. IVF = partition-based, handles larger datasets. |
| `FETCH EXACT FIRST k ROWS ONLY` | Forces exact search, bypasses vector index |
| MMR `lambda_mult` | 0.0 = max diversity, 1.0 = max relevance |
| `stuff` chain type | Default. Concatenates all docs. Fails if context exceeds window. |
| `map_reduce` chain type | Parallel LLM calls per doc, then combines. Many API calls. |
| `refine` chain type | Sequential refinement. Not parallelizable. |
| Rerank model | `cohere.rerank.v3-5` -- reorders retrieved docs by relevance |
| `ConversationBufferWindowMemory(k=N)` | Keeps only last N exchanges |
| Oracle `VECTOR` data type | Requires `COMPATIBLE` parameter 23.4.0+ |
| Chunk size sweet spot | 400-512 tokens with 10-20% overlap for most RAG use cases |
References
- OCI Generative AI Service Documentation
- OCI Generative AI Pretrained Models
- OCI Generative AI Embedding Models
- Cohere Embed English 3 Specifications
- Cohere Rerank 3.5 on OCI
- OCI LangChain Integration
- LangChain ChatOCIGenAI Documentation
- LangChain Oracle AI Vector Search Integration
- Oracle AI Vector Search Overview
- Oracle Vector Index Types
- K21 Academy OCI Generative AI Certification Guide
- OCI Generative AI and LangChain Enterprise Applications