
Domain 3: Implement RAG using OCI Generative AI Service (20%)

Domain 3 of the 1Z0-1127-25 OCI 2025 Generative AI Professional exam tests your ability to build Retrieval-Augmented Generation pipelines using OCI Generative AI, LangChain, and Oracle Database 23ai. This domain represents approximately 10 questions on the exam. It is the most hands-on domain -- expect questions that require understanding of specific classes, parameters, SQL syntax, and end-to-end pipeline architecture rather than abstract concepts.

1. RAG Architecture: End-to-End Workflow

What RAG Is and Why It Matters

Retrieval-Augmented Generation (RAG) combines a retrieval step with a generation step. Instead of relying solely on the LLM's training data (which may be outdated or incomplete, and can lead to hallucinated answers), RAG first retrieves relevant documents from an external knowledge base and injects them into the prompt as context. The LLM then generates a response grounded in those retrieved facts.

Problem | How RAG Solves It
Hallucination | Grounds responses in retrieved factual documents
Stale knowledge | Knowledge base can be updated without retraining the model
Private data | Incorporates enterprise-specific data the LLM was never trained on
Cost | Far cheaper than fine-tuning a model on proprietary data
Auditability | Retrieved source documents can be cited and traced

Exam trap: RAG vs. fine-tuning. RAG retrieves external context at inference time -- it does not modify model weights. Fine-tuning (covered in Domain 2) changes model weights through additional training. The exam will test whether you understand when to use each approach. RAG is preferred when you need up-to-date or private data without the cost and complexity of retraining. Fine-tuning is preferred when you need to change the model's style, tone, or specialized capability.

The RAG Pipeline (Five Stages)

Documents --> Load --> Split/Chunk --> Embed --> Store in Vector DB
                                                       |
User Query --> Embed Query --> Similarity Search -------+
                                                       |
                                    Retrieved Context --+--> Prompt + LLM --> Response

Stage | Component | OCI/LangChain Tool
1. Document Loading | Ingest raw data (PDF, text, web, CSV) | LangChain Document Loaders
2. Text Splitting | Break documents into chunks | RecursiveCharacterTextSplitter, TokenTextSplitter
3. Embedding | Convert chunks to vector representations | OCIGenAIEmbeddings with Cohere Embed models
4. Vector Storage | Store and index embeddings | Oracle 23ai (OracleVS), FAISS, ChromaDB
5. Retrieval + Generation | Search for relevant chunks, generate response | similarity_search(), ChatOCIGenAI, LCEL chains
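The five stages above can be sketched end to end in plain Python. This is a toy illustration, not OCI code: the hypothetical toy_embed stands in for OCIGenAIEmbeddings (a bag-of-words counter instead of a real embedding model), and the "vector store" is a plain list searched by brute-force cosine similarity.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    cleaned = text.lower().replace("?", "").replace(".", "").replace(",", "")
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stages 1-2: load and chunk documents (already chunk-sized here)
chunks = [
    "Oracle RAC provides clustered database high availability.",
    "RAG retrieves documents and injects them into the prompt.",
    "Fine-tuning changes model weights through additional training.",
]

# Stages 3-4: embed each chunk and store (text, vector) pairs
store = [(c, toy_embed(c)) for c in chunks]

# Stage 5: embed the query, search, and build a grounded prompt
query = "How does RAG use retrieved documents?"
qv = toy_embed(query)
top = max(store, key=lambda item: cosine(qv, item[1]))
prompt = f"Context: {top[0]}\n\nQuestion: {query}\n\nAnswer:"
print(top[0])  # the retrieved chunk that would ground the LLM's answer
```

In a real pipeline, the prompt at the end is sent to ChatOCIGenAI; here it simply shows where the retrieved context lands.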

2. LangChain Integration with OCI Generative AI

Installation

pip install langchain-oci langchain-oracledb langchain-community oracledb

langchain-oci is the currently recommended package; it replaces the deprecated langchain-community OCI integration. Source: OCI LangChain Docs

Core OCI LangChain Classes

Class | Package | Purpose
ChatOCIGenAI | langchain_oci | Chat/conversational model interface
OCIGenAI | langchain_oci | Text completion (non-chat) interface
OCIGenAIEmbeddings | langchain_oci | Embedding generation
OracleVS | langchain_oracledb | Oracle 23ai vector store

ChatOCIGenAI Configuration

from langchain_oci import ChatOCIGenAI

llm = ChatOCIGenAI(
    model_id="cohere.command-r-plus-08-2024",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1...",
    model_kwargs={"temperature": 0.3, "max_tokens": 1000},
)

Key constructor parameters:

Parameter | Description
model_id | OCI model identifier (e.g., cohere.command-r-plus-08-2024, meta.llama-3.3-70b-instruct)
service_endpoint | Regional inference endpoint URL
compartment_id | OCI compartment OCID
model_kwargs | Dict with temperature, max_tokens, etc.
auth_profile | OCI config profile name (optional)
is_stream | Enable streaming responses (optional)

Authentication methods (same as all OCI services): API Key (default), Session Token, Instance Principal, Resource Principal. Source: LangChain OCI Chat Docs

OCIGenAIEmbeddings Configuration

from langchain_oci import OCIGenAIEmbeddings

embeddings = OCIGenAIEmbeddings(
    model_id="cohere.embed-english-v3.0",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1...",
)

# Single query embedding
query_vector = embeddings.embed_query("What is Oracle RAC?")

# Batch document embeddings
doc_vectors = embeddings.embed_documents(["Document 1 text", "Document 2 text"])

Exam trap: The embed_query() and embed_documents() methods use different Cohere input types internally. embed_query() uses input_type="search_query" and embed_documents() uses input_type="search_document". You must use the correct method for each purpose -- using embed_documents() for a query will produce suboptimal search results because the embedding model optimizes differently for each input type.

LangChain Expression Language (LCEL)

LCEL provides a declarative way to compose chains using the pipe (|) operator. This is the modern LangChain pattern replacing the older LLMChain and SequentialChain classes.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_oci import ChatOCIGenAI

prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
)
llm = ChatOCIGenAI(model_id="cohere.command-r-plus-08-2024", ...)
output_parser = StrOutputParser()

chain = prompt | llm | output_parser
result = chain.invoke({"context": "...", "question": "What is RAG?"})

The LCEL chain reads left to right: prompt template formats input, passes to LLM, output parser extracts the string response. Source: LangChain LCEL Docs

3. Document Loading

LangChain Document Loaders convert external data sources into Document objects (containing page_content and metadata). Know these loaders:

Loader | Source Type | Import
PyPDFLoader | PDF files | langchain_community.document_loaders
TextLoader | Plain text files | langchain_community.document_loaders
CSVLoader | CSV files (one row = one document) | langchain_community.document_loaders
WebBaseLoader | Web pages (uses requests + BeautifulSoup) | langchain_community.document_loaders
DirectoryLoader | All files in a directory | langchain_community.document_loaders
UnstructuredFileLoader | Multiple formats (PDF, HTML, DOCX, etc.) | langchain_community.document_loaders

Exam trap: DirectoryLoader uses UnstructuredFileLoader by default, not TextLoader. It can handle mixed file types in a single directory.

4. Text Splitting and Chunking

After loading documents, you must split them into chunks small enough for embedding and retrieval. Chunk quality directly impacts retrieval quality.

RecursiveCharacterTextSplitter (Most Important)

The recommended default splitter for most RAG use cases. It splits text recursively using a hierarchy of separators: ["\n\n", "\n", " ", ""]. It tries the first separator first; if chunks are still too large, it falls back to the next separator.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)

TokenTextSplitter

Splits based on token count rather than character count. Important when working with models that have specific token limits.

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)

Chunking Strategy Guidelines

Parameter | Guidance | Impact
chunk_size | 400-512 tokens for most use cases | Too large: noise dilutes relevant content. Too small: loses context.
chunk_overlap | 10-20% of chunk size (e.g., 50-100 for 512) | Prevents important information from being split across chunk boundaries
Factoid Q&A | 256-512 tokens optimal | Short, focused answers need precise retrieval
Analytical Q&A | 1024+ tokens may be needed | Complex answers need broader context

Exam trap: Increasing chunk size does not always improve retrieval quality. Larger chunks contain more noise and can dilute the relevance signal during similarity search. The right chunk size depends on the query type and document structure.
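The interaction between chunk_size and chunk_overlap can be seen with a minimal character-based sliding window. This is a simplification of what the LangChain splitters actually do (they split on separators rather than fixed windows); the chunk function here is illustrative only.

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    # Sliding window: each new chunk starts (size - overlap) characters
    # after the previous one, so every boundary region appears in two
    # chunks and is never lost to a hard cut.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "abcdefghij" * 5                    # 50 characters
chunks = chunk(text, size=20, overlap=5)
print(len(chunks))                         # 4 windows (last one is short)
print(chunks[0][-5:] == chunks[1][:5])     # True: boundary text is repeated
```

The repeated boundary text is exactly what chunk_overlap buys you: a sentence straddling a chunk boundary is still retrievable as a whole from at least one chunk.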

5. Embedding Models in OCI Generative AI

OCI Generative AI provides Cohere embedding models. Know the model variants and their specifications.

OCI Embedding Model Specifications

Model | Model ID | Dimensions | Input Type
Embed 4 (multimodal) | cohere.embed-v4.0 | 1024 | Text + images
Embed English v3 | cohere.embed-english-v3.0 | 1024 | Text only
Embed English Light v3 | cohere.embed-english-light-v3.0 | 384 | Text only
Embed Multilingual v3 | cohere.embed-multilingual-v3.0 | 1024 | Text only
Embed Multilingual Light v3 | cohere.embed-multilingual-light-v3.0 | 384 | Text only
Embed English Image v3 | cohere.embed-english-image-v3.0 | 1024 | Text + images
Embed Multilingual Image v3 | cohere.embed-multilingual-image-v3.0 | 1024 | Text + images

Source: OCI Embedding Models, Cohere Embed English 3

Key Embedding Constraints

Constraint | Value
Max inputs per batch | 96
Max tokens per input (text models) | 512
Max tokens per input (image models) | 128,000 combined
Truncation options | START, END, NONE (NONE is the default and returns an error if the input exceeds the limit)
Output format | Array of floating-point vectors

Cohere Input Types

Cohere Embed v3+ models require specifying an input_type. This is handled automatically by LangChain's embed_query() and embed_documents() methods.

Input Type | When Used | LangChain Method
search_document | Embedding documents for storage in vector DB | embed_documents()
search_query | Embedding user queries for similarity search | embed_query()
classification | Text classification tasks | Manual API call
clustering | Text clustering tasks | Manual API call

Exam trap: Full models produce 1024-dimensional vectors. Light models produce 384-dimensional vectors. Light models are faster and cheaper but less accurate. You cannot mix embeddings from different models in the same vector store -- the dimensions and semantic spaces will not align.
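A quick sketch of why mixing models in one vector store fails: vectors of different dimensions cannot even be compared, and the guard below is the kind of error you would hit at query time. The cosine_similarity helper is illustrative, not a library function.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Vectors from different embedding models are incompatible: the
    # dimensions differ, and even equal-length vectors from different
    # models live in unrelated semantic spaces.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

full = [0.01] * 1024    # output size of a full v3 model
light = [0.01] * 384    # output size of a light v3 model

try:
    cosine_similarity(full, light)
except ValueError as e:
    print(e)  # dimension mismatch: 1024 vs 384
```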

6. Vector Storage and Indexing

Vector Store Options

Vector Store | Type | Best For
Oracle Database 23ai (OracleVS) | Production relational + vector DB | Enterprise RAG with existing Oracle infrastructure, combined relational + vector queries
FAISS | In-memory, local | Prototyping, small-to-medium datasets, no database infrastructure needed
ChromaDB | Embedded/client-server | Development, lightweight persistent storage

Oracle 23ai Vector Store Integration

Oracle Database 23ai introduces the VECTOR data type and AI Vector Search. This is the exam's primary vector store.

from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy
import oracledb

conn = oracledb.connect(user="username", password="password", dsn="host:port/service")

vector_store = OracleVS.from_documents(
    documents,
    embeddings,
    client=conn,
    table_name="KNOWLEDGE_BASE",
    distance_strategy=DistanceStrategy.COSINE,
)

Source: LangChain Oracle Vector Store

Oracle 23ai SQL Vector Operations

-- Create table with VECTOR column
CREATE TABLE docs (
    doc_id INT,
    doc_text CLOB,
    doc_vector VECTOR
);

-- Exact similarity search (top-K)
SELECT doc_id, doc_text
FROM docs
ORDER BY VECTOR_DISTANCE(doc_vector, :query_vector, COSINE)
FETCH EXACT FIRST 5 ROWS ONLY;

-- Approximate similarity search (uses vector index if available)
SELECT doc_id, doc_text
FROM docs
ORDER BY VECTOR_DISTANCE(doc_vector, :query_vector, COSINE)
FETCH FIRST 5 ROWS ONLY;

Source: Oracle AI Vector Search

Exam trap: The EXACT keyword in FETCH EXACT FIRST k ROWS ONLY forces exact (brute-force) search even when a vector index exists. Omitting EXACT allows the optimizer to use approximate search via vector indexes. On Autonomous Database Serverless (ADB-S), omitting EXACT automatically attempts approximate search if an index is available.

Vector Index Types

Index Type | Organization | Best For | Key Parameters
HNSW | In-Memory Neighbor Graph | High search quality, fast queries, smaller datasets that fit in memory | neighbors (max edges per node), efConstruction (search width during build)
IVF | Neighbor Partitions | Large datasets, disk-based, balanced speed/quality | neighbor_part (number of partitions)

-- HNSW index creation
CREATE VECTOR INDEX docs_hnsw_idx ON docs (doc_vector)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95
PARAMETERS (type HNSW, neighbors 40, efConstruction 500);

-- IVF index creation
CREATE VECTOR INDEX docs_ivf_idx ON docs (doc_vector)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 90;

Source: Oracle Vector Index Types

Exam trap: If you do not specify a DISTANCE metric in the index definition, the default is COSINE. The distance metric used at index creation must match the metric used during search queries -- otherwise, Oracle falls back to exact search, bypassing the index entirely.

LangChain Index Creation (Python)

from langchain_oracledb.vectorstores import oraclevs

# HNSW with target accuracy
oraclevs.create_index(conn, vector_store, params={
    "idx_name": "hnsw_idx", "idx_type": "HNSW", "accuracy": 95, "parallel": 8
})

# IVF with target accuracy
oraclevs.create_index(conn, vector_store, params={
    "idx_name": "ivf_idx", "idx_type": "IVF", "accuracy": 90, "parallel": 16
})

7. Similarity Search and Retrieval

Distance Metrics

Metric | VECTOR_DISTANCE Name | DistanceStrategy | Behavior
Cosine | COSINE | DistanceStrategy.COSINE | Measures angle between vectors (0 = identical, 1 = orthogonal). Default in Oracle 23ai.
Dot Product | DOT | DistanceStrategy.DOT_PRODUCT | Oracle's DOT distance is the negated dot product, so lower distance = more similar. Sensitive to vector magnitude.
Euclidean | EUCLIDEAN | DistanceStrategy.EUCLIDEAN_DISTANCE | Straight-line distance. Lower = more similar.
Euclidean Squared | EUCLIDEAN_SQUARED | N/A | Faster variant (skips square root). Same ordering as Euclidean.

Exam trap: Always use the distance metric that matches what the embedding model was trained with. Cohere Embed v3 models are optimized for cosine similarity. Using dot product with a cosine-trained model degrades results.
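A small numeric example of the magnitude sensitivity mentioned above (the helper functions are illustrative):

```python
import math

def cos_sim(a: list[float], b: list[float]) -> float:
    # Cosine similarity normalizes by both magnitudes, so it only
    # measures direction.
    d = sum(x * y for x, y in zip(a, b))
    return d / (math.hypot(*a) * math.hypot(*b))

def dot(a: list[float], b: list[float]) -> float:
    # Raw dot product scales with vector length.
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]                        # same direction, double the magnitude
print(cos_sim(a, a), cos_sim(a, b))   # 1.0 1.0 -- cosine ignores magnitude
print(dot(a, a), dot(a, b))           # 25.0 50.0 -- dot product does not
```

Two documents pointing in the same semantic direction can thus rank very differently under dot product, which is why the metric should match what the embedding model was trained for.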

Search Methods in LangChain

# Basic similarity search (top-k)
results = vector_store.similarity_search("What is Oracle RAC?", k=3)

# Similarity search with relevance scores
results = vector_store.similarity_search_with_score("What is Oracle RAC?", k=3)

# Maximum Marginal Relevance (MMR) -- balances relevance and diversity
results = vector_store.max_marginal_relevance_search(
    "What is Oracle RAC?",
    k=3,           # Number of results to return
    fetch_k=20,    # Candidates to consider before MMR reranking
    lambda_mult=0.5  # 0=max diversity, 1=max relevance
)

Maximum Marginal Relevance (MMR) reduces redundancy in retrieved documents. Standard similarity search may return multiple chunks saying the same thing. MMR fetches a larger candidate set (fetch_k) then iteratively selects results that are both relevant to the query and diverse from each other. The lambda_mult parameter controls the trade-off: 0.0 maximizes diversity, 1.0 maximizes relevance (equivalent to standard similarity search).
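The MMR selection loop itself is straightforward to sketch in plain Python. This follows the standard MMR formulation (relevance minus a redundancy penalty, weighted by lambda_mult); it is a simplified stand-in for what max_marginal_relevance_search does internally, not LangChain's actual code.

```python
import math

def cos(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def mmr_select(query_vec, candidates, k=2, lambda_mult=0.5):
    # Greedily pick the candidate maximizing:
    #   lambda * relevance(query, cand)
    #   - (1 - lambda) * max similarity to anything already selected
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, candidates[i])
            redundancy = max((cos(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
cands = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]  # 0 and 1 are near-duplicates
print(mmr_select(query, cands, k=2, lambda_mult=0.5))  # [0, 2]: diverse pick
print(mmr_select(query, cands, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
```

With lambda_mult=0.5 the near-duplicate is penalized and the diverse candidate wins the second slot; with lambda_mult=1.0 the redundancy term vanishes and MMR degenerates to plain similarity ranking.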

Reranking with Cohere Rerank

OCI provides the cohere.rerank.v3-5 model to reorder retrieved documents by relevance. In a RAG pipeline, reranking is applied after initial retrieval (first-stage) to surface the most contextually relevant results before passing them to the LLM.

Reranking Benefit | Description
Improved precision | Reorders results based on deep query-document relevance analysis
Reduced token usage | Filters to top-N most relevant documents before sending to LLM
Lower latency | Fewer documents in the prompt = faster LLM inference
Better than embedding-only retrieval | Embedding similarity is an approximation; reranking cross-encodes query + document pairs

Source: Cohere Rerank 3.5 on OCI

8. Response Generation: Chain Types

After retrieving relevant chunks, you must pass them to the LLM for response generation. LangChain provides several chain types for this step.

RetrievalQA Chain Types

Chain Type | How It Works | Pros | Cons
stuff | Concatenates all retrieved documents into a single prompt, sends one LLM call | Simple, fast, single API call | Fails if combined context exceeds model's context window
map_reduce | Sends each document to the LLM separately, then combines answers in a final LLM call | Handles arbitrarily many documents, parallelizable | Many LLM calls (expensive, slower), may lose cross-document context
refine | Processes documents sequentially: generates an initial answer from the first document, then refines it with each subsequent document | Builds progressively detailed answers | Sequential (not parallelizable), slow for many documents

Exam trap: stuff is the default and best choice for most RAG use cases where retrieved context fits within the context window. Only use map_reduce or refine when the total retrieved text exceeds the model's token limit. map_reduce is parallelizable; refine is not.
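The stuff strategy and its failure mode can be sketched in a few lines. This toy version counts characters rather than tokens and raises instead of calling an LLM; stuff_chain is an illustrative name, not a LangChain API.

```python
def stuff_chain(docs: list[str], question: str, context_window: int = 100) -> str:
    # "stuff": concatenate every retrieved document into one prompt for
    # a single LLM call. Real systems count tokens, not characters.
    prompt = "Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {question}"
    if len(prompt) > context_window:
        raise ValueError("combined context exceeds the model's context window")
    return prompt

small = ["RAG grounds answers in retrieved text."]
print(stuff_chain(small, "What is RAG?"))   # fits: one prompt, one call

big = ["x" * 80, "y" * 80]                  # together they overflow the window
try:
    stuff_chain(big, "What is RAG?")
except ValueError as e:
    print(e)
```

The overflow case is exactly when you would switch to map_reduce (one call per document, parallel) or refine (sequential refinement).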

RAG Prompt Template Pattern

from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""")

Complete RAG Chain with LCEL

from langchain_oci import ChatOCIGenAI, OCIGenAIEmbeddings
from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize components
embeddings = OCIGenAIEmbeddings(model_id="cohere.embed-english-v3.0", ...)
vector_store = OracleVS(client=conn, table_name="KNOWLEDGE_BASE",
                        embedding_function=embeddings)
llm = ChatOCIGenAI(model_id="cohere.command-r-plus-08-2024", ...)

# RAG function
def ask(question: str) -> str:
    docs = vector_store.similarity_search(question, k=3)
    context = "\n\n".join([doc.page_content for doc in docs])
    chain = rag_prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": question})

9. Conversational RAG

Conversational RAG extends single-turn RAG by maintaining chat history across multiple exchanges. This allows follow-up questions that reference earlier turns.

Memory Types

Memory Class | Behavior | Use Case
ConversationBufferMemory | Stores all messages verbatim | Short conversations, full context needed
ConversationBufferWindowMemory | Stores last k exchanges only | Longer conversations, bounded memory
ConversationSummaryMemory | Summarizes conversation history using an LLM | Very long conversations, reduces token usage

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

Exam trap: ConversationBufferMemory stores everything and will eventually exceed the context window for long conversations. ConversationBufferWindowMemory(k=5) only keeps the last 5 exchanges. ConversationSummaryMemory uses an additional LLM call to summarize history, which costs tokens but preserves the gist of long conversations.

Conversational RAG Architecture

In conversational RAG, the user's follow-up question (e.g., "What about its performance?") must be combined with chat history to reformulate a standalone query for retrieval. The typical pattern:

  1. User submits follow-up question
  2. Chat history + follow-up question are sent to an LLM to generate a standalone question
  3. Standalone question is embedded and used for similarity search
  4. Retrieved documents + chat history + question are sent to the LLM for final response
  5. Response and question are added to memory

This prevents the retriever from searching for "its performance" without knowing what "it" refers to.
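The steps above can be sketched with a stubbed rewrite function. In a real pipeline, step 2 is an LLM call; the hypothetical condense stub below fakes it with simple string substitution, purely to show the data flow.

```python
def condense(history: list[tuple[str, str]], followup: str) -> str:
    # Stub for step 2: a real pipeline sends history + follow-up to an
    # LLM and gets back a standalone question. Here we just resolve a
    # pronoun against the last topic discussed.
    if not history:
        return followup
    last_topic = history[-1][0]
    return followup.replace("its", f"{last_topic}'s")

# Step 1: the user asks a follow-up that only makes sense in context
history = [("Oracle RAC", "Oracle RAC is a clustered database option.")]
standalone = condense(history, "What about its performance?")
print(standalone)  # "What about Oracle RAC's performance?"

# Steps 3-5: embed `standalone` (not the raw follow-up), run
# similarity_search with it, generate the answer, then append the new
# (question, answer) pair to history.
```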

10. Quick-Reference Summary Table

Concept | Key Fact for Exam
RAG vs. fine-tuning | RAG = retrieval at inference, no weight changes. Fine-tuning = weight changes through training.
langchain-oci | Current recommended package (replaces langchain-community for OCI)
embed_query() vs. embed_documents() | Different Cohere input_type values -- do not interchange
Cohere Embed v3 dimensions | Full models: 1024. Light models: 384.
Max embedding inputs per batch | 96
Max tokens per embedding input | 512 (text-only models)
Default distance metric (Oracle 23ai) | Cosine
HNSW vs. IVF | HNSW = in-memory graph, faster search. IVF = partition-based, handles larger datasets.
FETCH EXACT FIRST k ROWS ONLY | Forces exact search, bypasses vector index
MMR lambda_mult | 0.0 = max diversity, 1.0 = max relevance
stuff chain type | Default. Concatenates all docs. Fails if context exceeds window.
map_reduce chain type | Parallel LLM calls per doc, then combines. Many API calls.
refine chain type | Sequential refinement. Not parallelizable.
Rerank model | cohere.rerank.v3-5 -- reorders retrieved docs by relevance
ConversationBufferWindowMemory(k=N) | Keeps only last N exchanges
Oracle VECTOR data type | Requires COMPATIBLE parameter 23.4.0+
Chunk size sweet spot | 400-512 tokens with 10-20% overlap for most RAG use cases

References