
Domain 3: Implement RAG using OCI Generative AI Service (20%)

Domain 3 of the 1Z0-1127-25 OCI 2025 Generative AI Professional exam tests your ability to build Retrieval-Augmented Generation pipelines using OCI Generative AI, LangChain, and Oracle Database 23ai. This domain represents approximately 10 questions on the exam. It is the most hands-on domain -- expect questions that require understanding of specific classes, parameters, SQL syntax, and end-to-end pipeline architecture rather than abstract concepts.

1. RAG Architecture: End-to-End Workflow

What RAG Is and Why It Matters

Retrieval-Augmented Generation (RAG) combines a retrieval step with a generation step. Instead of relying solely on the LLM's training data (which may be outdated or incomplete, and can lead to hallucinated answers), RAG first retrieves relevant documents from an external knowledge base and injects them into the prompt as context. The LLM then generates a response grounded in those retrieved facts.

Problem | How RAG Solves It
Hallucination | Grounds responses in retrieved factual documents
Stale knowledge | Knowledge base can be updated without retraining the model
Private data | Incorporates enterprise-specific data the LLM was never trained on
Cost | Far cheaper than fine-tuning a model on proprietary data
Auditability | Retrieved source documents can be cited and traced

Exam trap: RAG vs. fine-tuning. RAG retrieves external context at inference time -- it does not modify model weights. Fine-tuning (covered in Domain 2) changes model weights through additional training. The exam will test whether you understand when to use each approach. RAG is preferred when you need up-to-date or private data without the cost and complexity of retraining. Fine-tuning is preferred when you need to change the model's style, tone, or specialized capability.

The RAG Pipeline (Five Stages)

Documents --> Load --> Split/Chunk --> Embed --> Store in Vector DB
                                                       |
User Query --> Embed Query --> Similarity Search -------+
                                                       |
                                    Retrieved Context --+--> Prompt + LLM --> Response

Stage | Component | OCI/LangChain Tool
1. Document Loading | Ingest raw data (PDF, text, web, CSV) | LangChain Document Loaders
2. Text Splitting | Break documents into chunks | RecursiveCharacterTextSplitter, TokenTextSplitter
3. Embedding | Convert chunks to vector representations | OCIGenAIEmbeddings with Cohere Embed models
4. Vector Storage | Store and index embeddings | Oracle 23ai (OracleVS), FAISS, ChromaDB
5. Retrieval + Generation | Search for relevant chunks, generate response | similarity_search(), ChatOCIGenAI, LCEL chains
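The five stages above can be sketched end to end in plain Python. This is a toy illustration, not OCI code: the hypothetical toy_embed stands in for OCIGenAIEmbeddings (a bag-of-words counter instead of a real embedding model), and the "vector store" is a plain list searched by brute-force cosine similarity.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    cleaned = text.lower().replace("?", "").replace(".", "").replace(",", "")
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stages 1-2: load and chunk documents (already chunk-sized here)
chunks = [
    "Oracle RAC provides clustered database high availability.",
    "RAG retrieves documents and injects them into the prompt.",
    "Fine-tuning changes model weights through additional training.",
]

# Stages 3-4: embed each chunk and store (text, vector) pairs
store = [(c, toy_embed(c)) for c in chunks]

# Stage 5: embed the query, search, and build a grounded prompt
query = "How does RAG use retrieved documents?"
qv = toy_embed(query)
top = max(store, key=lambda item: cosine(qv, item[1]))
prompt = f"Context: {top[0]}\n\nQuestion: {query}\n\nAnswer:"
print(top[0])  # the retrieved chunk that would ground the LLM's answer
```

In a real pipeline, the prompt at the end is sent to ChatOCIGenAI; here it simply shows where the retrieved context lands.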

2. LangChain Integration with OCI Generative AI

Installation

pip install langchain-oci langchain-oracledb langchain-community oracledb

langchain-oci is the currently recommended package; it replaces the deprecated langchain-community OCI integration. Source: OCI LangChain Docs

Core OCI LangChain Classes

Class | Package | Purpose
ChatOCIGenAI | langchain_oci | Chat/conversational model interface
OCIGenAI | langchain_oci | Text completion (non-chat) interface
OCIGenAIEmbeddings | langchain_oci | Embedding generation
OracleVS | langchain_oracledb | Oracle 23ai vector store

ChatOCIGenAI Configuration

from langchain_oci import ChatOCIGenAI

llm = ChatOCIGenAI(
    model_id="cohere.command-r-plus-08-2024",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1...",
    model_kwargs={"temperature": 0.3, "max_tokens": 1000},
)

Key constructor parameters:

Parameter | Description
model_id | OCI model identifier (e.g., cohere.command-r-plus-08-2024, meta.llama-3.3-70b-instruct)
service_endpoint | Regional inference endpoint URL
compartment_id | OCI compartment OCID
model_kwargs | Dict with temperature, max_tokens, etc.
auth_profile | OCI config profile name (optional)
is_stream | Enable streaming responses (optional)

Authentication methods (same as all OCI services): API Key (default), Session Token, Instance Principal, Resource Principal. Source: LangChain OCI Chat Docs

OCIGenAIEmbeddings Configuration

from langchain_oci import OCIGenAIEmbeddings

embeddings = OCIGenAIEmbeddings(
    model_id="cohere.embed-english-v3.0",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.compartment.oc1...",
)

# Single query embedding
query_vector = embeddings.embed_query("What is Oracle RAC?")

# Batch document embeddings
doc_vectors = embeddings.embed_documents(["Document 1 text", "Document 2 text"])

Exam trap: The embed_query() and embed_documents() methods use different Cohere input types internally. embed_query() uses input_type="search_query" and embed_documents() uses input_type="search_document". You must use the correct method for each purpose -- using embed_documents() for a query will produce suboptimal search results because the embedding model optimizes differently for each input type.

LangChain Expression Language (LCEL)

LCEL provides a declarative way to compose chains using the pipe (|) operator. This is the modern LangChain pattern replacing the older LLMChain and SequentialChain classes.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_oci import ChatOCIGenAI

prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"
)
llm = ChatOCIGenAI(model_id="cohere.command-r-plus-08-2024", ...)
output_parser = StrOutputParser()

chain = prompt | llm | output_parser
result = chain.invoke({"context": "...", "question": "What is RAG?"})

The LCEL chain reads left to right: prompt template formats input, passes to LLM, output parser extracts the string response. Source: LangChain LCEL Docs

3. Document Loading

LangChain Document Loaders convert external data sources into Document objects (containing page_content and metadata). Know these loaders:

Loader | Source Type | Import
PyPDFLoader | PDF files | langchain_community.document_loaders
TextLoader | Plain text files | langchain_community.document_loaders
CSVLoader | CSV files (one row = one document) | langchain_community.document_loaders
WebBaseLoader | Web pages (uses requests + BeautifulSoup) | langchain_community.document_loaders
DirectoryLoader | All files in a directory | langchain_community.document_loaders
UnstructuredFileLoader | Multiple formats (PDF, HTML, DOCX, etc.) | langchain_community.document_loaders

Exam trap: DirectoryLoader uses UnstructuredFileLoader by default, not TextLoader. It can handle mixed file types in a single directory.

4. Text Splitting and Chunking

After loading documents, you must split them into chunks small enough for embedding and retrieval. Chunk quality directly impacts retrieval quality.

RecursiveCharacterTextSplitter (Most Important)

The recommended default splitter for most RAG use cases. It splits text recursively using a hierarchy of separators: ["\n\n", "\n", " ", ""]. It tries the first separator first; if chunks are still too large, it falls back to the next separator.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)

TokenTextSplitter

Splits based on token count rather than character count. Important when working with models that have specific token limits.

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)

Chunking Strategy Guidelines

Parameter | Guidance | Impact
chunk_size | 400-512 tokens for most use cases | Too large: noise dilutes relevant content. Too small: loses context.
chunk_overlap | 10-20% of chunk size (e.g., 50-100 for 512) | Prevents important information from being split across chunk boundaries
Factoid Q&A | 256-512 tokens optimal | Short, focused answers need precise retrieval
Analytical Q&A | 1024+ tokens may be needed | Complex answers need broader context

Exam trap: Increasing chunk size does not always improve retrieval quality. Larger chunks contain more noise and can dilute the relevance signal during similarity search. The right chunk size depends on the query type and document structure.
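The interaction between chunk_size and chunk_overlap can be seen with a minimal character-based sliding window. This is a simplification of what the LangChain splitters actually do (they split on separators rather than fixed windows); the chunk function here is illustrative only.

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    # Sliding window: each new chunk starts (size - overlap) characters
    # after the previous one, so every boundary region appears in two
    # chunks and is never lost to a hard cut.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "abcdefghij" * 5                    # 50 characters
chunks = chunk(text, size=20, overlap=5)
print(len(chunks))                         # 4 windows (last one is short)
print(chunks[0][-5:] == chunks[1][:5])     # True: boundary text is repeated
```

The repeated boundary text is exactly what chunk_overlap buys you: a sentence straddling a chunk boundary is still retrievable as a whole from at least one chunk.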

5. Embedding Models in OCI Generative AI

OCI Generative AI provides Cohere embedding models. Know the model variants and their specifications.

OCI Embedding Model Specifications

Model | Model ID | Dimensions | Input Type
Embed 4 (multimodal) | cohere.embed-v4.0 | 1024 | Text + images
Embed English v3 | cohere.embed-english-v3.0 | 1024 | Text only
Embed English Light v3 | cohere.embed-english-light-v3.0 | 384 | Text only
Embed Multilingual v3 | cohere.embed-multilingual-v3.0 | 1024 | Text only
Embed Multilingual Light v3 | cohere.embed-multilingual-light-v3.0 | 384 | Text only
Embed English Image v3 | cohere.embed-english-image-v3.0 | 1024 | Text + images
Embed Multilingual Image v3 | cohere.embed-multilingual-image-v3.0 | 1024 | Text + images

Source: OCI Embedding Models, Cohere Embed English 3

Key Embedding Constraints

Constraint | Value
Max inputs per batch | 96
Max tokens per input (text models) | 512
Max tokens per input (image models) | 128,000 combined
Truncation options | START, END, NONE (NONE is the default and returns an error if the input exceeds the limit)
Output format | Array of floating-point vectors

Cohere Input Types

Cohere Embed v3+ models require specifying an input_type. This is handled automatically by LangChain's embed_query() and embed_documents() methods.

Input Type | When Used | LangChain Method
search_document | Embedding documents for storage in vector DB | embed_documents()
search_query | Embedding user queries for similarity search | embed_query()
classification | Text classification tasks | Manual API call
clustering | Text clustering tasks | Manual API call

Exam trap: Full models produce 1024-dimensional vectors. Light models produce 384-dimensional vectors. Light models are faster and cheaper but less accurate. You cannot mix embeddings from different models in the same vector store -- the dimensions and semantic spaces will not align.
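A quick sketch of why mixing models in one vector store fails: vectors of different dimensions cannot even be compared, and the guard below is the kind of error you would hit at query time. The cosine_similarity helper is illustrative, not a library function.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Vectors from different embedding models are incompatible: the
    # dimensions differ, and even equal-length vectors from different
    # models live in unrelated semantic spaces.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

full = [0.01] * 1024    # output size of a full v3 model
light = [0.01] * 384    # output size of a light v3 model

try:
    cosine_similarity(full, light)
except ValueError as e:
    print(e)  # dimension mismatch: 1024 vs 384
```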

6. Vector Storage and Indexing

Vector Store Options

Vector Store | Type | Best For
Oracle Database 23ai (OracleVS) | Production relational + vector DB | Enterprise RAG with existing Oracle infrastructure, combined relational + vector queries
FAISS | In-memory, local | Prototyping, small-to-medium datasets, no database infrastructure needed
ChromaDB | Embedded/client-server | Development, lightweight persistent storage

Oracle 23ai Vector Store Integration

Oracle Database 23ai introduces the VECTOR data type and AI Vector Search. This is the exam's primary vector store.

from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_community.vectorstores.utils import DistanceStrategy
import oracledb

conn = oracledb.connect(user="username", password="password", dsn="host:port/service")

vector_store = OracleVS.from_documents(
    documents,
    embeddings,
    client=conn,
    table_name="KNOWLEDGE_BASE",
    distance_strategy=DistanceStrategy.COSINE,
)

Source: LangChain Oracle Vector Store

Oracle 23ai SQL Vector Operations

-- Create table with VECTOR column
CREATE TABLE docs (
    doc_id INT,
    doc_text CLOB,
    doc_vector VECTOR
);

-- Exact similarity search (top-K)
SELECT doc_id, doc_text
FROM docs
ORDER BY VECTOR_DISTANCE(doc_vector, :query_vector, COSINE)
FETCH EXACT FIRST 5 ROWS ONLY;

-- Approximate similarity search (uses vector index if available)
SELECT doc_id, doc_text
FROM docs
ORDER BY VECTOR_DISTANCE(doc_vector, :query_vector, COSINE)
FETCH FIRST 5 ROWS ONLY;

Source: Oracle AI Vector Search

Exam trap: The EXACT keyword in FETCH EXACT FIRST k ROWS ONLY forces exact (brute-force) search even when a vector index exists. Omitting EXACT allows the optimizer to use approximate search via vector indexes. On Autonomous Database Serverless (ADB-S), omitting EXACT automatically attempts approximate search if an index is available.

Vector Index Types

Index Type | Organization | Best For | Key Parameters
HNSW | In-Memory Neighbor Graph | High search quality, fast queries, smaller datasets that fit in memory | neighbors (max edges per node), efConstruction (search width during build)
IVF | Neighbor Partitions | Large datasets, disk-based, balanced speed/quality | neighbor_part (number of partitions)

-- HNSW index creation
CREATE VECTOR INDEX docs_hnsw_idx ON docs (doc_vector)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95
PARAMETERS (type HNSW, neighbors 40, efConstruction 500);

-- IVF index creation
CREATE VECTOR INDEX docs_ivf_idx ON docs (doc_vector)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 90;

Source: Oracle Vector Index Types

Exam trap: If you do not specify a DISTANCE metric in the index definition, the default is COSINE. The distance metric used at index creation must match the metric used during search queries -- otherwise, Oracle falls back to exact search, bypassing the index entirely.

LangChain Index Creation (Python)

from langchain_oracledb.vectorstores import oraclevs

# HNSW with target accuracy
oraclevs.create_index(conn, vector_store, params={
    "idx_name": "hnsw_idx", "idx_type": "HNSW", "accuracy": 95, "parallel": 8
})

# IVF with target accuracy
oraclevs.create_index(conn, vector_store, params={
    "idx_name": "ivf_idx", "idx_type": "IVF", "accuracy": 90, "parallel": 16
})

7. Similarity Search and Retrieval

Distance Metrics

Metric | VECTOR_DISTANCE Name | DistanceStrategy | Behavior
Cosine | COSINE | DistanceStrategy.COSINE | Measures angle between vectors (0 = identical, 1 = orthogonal). Default in Oracle 23ai.
Dot Product | DOT | DistanceStrategy.DOT_PRODUCT | Oracle's DOT distance is the negated dot product, so lower distance = more similar. Sensitive to vector magnitude.
Euclidean | EUCLIDEAN | DistanceStrategy.EUCLIDEAN_DISTANCE | Straight-line distance. Lower = more similar.
Euclidean Squared | EUCLIDEAN_SQUARED | N/A | Faster variant (skips square root). Same ordering as Euclidean.

Exam trap: Always use the distance metric that matches what the embedding model was trained with. Cohere Embed v3 models are optimized for cosine similarity. Using dot product with a cosine-trained model degrades results.
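A small numeric example of the magnitude sensitivity mentioned above (the helper functions are illustrative):

```python
import math

def cos_sim(a: list[float], b: list[float]) -> float:
    # Cosine similarity normalizes by both magnitudes, so it only
    # measures direction.
    d = sum(x * y for x, y in zip(a, b))
    return d / (math.hypot(*a) * math.hypot(*b))

def dot(a: list[float], b: list[float]) -> float:
    # Raw dot product scales with vector length.
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]                        # same direction, double the magnitude
print(cos_sim(a, a), cos_sim(a, b))   # 1.0 1.0 -- cosine ignores magnitude
print(dot(a, a), dot(a, b))           # 25.0 50.0 -- dot product does not
```

Two documents pointing in the same semantic direction can thus rank very differently under dot product, which is why the metric should match what the embedding model was trained for.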

Search Methods in LangChain

# Basic similarity search (top-k)
results = vector_store.similarity_search("What is Oracle RAC?", k=3)

# Similarity search with relevance scores
results = vector_store.similarity_search_with_score("What is Oracle RAC?", k=3)

# Maximum Marginal Relevance (MMR) -- balances relevance and diversity
results = vector_store.max_marginal_relevance_search(
    "What is Oracle RAC?",
    k=3,           # Number of results to return
    fetch_k=20,    # Candidates to consider before MMR reranking
    lambda_mult=0.5  # 0=max diversity, 1=max relevance
)

Maximum Marginal Relevance (MMR) reduces redundancy in retrieved documents. Standard similarity search may return multiple chunks saying the same thing. MMR fetches a larger candidate set (fetch_k) then iteratively selects results that are both relevant to the query and diverse from each other. The lambda_mult parameter controls the trade-off: 0.0 maximizes diversity, 1.0 maximizes relevance (equivalent to standard similarity search).
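The MMR selection loop itself is straightforward to sketch in plain Python. This follows the standard MMR formulation (relevance minus a redundancy penalty, weighted by lambda_mult); it is a simplified stand-in for what max_marginal_relevance_search does internally, not LangChain's actual code.

```python
import math

def cos(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def mmr_select(query_vec, candidates, k=2, lambda_mult=0.5):
    # Greedily pick the candidate maximizing:
    #   lambda * relevance(query, cand)
    #   - (1 - lambda) * max similarity to anything already selected
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, candidates[i])
            redundancy = max((cos(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
cands = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]  # 0 and 1 are near-duplicates
print(mmr_select(query, cands, k=2, lambda_mult=0.5))  # [0, 2]: diverse pick
print(mmr_select(query, cands, k=2, lambda_mult=1.0))  # [0, 1]: pure relevance
```

With lambda_mult=0.5 the near-duplicate is penalized and the diverse candidate wins the second slot; with lambda_mult=1.0 the redundancy term vanishes and MMR degenerates to plain similarity ranking.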

Reranking with Cohere Rerank

OCI provides the cohere.rerank.v3-5 model to reorder retrieved documents by relevance. In a RAG pipeline, reranking is applied after initial retrieval (first-stage) to surface the most contextually relevant results before passing them to the LLM.

Reranking Benefit | Description
Improved precision | Reorders results based on deep query-document relevance analysis
Reduced token usage | Filters to top-N most relevant documents before sending to LLM
Lower latency | Fewer documents in the prompt = faster LLM inference
Better than embedding-only retrieval | Embedding similarity is an approximation; reranking cross-encodes query + document pairs

Source: Cohere Rerank 3.5 on OCI

8. Response Generation: Chain Types

After retrieving relevant chunks, you must pass them to the LLM for response generation. LangChain provides several chain types for this step.

RetrievalQA Chain Types

Chain Type | How It Works | Pros | Cons
stuff | Concatenates all retrieved documents into a single prompt, sends one LLM call | Simple, fast, single API call | Fails if combined context exceeds model's context window
map_reduce | Sends each document to the LLM separately, then combines answers in a final LLM call | Handles arbitrarily many documents, parallelizable | Many LLM calls (expensive, slower), may lose cross-document context
refine | Processes documents sequentially: generates an initial answer from the first document, then refines it with each subsequent document | Builds progressively detailed answers | Sequential (not parallelizable), slow for many documents

Exam trap: stuff is the default and best choice for most RAG use cases where retrieved context fits within the context window. Only use map_reduce or refine when the total retrieved text exceeds the model's token limit. map_reduce is parallelizable; refine is not.
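The stuff strategy and its failure mode can be sketched in a few lines. This toy version counts characters rather than tokens and raises instead of calling an LLM; stuff_chain is an illustrative name, not a LangChain API.

```python
def stuff_chain(docs: list[str], question: str, context_window: int = 100) -> str:
    # "stuff": concatenate every retrieved document into one prompt for
    # a single LLM call. Real systems count tokens, not characters.
    prompt = "Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {question}"
    if len(prompt) > context_window:
        raise ValueError("combined context exceeds the model's context window")
    return prompt

small = ["RAG grounds answers in retrieved text."]
print(stuff_chain(small, "What is RAG?"))   # fits: one prompt, one call

big = ["x" * 80, "y" * 80]                  # together they overflow the window
try:
    stuff_chain(big, "What is RAG?")
except ValueError as e:
    print(e)
```

The overflow case is exactly when you would switch to map_reduce (one call per document, parallel) or refine (sequential refinement).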

RAG Prompt Template Pattern

from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:""")

Complete RAG Chain with LCEL

from langchain_oci import ChatOCIGenAI, OCIGenAIEmbeddings
from langchain_oracledb.vectorstores.oraclevs import OracleVS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize components
embeddings = OCIGenAIEmbeddings(model_id="cohere.embed-english-v3.0", ...)
vector_store = OracleVS(client=conn, table_name="KNOWLEDGE_BASE",
                        embedding_function=embeddings)
llm = ChatOCIGenAI(model_id="cohere.command-r-plus-08-2024", ...)

# RAG function
def ask(question: str) -> str:
    docs = vector_store.similarity_search(question, k=3)
    context = "\n\n".join([doc.page_content for doc in docs])
    chain = rag_prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": question})

9. Conversational RAG

Conversational RAG extends single-turn RAG by maintaining chat history across multiple exchanges. This allows follow-up questions that reference earlier turns.

Memory Types

Memory Class | Behavior | Use Case
ConversationBufferMemory | Stores all messages verbatim | Short conversations, full context needed
ConversationBufferWindowMemory | Stores last k exchanges only | Longer conversations, bounded memory
ConversationSummaryMemory | Summarizes conversation history using an LLM | Very long conversations, reduces token usage

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

Exam trap: ConversationBufferMemory stores everything and will eventually exceed the context window for long conversations. ConversationBufferWindowMemory(k=5) only keeps the last 5 exchanges. ConversationSummaryMemory uses an additional LLM call to summarize history, which costs tokens but preserves the gist of long conversations.

Conversational RAG Architecture

In conversational RAG, the user's follow-up question (e.g., "What about its performance?") must be combined with chat history to reformulate a standalone query for retrieval. The typical pattern:

  1. User submits follow-up question
  2. Chat history + follow-up question are sent to an LLM to generate a standalone question
  3. Standalone question is embedded and used for similarity search
  4. Retrieved documents + chat history + question are sent to the LLM for final response
  5. Response and question are added to memory

This prevents the retriever from searching for "its performance" without knowing what "it" refers to.
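The steps above can be sketched with a stubbed rewrite function. In a real pipeline, step 2 is an LLM call; the hypothetical condense stub below fakes it with simple string substitution, purely to show the data flow.

```python
def condense(history: list[tuple[str, str]], followup: str) -> str:
    # Stub for step 2: a real pipeline sends history + follow-up to an
    # LLM and gets back a standalone question. Here we just resolve a
    # pronoun against the last topic discussed.
    if not history:
        return followup
    last_topic = history[-1][0]
    return followup.replace("its", f"{last_topic}'s")

# Step 1: the user asks a follow-up that only makes sense in context
history = [("Oracle RAC", "Oracle RAC is a clustered database option.")]
standalone = condense(history, "What about its performance?")
print(standalone)  # "What about Oracle RAC's performance?"

# Steps 3-5: embed `standalone` (not the raw follow-up), run
# similarity_search with it, generate the answer, then append the new
# (question, answer) pair to history.
```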

10. Quick-Reference Summary Table

Concept | Key Fact for Exam
RAG vs. fine-tuning | RAG = retrieval at inference, no weight changes. Fine-tuning = weight changes through training.
langchain-oci | Current recommended package (replaces langchain-community for OCI)
embed_query() vs. embed_documents() | Different Cohere input_type values -- do not interchange
Cohere Embed v3 dimensions | Full models: 1024. Light models: 384.
Max embedding inputs per batch | 96
Max tokens per embedding input | 512 (text-only models)
Default distance metric (Oracle 23ai) | Cosine
HNSW vs. IVF | HNSW = in-memory graph, faster search. IVF = partition-based, handles larger datasets.
FETCH EXACT FIRST k ROWS ONLY | Forces exact search, bypasses vector index
MMR lambda_mult | 0.0 = max diversity, 1.0 = max relevance
stuff chain type | Default. Concatenates all docs. Fails if context exceeds window.
map_reduce chain type | Parallel LLM calls per doc, then combines. Many API calls.
refine chain type | Sequential refinement. Not parallelizable.
Rerank model | cohere.rerank.v3-5 -- reorders retrieved docs by relevance
ConversationBufferWindowMemory(k=N) | Keeps only last N exchanges
Oracle VECTOR data type | Requires COMPATIBLE parameter 23.4.0+
Chunk size sweet spot | 400-512 tokens with 10-20% overlap for most RAG use cases

References