Domain 2: Using OCI Generative AI Service (40%)

Domain 2 of the 1Z0-1127-25 Oracle Cloud Infrastructure 2025 Generative AI Professional exam covers the OCI Generative AI managed service end to end: pretrained models, dedicated AI clusters, fine-tuning, endpoints, inference APIs, security, and the Playground. At 40% of the exam (approximately 20 out of 50 questions), this is by far the heaviest domain. The exam syllabus identifies these topic areas:

  1. Chat and embedding foundational models
  2. Dedicated AI clusters (fine-tuning and hosting)
  3. Fine-tuning base models with custom datasets
  4. Model endpoints and deployment
  5. Inference API parameters
  6. Security architecture and IAM policies
  7. OCI GenAI Playground

The exam format is 50 multiple-choice questions in 90 minutes with a passing score of 68%. Questions are scenario-based; expect items that test specific model IDs, parameter ranges, cluster unit types, and IAM resource names.


1. OCI Generative AI Service Fundamentals

OCI Generative AI is a fully managed Oracle Cloud service providing state-of-the-art large language models for chat, text generation, text embedding, and reranking. The service is accessed through the OCI Console (Playground), REST APIs, OCI CLI, and SDKs (Python, Java). (Overview)

Console path: Navigation Menu > Analytics & AI > AI Services > Generative AI

Two Operating Modes

Mode | Description | Use Case
On-Demand | Pay-per-inference; shared infrastructure; no cluster setup | Experimentation, PoC, model evaluation
Dedicated AI Cluster | Single-tenant GPU resources; customer-exclusive | Production workloads, fine-tuning, custom model hosting

On-demand mode caps response length at 4,000 tokens per run. Dedicated mode is uncapped up to the model's full context window. (Concepts)

Exam trap: On-demand text generation and summarization APIs are retired. Only chat and embedding are available on-demand. Generation/summarization models (e.g., cohere.command) can still run on dedicated clusters but not on-demand.
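
The on-demand response cap can be enforced as a client-side pre-flight check. A minimal sketch in plain Python -- the helper name and mode strings are illustrative, not part of the OCI SDK; only the 4,000-token limit comes from the docs:

```python
# Illustrative pre-flight check for the serving-mode response cap.
# The 4,000-token on-demand limit is documented; the helper itself
# is a hypothetical convenience, not an OCI SDK call.
ON_DEMAND_MAX_TOKENS = 4_000

def clamp_max_tokens(requested: int, mode: str, context_window: int) -> int:
    """Clamp a max_tokens request to what the serving mode allows."""
    if mode == "ON_DEMAND":
        return min(requested, ON_DEMAND_MAX_TOKENS)
    if mode == "DEDICATED":
        # Dedicated mode is uncapped up to the model's context window.
        return min(requested, context_window)
    raise ValueError(f"unknown serving mode: {mode}")

print(clamp_max_tokens(10_000, "ON_DEMAND", 128_000))   # 4000
print(clamp_max_tokens(10_000, "DEDICATED", 128_000))   # 10000
```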


2. Pretrained Foundational Models

The service offers models from multiple providers. For exam purposes, the core models to know are the Cohere and Meta families. (Pretrained Models)

2.1 Chat Models

Cohere Command Family

Model | Model ID | Context Window | Key Capabilities
Command A (03-2025) | cohere.command-a-03-2025 | 256K tokens | Most performant Cohere chat; agentic enterprise tasks
Command R+ (08-2024) | cohere.command-r-plus-08-2024 | 128K tokens | Complex tasks, Q&A, sentiment analysis, multilingual RAG
Command R (08-2024) | cohere.command-r-08-2024 | 128K tokens | Same capabilities as R+; more cost-efficient; supports fine-tuning
Command R (16K) | cohere.command-r-16k | 16K tokens | Retired; general language tasks
Command R+ | cohere.command-r-plus | 128K tokens | Retired

Exam trap: The older cohere.command-r-16k model has a 16K context window, not 128K. Do not confuse it with cohere.command-r-08-2024 which has 128K.

Meta Llama Family

Model | Model ID | Parameters | Context Window | Key Capabilities
Llama 3.3 (70B) | meta.llama-3.3-70b-instruct | 70B | 128K | Best 70B performance; on-demand and dedicated; supports fine-tuning
Llama 3.2 (90B Vision) | meta.llama-3.2-90b-vision-instruct | 90B | 128K | Multimodal (text + image)
Llama 3.2 (11B Vision) | meta.llama-3.2-11b-vision-instruct | 11B | 128K | Compact multimodal; dedicated only
Llama 3.1 (405B) | meta.llama-3.1-405b-instruct | 405B | 128K | Largest; advanced reasoning, coding, math, tool use
Llama 3.1 (70B) | meta.llama-3.1-70b-instruct | 70B | 128K | Retired; predecessor to 3.3

Exam trap: Llama 3.1 405B on-demand is only available in US Midwest (Chicago). All other regions require a dedicated cluster. The required cluster unit type is Large Generic 2 (not Large Generic 4, which was the older type).

Additional Providers (Newer Additions)

The service also offers Google Gemini (via Oracle Interconnect for Google Cloud, on-demand only), OpenAI gpt-oss models, and xAI Grok models. These are noted here for completeness but are less likely to be heavily tested since the exam syllabus was written around the Cohere/Meta core.

2.2 Embedding Models

All embedding models in OCI GenAI are Cohere Embed models. (Embed Models)

Model | Model ID | Dimensions | Input | Notes
Embed 4 | cohere.embed-v4.0 | 256, 512, 1024, 1536 (configurable) | Text + Image | Latest; multimodal; configurable output dimensions
Embed English 3 | cohere.embed-english-v3.0 | 1024 | Text only | English; 512 tokens/input; max 96 inputs/run
Embed English Light 3 | cohere.embed-english-light-v3.0 | 384 | Text only | Lightweight variant
Embed Multilingual 3 | cohere.embed-multilingual-v3.0 | 1024 | Text only | 100+ languages
Embed Multilingual Light 3 | cohere.embed-multilingual-light-v3.0 | 384 | Text only | Lightweight multilingual
Embed English Image 3 | cohere.embed-english-image-v3.0 | 1024 | Text + Image | Multimodal English
Embed Multilingual Image 3 | cohere.embed-multilingual-image-v3.0 | 1024 | Text + Image | Multimodal multilingual

Key facts for the exam:

  • Standard models output 1024 dimensions; Light models output 384 dimensions
  • Maximum 96 inputs per run for text-only models
  • Maximum 512 tokens per input for text-only Embed v3 models
  • Text + image models support up to 128,000 tokens total across all inputs
  • A 512x512 image consumes approximately 1,610 tokens
  • Image input is API only -- not available in the Console Playground
  • Embedding models cannot be fine-tuned
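
The per-run limits above lend themselves to a client-side sanity check before calling the embedding API. A minimal sketch -- the function name is a hypothetical helper, and it uses the documented ~4 characters/token estimate rather than a real tokenizer:

```python
# Hypothetical client-side validation of the documented Embed v3
# text-only limits: max 96 inputs per run, max 512 tokens per input.
MAX_INPUTS_PER_RUN = 96
MAX_TOKENS_PER_INPUT = 512
CHARS_PER_TOKEN = 4  # rough estimate used throughout the OCI GenAI docs

def validate_embed_inputs(inputs: list) -> list:
    """Return a list of validation errors for a text-only embed request."""
    errors = []
    if len(inputs) > MAX_INPUTS_PER_RUN:
        errors.append(f"too many inputs: {len(inputs)} > {MAX_INPUTS_PER_RUN}")
    for i, text in enumerate(inputs):
        est_tokens = len(text) // CHARS_PER_TOKEN
        if est_tokens > MAX_TOKENS_PER_INPUT:
            errors.append(f"input {i} is ~{est_tokens} tokens (> {MAX_TOKENS_PER_INPUT})")
    return errors

print(validate_embed_inputs(["short text"] * 96))  # [] -- exactly at the limit
print(validate_embed_inputs(["x" * 4000]))         # one error (~1000 tokens)
```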

2.3 Rerank Model

Model | Model ID | Function
Cohere Rerank 3.5 | cohere.rerank.v3-5 | Takes a query + list of texts, returns a ranked array with relevance scores

Reranking is used in RAG pipelines to re-order retrieved documents by relevance before passing them to the LLM.
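
A toy illustration of that rerank step: the response shape here (a list of entries carrying an index and a relevance score) mirrors the kind of ranked array a rerank model returns, but the exact field names are illustrative, not the OCI API schema:

```python
# Toy rerank step in a RAG pipeline: re-order retrieved documents by
# relevance score and keep only the top N before prompting the LLM.
# The rerank_results shape is an assumption for this sketch.
def apply_rerank(documents, rerank_results, top_n):
    """Return the top_n documents ordered by descending relevance score."""
    ranked = sorted(rerank_results, key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked[:top_n]]

docs = ["refund policy", "shipping times", "company history"]
results = [
    {"index": 0, "relevance_score": 0.91},
    {"index": 1, "relevance_score": 0.22},
    {"index": 2, "relevance_score": 0.05},
]
print(apply_rerank(docs, results, top_n=2))  # ['refund policy', 'shipping times']
```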


3. Dedicated AI Clusters

Dedicated AI clusters are single-tenant GPU compute resources for fine-tuning custom models or hosting endpoints. They are not shared with other tenancies. (Managing Dedicated AI Clusters)

3.1 Cluster Types

Type | Purpose | GPU Requirement
Fine-tuning | Train custom models from base models | Higher GPU count than hosting
Hosting | Serve endpoints for pretrained, custom, or imported models | Lower GPU count

Exam trap: Fine-tuning clusters require significantly more GPU resources than hosting clusters. You cannot use a hosting cluster for fine-tuning.

3.2 GPU Unit Shapes

Unit shape names follow the format: <Instance Type>_<Number of Cards>. Examples: H100_X1 = H100 with 1 card. For A100 shapes, the memory size distinguishes variants: A100-80G vs A100-40G. The unit shape cannot be changed after cluster creation. (Creating Hosting Clusters)

3.3 Cluster Unit Types by Model

Each model requires a specific cluster unit type. These are critical for the exam:

Model | Hosting Unit Type | Units for Hosting | Fine-Tuning Units | Fine-Tuning Method
cohere.command-r-08-2024 | Small Cohere V2 | 1 | 8 | T-Few or LoRA
cohere.command-r-plus-08-2024 | Large Cohere V2_2 | 1 | N/A | Not supported
meta.llama-3.3-70b-instruct | Large Generic | 1 | LoRA units | LoRA
meta.llama-3.1-405b-instruct | Large Generic 2 | 1 (x4 multiplier) | N/A | Not supported
cohere.embed-english-v3.0 | Embed Cohere | 1 | N/A | Not supported

Key facts:

  • Maximum 50 endpoints per cluster (increase requestable)
  • Multiple endpoints on the same cluster must use the same base model -- you cannot mix base models and custom models on one cluster
  • Model replicas: Increase throughput by adding units (each replica = 1 additional unit)
  • Cluster creation requires accepting commitment unit hours terms

3.4 Capacity and Scaling

  • Default: 1 unit created per cluster
  • Scale up by editing the cluster to add model replicas
  • Each replica increases throughput proportionally
  • Service limits control maximum units per shape (e.g., dedicated-unit-small-cohere-count, dedicated-unit-llama2-70-count)

Exam trap: To increase the Llama 3.1 405B hosting limit, you must request an increase of 4 units (not 1) because the multiplier is x4.
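
The multiplier arithmetic is simple but easy to get wrong under exam pressure. As a sketch (the function is an illustrative helper, not an OCI API):

```python
# Illustrative calculation of the service-limit increase needed to host
# a model whose cluster unit carries a multiplier, e.g. Llama 3.1 405B
# on Large Generic 2 with an x4 multiplier per hosting unit.
def required_limit_increase(hosting_units: int, unit_multiplier: int) -> int:
    """Units to request in a service-limit increase for one hosting cluster."""
    return hosting_units * unit_multiplier

# One Large Generic 2 unit for 405B hosting -> request 4, not 1.
print(required_limit_increase(hosting_units=1, unit_multiplier=4))  # 4
```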


4. Fine-Tuning Base Models

Fine-tuning creates a custom model by training a copy of a pretrained base model on your own dataset. (Fine-Tune Models)

4.1 Fine-Tuning Methods

OCI GenAI supports two fine-tuning methods. The system automatically selects the method based on the chosen base model -- you do not manually choose. (Selecting a Fine-Tuning Method)

Method | Supported Models | Description
T-Few | cohere.command-r-08-2024 | Adds learned vectors to transformer attention; trains only a few additional parameters. Oracle's efficient approach for Cohere models.
LoRA (Low-Rank Adaptation) | cohere.command-r-08-2024, meta.llama-3.3-70b-instruct, meta.llama-3.1-70b-instruct | Adds low-rank update matrices to attention layers; widely used parameter-efficient method.

Exam trap: cohere.command-r-08-2024 supports both T-Few and LoRA. The Llama models support only LoRA. cohere.command-r-plus-08-2024 does not support fine-tuning at all.

4.2 Training Dataset Requirements

(Training Data Requirements)

Requirement | Specification
File format | JSONL (JSON Lines)
Encoding | UTF-8
Line format | {"prompt": "<prompt>", "completion": "<response>"}
Minimum samples | 32 prompt/completion pairs
Maximum datasets per model | 1
Data split | Automatic: 80% training / 20% validation
Storage | OCI Object Storage bucket
Fine-tuning token limits (Command R) | Prompt up to 16,000 tokens; completion up to 4,000 tokens

Exam trap: The dataset must have exactly two fields: prompt and completion. Any other field structure will fail. The minimum is 32 pairs -- not 10, not 100.
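
Those rules (JSONL lines, exactly the prompt/completion fields, at least 32 pairs) can be checked locally before uploading. A minimal validator sketch -- the function name is illustrative, and this checks only the structural rules stated above, not token limits:

```python
import json

# Minimal validator for a fine-tuning dataset, assuming the documented
# rules: JSONL, exactly {"prompt", "completion"} per line, >= 32 pairs.
MIN_SAMPLES = 32

def validate_finetune_jsonl(raw: str) -> list:
    """Return a list of validation errors for a JSONL dataset string."""
    errors = []
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if set(record) != {"prompt", "completion"}:
            errors.append(f"line {i}: fields must be exactly prompt/completion")
    if len(lines) < MIN_SAMPLES:
        errors.append(f"only {len(lines)} samples; minimum is {MIN_SAMPLES}")
    return errors

good = "\n".join('{"prompt": "q%d", "completion": "a%d"}' % (i, i) for i in range(32))
print(validate_finetune_jsonl(good))                             # []
print(validate_finetune_jsonl('{"prompt": "q", "extra": "x"}'))  # 2 errors
```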

4.3 Fine-Tuning Workflow

  1. Create the training dataset in JSONL format
  2. Upload the dataset to an OCI Object Storage bucket
  3. Create a fine-tuning dedicated AI cluster (select the base model)
  4. Create a new custom model (or new version of existing model)
  5. Create a hosting dedicated AI cluster
  6. Create an endpoint for the custom model on the hosting cluster
  7. Test in the Playground or call via API

4.4 Hyperparameters

LoRA Hyperparameters (Meta Llama Models)

(Fine-Tuning Hyperparameters)

Parameter | Range | Default | Description
Total training epochs | 1+ (integer) | 3 | Iterations through the entire dataset
Learning rate | 0 to 1.0 | 0.0002 | Speed of weight updates
Training batch size | 8 to 16 | 8 | Samples per mini-batch
Early stopping patience | 0 or 1+ | 15 | Grace periods after threshold; 0 disables
Early stopping threshold | 0 or positive | 0.0001 | Minimum loss improvement
LoRA r (rank) | 1 to 64 | 8 | Attention dimension of the update matrices
LoRA alpha | 1 to 128 | 8 | Scaling parameter (weight = alpha / r)
LoRA dropout | 0 to < 1 | 0.1 | Dropout probability for LoRA layers
Log interval | Fixed | 10 steps | Not tunable

T-Few Hyperparameters (Cohere Models)

Parameter | Range | Default | Description
Total training epochs | 1 to 10 | 1 | Iterations through the entire dataset
Learning rate | 0.000005 to 0.1 | 0.01 | Speed of weight updates
Training batch size | 8 to 32 | 16 | Samples per mini-batch
Early stopping patience | 0 or 1 to 16 | 10 | Grace periods after threshold; 0 disables
Early stopping threshold | 0.001 to 0.1 | 0.001 | Minimum loss improvement
Log interval | Fixed | 1 step | Not tunable

Total training steps formula:

totalTrainingSteps = (totalTrainingEpochs * datasetSize) / trainingBatchSize

Exam trap: T-Few defaults to 1 epoch with learning rate 0.01. LoRA defaults to 3 epochs with learning rate 0.0002. These are very different -- know which is which. Also note T-Few batch size range is 8-32 (default 16) while LoRA is 8-16 (default 8).
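
To make the steps formula concrete, here it is as a function, evaluated with the default hyperparameters of each method on the same hypothetical 320-sample dataset:

```python
# The total-training-steps formula from the docs, expressed as code.
def total_training_steps(epochs: int, dataset_size: int, batch_size: int) -> float:
    return (epochs * dataset_size) / batch_size

# T-Few defaults: 1 epoch, batch size 16 -> 320 samples give 20 steps.
print(total_training_steps(epochs=1, dataset_size=320, batch_size=16))  # 20.0
# LoRA defaults: 3 epochs, batch size 8 -> the same dataset gives 120 steps.
print(total_training_steps(epochs=3, dataset_size=320, batch_size=8))   # 120.0
```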

4.5 Fine-Tuning vs. Prompt Engineering

Criteria | Prompt Engineering | Fine-Tuning
Cost | Low (no training) | High (dedicated cluster + training time)
Data required | None (examples in prompt) | Minimum 32 labeled samples
Setup time | Immediate | Hours to train
Best for | General tasks, format control | Domain-specific knowledge, specialized outputs
Model change | None | Creates a new custom model
Maintenance | Update prompts as needed | Retrain when data changes

5. Creating Model Endpoints

An endpoint makes a model available for inference. Every model (pretrained, custom, or imported) requires an endpoint on a dedicated cluster for dedicated mode. On-demand models do not require explicit endpoint creation. (Creating Endpoints)

5.1 Endpoint Types

Type | Availability | Description
Public endpoint | All model types | Default; accessible over the internet
Private endpoint | Pretrained and custom models only | Runs inside a VCN private subnet; requires a pre-created private endpoint resource

Exam trap: Imported models support public endpoints only -- private endpoints are not available for imported models.

5.2 Endpoint Configuration

Key settings during endpoint creation:

  • Compartment: Where the endpoint lives (recommended: same as model)
  • Model selection: Choose pretrained, custom, or imported model
  • Dedicated AI cluster: Select existing active cluster or create new one
  • Networking: Public (default) or Private endpoint
  • Guardrails: Content moderation, prompt injection defense, PII handling (pretrained and custom only)
  • Name: Auto-generated as generativeaiendpoint<timestamp> if not specified

5.3 Guardrails

Guardrails are optional safety controls configurable at the endpoint level for pretrained and custom models. They are not available for imported models. (Guardrails)

Guardrail | Function | Scoring
Content Moderation (CM) | Detects hate, harassment, sexual content, violence, self-harm | Binary: 0.0 (safe) / 1.0 (unsafe) + BLOCKLIST check
Prompt Injection (PI) | Detects "ignore previous instructions", system prompt exfiltration, hidden instructions | Binary: 0.0 / 1.0
PII Detection | Identifies names, emails, phone numbers, IDs, financial data | Confidence score 0.0-1.0 per detected entity

Key facts:

  • Guardrails are disabled by default -- must be explicitly enabled
  • Enabled via the ApplyGuardrails API or during endpoint creation in Console
  • PII detection returns specific fields: text, label, offset, length, score
  • Content Moderation tested on RTPLX dataset (38+ languages)
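
The five documented PII fields (text, label, offset, length, score) are enough to implement simple redaction on the caller's side. A sketch -- the field names are real, but the surrounding entity-list shape and the redact helper are assumptions for illustration:

```python
# Illustrative redaction using PII-detection results. Each entity dict
# carries the five documented fields: text, label, offset, length, score.
# The redact_pii helper itself is hypothetical, not an OCI API.
def redact_pii(text: str, entities: list, threshold: float = 0.5) -> str:
    """Replace detected PII spans above a confidence threshold with [LABEL]."""
    # Apply from the end of the string so earlier offsets remain valid.
    for ent in sorted(entities, key=lambda e: e["offset"], reverse=True):
        if ent["score"] >= threshold:
            start, end = ent["offset"], ent["offset"] + ent["length"]
            text = text[:start] + f"[{ent['label']}]" + text[end:]
    return text

entities = [{"text": "alice@example.com", "label": "EMAIL",
             "offset": 9, "length": 17, "score": 0.98}]
print(redact_pii("Contact: alice@example.com today", entities))
# Contact: [EMAIL] today
```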

6. Inference API

The OCI Generative AI Inference API provides three primary operations: Chat, EmbedText, and ApplyGuardrails.

6.1 Chat API Parameters

These parameters control model output behavior. Know the ranges and defaults for each. (Cohere Command R+ (08-2024), Meta Llama 3.1 (405B))

Parameter | Description | Cohere Default | Llama Default
Temperature | Controls randomness; 0 = deterministic, higher = more creative | Start at 0 or < 1 | Start at 0 or < 1
Top P | Nucleus sampling; cumulative probability threshold (0-1) | Model-specific | Model-specific
Top K | Samples from the top K most likely tokens | 0 (disabled; consider all) | -1 (consider all)
Frequency Penalty | Penalizes frequently appearing tokens; reduces repetition | 0 | 0
Presence Penalty | Penalizes tokens already used; encourages diversity | 0 | 0
Max Output Tokens | Maximum tokens generated per response | Up to 4,000 (on-demand) / 128K (dedicated) | Up to 4,000 (on-demand) / 128K (dedicated)
Seed | Deterministic output; Console max 9,999; API unlimited | null | null
Stop Sequences | Token sequences that halt generation | None | None

Exam trap: Top K default for Cohere is 0 (disabled). Top K default for Llama is -1 (consider all tokens). Both effectively consider all tokens, but the numeric defaults differ.
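
The differing numeric defaults are a classic memorization point, so here they are as code. The dict and helper are illustrative, not SDK objects; only the 0 vs. -1 values come from the table above:

```python
# Provider-specific top_k defaults per the table above. Both values mean
# "consider all tokens", but the numeric defaults differ per provider.
TOP_K_DEFAULTS = {"cohere": 0, "meta": -1}

def effective_top_k(provider, top_k=None):
    """Return the top_k to send, falling back to the provider default."""
    return TOP_K_DEFAULTS[provider] if top_k is None else top_k

print(effective_top_k("cohere"))     # 0  (disabled; consider all)
print(effective_top_k("meta"))       # -1 (consider all)
print(effective_top_k("meta", 40))   # 40 (explicit override)
```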

6.2 Cohere-Specific Parameters

Parameter | Values | Description
Preamble Override | Free text | System prompt; defaults to "You are Command..."
Safety Mode | CONTEXTUAL (default), STRICT, OFF | Controls content safety filtering
  • Contextual: Fewer constraints; allows profanity and explicit content (entertainment/academic)
  • Strict: Avoids sensitive topics (corporate/customer-facing)
  • Off: No safety filtering

6.3 Embedding API (EmbedText)

(EmbedTextDetails API)

Required parameters:

  • inputs (List[str]): Texts to embed (max 512 tokens each for v3 text-only)
  • compartment_id: Target compartment OCID
  • serving_mode: On-demand or dedicated

Optional parameters:

Parameter | Values | Description
input_type | SEARCH_DOCUMENT, SEARCH_QUERY, CLASSIFICATION, CLUSTERING, IMAGE | Optimizes the embedding for its intended use
truncate | NONE (default), START, END | Behavior when input exceeds the token limit
embedding_types | float, int8, uint8, binary, ubinary, base64 | Output format
output_dimensions | 256, 512, 1024, 1536 | Only for Embed v4+ models
is_echo | Boolean | Include the original inputs in the response

Exam trap: The input_type parameter is critical for RAG. Use SEARCH_DOCUMENT when embedding documents for storage. Use SEARCH_QUERY when embedding the user's search query. Using the wrong type degrades retrieval quality.
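
That document-versus-query convention can be pinned down in a tiny helper so a RAG pipeline never mixes them up. The role names here are hypothetical; the two input_type values are the real API constants:

```python
# Hypothetical helper encoding the RAG convention: documents are embedded
# with SEARCH_DOCUMENT at index time, user queries with SEARCH_QUERY.
def embed_input_type(role: str) -> str:
    mapping = {
        "index_document": "SEARCH_DOCUMENT",  # embedding documents for storage
        "user_query": "SEARCH_QUERY",         # embedding the user's search query
    }
    return mapping[role]

print(embed_input_type("index_document"))  # SEARCH_DOCUMENT
print(embed_input_type("user_query"))      # SEARCH_QUERY
```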

6.4 Token Estimation

Approximately 4 characters per token. This is a rough estimate used consistently across OCI GenAI documentation.
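
As a one-liner, with integer division as an assumption about rounding (the docs state only the rough ratio):

```python
# The ~4 characters/token rule of thumb as a helper. Integer division
# and the floor-of-1 are assumptions; the docs give only the rough ratio.
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Estimate token count using the ~4 chars/token heuristic."""
    return max(1, len(text) // chars_per_token)

print(estimate_tokens("OCI Generative AI is a fully managed service"))  # 11
```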

6.5 On-Demand vs. Dedicated Response Limits

Mode | Max Response Tokens | Context Window
On-demand | 4,000 tokens | Model's full context (e.g., 128K)
Dedicated | Uncapped (up to context window) | Model's full context

7. Security Architecture

7.1 IAM Policies

OCI GenAI uses standard OCI IAM for access control. Only the Administrators group has access by default; all other users require explicit policies. (IAM Policies)

Aggregate resource type: generative-ai-family (covers all 11 individual resource types)

Individual Resource Types

Resource Type | Controls
generative-ai-chat | Chat inference
generative-ai-text-generation | Text generation inference
generative-ai-text-summarization | Summarization inference
generative-ai-text-embedding | Embedding inference
generative-ai-model | Custom models
generative-ai-imported-model | Imported models
generative-ai-dedicated-ai-cluster | Dedicated AI clusters
generative-ai-endpoint | Model endpoints
generative-ai-private-endpoint | Private endpoints
generative-ai-apikey | API keys
generative-ai-work-request | Work requests

Permission Verbs (Cumulative)

Verb | Includes | Operations
inspect | -- | List resources
read | inspect | View details
use | read | Update, invoke inference
manage | use | Create, delete, move
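
The cumulative hierarchy can be expressed as an ordered list, which makes the "is this verb sufficient?" exam question mechanical. A sketch (the helper is illustrative; the verb order is the real IAM hierarchy):

```python
# The cumulative IAM verb hierarchy: each verb grants everything the
# verbs before it grant. The helper function is an illustrative check.
VERB_ORDER = ["inspect", "read", "use", "manage"]

def verb_includes(granted: str, needed: str) -> bool:
    """True if the granted verb is at least as powerful as the needed one."""
    return VERB_ORDER.index(granted) >= VERB_ORDER.index(needed)

print(verb_includes("use", "read"))     # True  -- use includes read
print(verb_includes("use", "manage"))   # False -- inference needs only use
print(verb_includes("manage", "use"))   # True
```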

Common policy examples:

-- Full access at tenancy level
allow group GenAI-Admins to manage generative-ai-family in tenancy

-- Compartment-scoped inference only
allow group GenAI-Users to use generative-ai-chat in compartment AI-Prod

-- Embedding inference only
allow group GenAI-Users to use generative-ai-text-embedding in compartment AI-Prod

-- Manage clusters and endpoints
allow group GenAI-Ops to manage generative-ai-dedicated-ai-cluster in compartment AI-Prod
allow group GenAI-Ops to manage generative-ai-endpoint in compartment AI-Prod

Exam trap: To use the chat API, the verb is use on resource generative-ai-chat, not manage. The use verb is sufficient for inference. manage is only needed for creating/deleting resources.

Fine-Tuning Data Access

Training datasets in Object Storage require separate policies:

-- Upload datasets
allow group GenAI-Admins to manage object-family in compartment Data-Bucket

-- Read datasets during model creation
allow group GenAI-Admins to use object-family in compartment Data-Bucket

If training data and custom models are in different compartments, the user creating the model needs use object-family in the compartment containing the bucket.

7.2 Network Security

Private Endpoints: Restrict model access to traffic from within a VCN. (Prerequisites for Private Endpoints)

Prerequisites:

  1. Create a VCN in the tenancy
  2. Create a private subnet in the VCN
  3. IAM policies: manage generative-ai-private-endpoint and manage virtual-network-family

The private endpoint is deployed as a VNIC in the private subnet. You manage the subnet's security rules and can optionally add network security groups (NSGs).

Service Gateway: For private network access to OCI services without internet traversal, use a VCN service gateway to reach OCI Generative AI through the Oracle Services Network.

7.3 Data Privacy and Model Isolation

  • Dedicated AI clusters are single-tenant -- your data and models are not shared
  • Custom model training data stays in your Object Storage bucket
  • Fine-tuned models are private to your tenancy
  • Guardrails provide additional content and PII controls

8. OCI GenAI Playground

The Playground is the Console-based no-code interface for testing models. It supports three modes:

Mode | Function | Available Models
Chat | Conversational interaction with chat models | All chat models (Cohere, Llama, etc.)
Embedding | Generate text/image embeddings | All Cohere Embed models
Generation | Text generation (legacy) | Retired from on-demand; dedicated clusters only

Playground Features

  • Parameter tuning: Adjust temperature, top-p, top-k, penalties, max tokens in the UI
  • Token display: Shows input and output token counts after each generation
  • Code export: "View code" generates code in multiple languages (Python, Java, etc.) with authentication pre-configured
  • Vision support: Upload .png or .jpg images (max 5 MB) for multimodal models
  • Example prompts: Pre-built prompts for quick testing
  • Embedding visualization: 2D projection of embedding vectors showing semantic similarity

Embedding Playground specifics:

  • Maximum 96 inputs per run
  • File upload: .txt files only, newline-separated entries
  • Truncate parameter defaults to NONE (returns error if exceeded)
  • Export embeddings as JSON

Exam trap: In the embedding Playground, the Truncate parameter resets to NONE every time you click Clear. You must re-set it before each run if you want truncation.


9. Regional Availability

Model availability varies significantly by region. Key patterns to know: (Models by Region)

Region | On-Demand | Dedicated | Notes
US Midwest (Chicago) | Broadest on-demand availability | Full dedicated support | Primary region for all models
US East (Ashburn) | Limited | Dedicated for most models | No on-demand for many models
Germany Central (Frankfurt) | Google Gemini (via interconnect) | Cohere, Llama, OpenAI | EU data residency
UK South (London) | Limited | Cohere, Llama | --
Japan Central (Osaka) | Limited | Cohere, Llama | APAC

Availability symbols in Oracle docs:

  • Check mark: Available (on-demand and dedicated)
  • Check mark + o: On-demand only
  • Check mark + d: Dedicated AI clusters only
  • Check mark + G: Available through Oracle Interconnect for Google Cloud only

Exam trap: Google Gemini and xAI Grok models make external calls -- they route through Google Cloud or xAI infrastructure respectively. If data sovereignty is a concern, these models may not be appropriate. Cohere and Meta models run on OCI infrastructure.


10. Exam Focus: Common Traps and Pitfalls

  1. On-demand response cap: Always 4,000 tokens. Dedicated mode is uncapped. If a question asks about maximum response length, the answer depends on the deployment mode.

  2. Fine-tuning model support: Only cohere.command-r-08-2024, meta.llama-3.3-70b-instruct, and meta.llama-3.1-70b-instruct support fine-tuning. Command R+ does not. The 405B model does not. Embedding models do not.

  3. T-Few vs. LoRA: The system auto-selects. T-Few is for Cohere; LoRA is for Llama (and also available for Cohere Command R 08-2024). You do not manually pick during model creation.

  4. JSONL format: The only accepted format for training data is JSONL with exactly {"prompt": "...", "completion": "..."}. Minimum 32 pairs. One dataset per model. Auto-split 80/20.

  5. Cluster unit types: Each model has a specific required unit type. You cannot mix models requiring different unit types on the same cluster. Know Small Cohere V2, Large Cohere V2_2, Large Generic 2, Embed Cohere.

  6. IAM resource names: generative-ai-family is the aggregate type. For chat inference, the resource is generative-ai-chat with the use verb. For managing clusters, you need manage on generative-ai-dedicated-ai-cluster.

  7. Private endpoints: Only for pretrained and custom models. Imported models are public only. Requires VCN + private subnet + IAM policies for both GenAI and virtual-network-family.

  8. Embedding input_type: SEARCH_DOCUMENT for indexing documents, SEARCH_QUERY for queries. Using the correct type optimizes retrieval. Also available: CLASSIFICATION, CLUSTERING, IMAGE.

  9. Embed dimensions: Standard = 1024, Light = 384. Embed v4 supports configurable dimensions (256, 512, 1024, 1536).

  10. Guardrails disabled by default: You must explicitly enable content moderation, prompt injection defense, and PII detection. They do not run automatically.


References