Domain 2: Using OCI Generative AI Service (40%)

Domain 2 of the 1Z0-1127-25 Oracle Cloud Infrastructure 2025 Generative AI Professional exam covers the OCI Generative AI managed service end to end: pretrained models, dedicated AI clusters, fine-tuning, endpoints, inference APIs, security, and the Playground. At 40% of the exam (approximately 20 out of 50 questions), this is by far the heaviest domain. The exam syllabus identifies these topic areas:

  1. Chat and embedding foundational models
  2. Dedicated AI clusters (fine-tuning and hosting)
  3. Fine-tuning base models with custom datasets
  4. Model endpoints and deployment
  5. Inference API parameters
  6. Security architecture and IAM policies
  7. OCI GenAI Playground

The exam format is 50 multiple-choice questions in 90 minutes with a passing score of 68%. Questions are scenario-based; expect items that test specific model IDs, parameter ranges, cluster unit types, and IAM resource names.


1. OCI Generative AI Service Fundamentals

OCI Generative AI is a fully managed Oracle Cloud service providing state-of-the-art large language models for chat, text generation, text embedding, and reranking. The service is accessed through the OCI Console (Playground), REST APIs, OCI CLI, and SDKs (Python, Java). (Overview)

Console path: Navigation Menu > Analytics & AI > AI Services > Generative AI

Two Operating Modes

Mode | Description | Use Case
On-Demand | Pay-per-inference; shared infrastructure; no cluster setup | Experimentation, PoC, model evaluation
Dedicated AI Cluster | Single-tenant GPU resources; customer-exclusive | Production workloads, fine-tuning, custom model hosting

On-demand mode caps response length at 4,000 tokens per run. Dedicated mode is uncapped up to the model's full context window. (Concepts)

Exam trap: On-demand text generation and summarization APIs are retired. Only chat and embedding are available on-demand. Generation/summarization models (e.g., cohere.command) can still run on dedicated clusters but not on-demand.
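
The on-demand response cap can be enforced as a client-side pre-flight check. A minimal sketch in plain Python -- the helper name and mode strings are illustrative, not part of the OCI SDK; only the 4,000-token limit comes from the docs:

```python
# Illustrative pre-flight check for the serving-mode response cap.
# The 4,000-token on-demand limit is documented; the helper itself
# is a hypothetical convenience, not an OCI SDK call.
ON_DEMAND_MAX_TOKENS = 4_000

def clamp_max_tokens(requested: int, mode: str, context_window: int) -> int:
    """Clamp a max_tokens request to what the serving mode allows."""
    if mode == "ON_DEMAND":
        return min(requested, ON_DEMAND_MAX_TOKENS)
    if mode == "DEDICATED":
        # Dedicated mode is uncapped up to the model's context window.
        return min(requested, context_window)
    raise ValueError(f"unknown serving mode: {mode}")

print(clamp_max_tokens(10_000, "ON_DEMAND", 128_000))   # 4000
print(clamp_max_tokens(10_000, "DEDICATED", 128_000))   # 10000
```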


2. Pretrained Foundational Models

The service offers models from multiple providers. For exam purposes, the core models to know are the Cohere and Meta families. (Pretrained Models)

2.1 Chat Models

Cohere Command Family

Model | Model ID | Context Window | Key Capabilities
Command A (03-2025) | cohere.command-a-03-2025 | 256K tokens | Most performant Cohere chat; agentic enterprise tasks
Command R+ (08-2024) | cohere.command-r-plus-08-2024 | 128K tokens | Complex tasks, Q&A, sentiment analysis, multilingual RAG
Command R (08-2024) | cohere.command-r-08-2024 | 128K tokens | Same capabilities as R+; more cost-efficient; supports fine-tuning
Command R (16K) | cohere.command-r-16k | 16K tokens | Retired; general language tasks
Command R+ | cohere.command-r-plus | 128K tokens | Retired

Exam trap: The older cohere.command-r-16k model has a 16K context window, not 128K. Do not confuse it with cohere.command-r-08-2024 which has 128K.

Meta Llama Family

Model | Model ID | Parameters | Context Window | Key Capabilities
Llama 3.3 (70B) | meta.llama-3.3-70b-instruct | 70B | 128K | Best 70B performance; on-demand and dedicated; supports fine-tuning
Llama 3.2 (90B Vision) | meta.llama-3.2-90b-vision-instruct | 90B | 128K | Multimodal (text + image)
Llama 3.2 (11B Vision) | meta.llama-3.2-11b-vision-instruct | 11B | 128K | Compact multimodal; dedicated only
Llama 3.1 (405B) | meta.llama-3.1-405b-instruct | 405B | 128K | Largest; advanced reasoning, coding, math, tool use
Llama 3.1 (70B) | meta.llama-3.1-70b-instruct | 70B | 128K | Retired; predecessor to 3.3

Exam trap: Llama 3.1 405B on-demand is only available in US Midwest (Chicago). All other regions require a dedicated cluster. The required cluster unit type is Large Generic 2 (not Large Generic 4, which was the older type).

Additional Providers (Newer Additions)

The service also offers Google Gemini (via Oracle Interconnect for Google Cloud, on-demand only), OpenAI gpt-oss models, and xAI Grok models. These are noted here for completeness but are less likely to be heavily tested since the exam syllabus was written around the Cohere/Meta core.

2.2 Embedding Models

All embedding models in OCI GenAI are Cohere Embed models. (Embed Models)

Model | Model ID | Dimensions | Input | Notes
Embed 4 | cohere.embed-v4.0 | 256, 512, 1024, 1536 (configurable) | Text + Image | Latest; multimodal; configurable output dimensions
Embed English 3 | cohere.embed-english-v3.0 | 1024 | Text only | English; 512 tokens/input; max 96 inputs/run
Embed English Light 3 | cohere.embed-english-light-v3.0 | 384 | Text only | Lightweight variant
Embed Multilingual 3 | cohere.embed-multilingual-v3.0 | 1024 | Text only | 100+ languages
Embed Multilingual Light 3 | cohere.embed-multilingual-light-v3.0 | 384 | Text only | Lightweight multilingual
Embed English Image 3 | cohere.embed-english-image-v3.0 | 1024 | Text + Image | Multimodal English
Embed Multilingual Image 3 | cohere.embed-multilingual-image-v3.0 | 1024 | Text + Image | Multimodal multilingual

Key facts for the exam:

  • Standard models output 1024 dimensions; Light models output 384 dimensions
  • Maximum 96 inputs per run for text-only models
  • Maximum 512 tokens per input for text-only Embed v3 models
  • Text + image models support up to 128,000 tokens total across all inputs
  • A 512x512 image consumes approximately 1,610 tokens
  • Image input is API only -- not available in the Console Playground
  • Embedding models cannot be fine-tuned
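
The per-run limits above lend themselves to a client-side sanity check before calling the embedding API. A minimal sketch -- the function name is a hypothetical helper, and it uses the documented ~4 characters/token estimate rather than a real tokenizer:

```python
# Hypothetical client-side validation of the documented Embed v3
# text-only limits: max 96 inputs per run, max 512 tokens per input.
MAX_INPUTS_PER_RUN = 96
MAX_TOKENS_PER_INPUT = 512
CHARS_PER_TOKEN = 4  # rough estimate used throughout the OCI GenAI docs

def validate_embed_inputs(inputs: list) -> list:
    """Return a list of validation errors for a text-only embed request."""
    errors = []
    if len(inputs) > MAX_INPUTS_PER_RUN:
        errors.append(f"too many inputs: {len(inputs)} > {MAX_INPUTS_PER_RUN}")
    for i, text in enumerate(inputs):
        est_tokens = len(text) // CHARS_PER_TOKEN
        if est_tokens > MAX_TOKENS_PER_INPUT:
            errors.append(f"input {i} is ~{est_tokens} tokens (> {MAX_TOKENS_PER_INPUT})")
    return errors

print(validate_embed_inputs(["short text"] * 96))  # [] -- exactly at the limit
print(validate_embed_inputs(["x" * 4000]))         # one error (~1000 tokens)
```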

2.3 Rerank Model

Model | Model ID | Function
Cohere Rerank 3.5 | cohere.rerank.v3-5 | Takes a query + list of texts, returns a ranked array with relevance scores

Reranking is used in RAG pipelines to re-order retrieved documents by relevance before passing them to the LLM.
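
A toy illustration of that rerank step: the response shape here (a list of entries carrying an index and a relevance score) mirrors the kind of ranked array a rerank model returns, but the exact field names are illustrative, not the OCI API schema:

```python
# Toy rerank step in a RAG pipeline: re-order retrieved documents by
# relevance score and keep only the top N before prompting the LLM.
# The rerank_results shape is an assumption for this sketch.
def apply_rerank(documents, rerank_results, top_n):
    """Return the top_n documents ordered by descending relevance score."""
    ranked = sorted(rerank_results, key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked[:top_n]]

docs = ["refund policy", "shipping times", "company history"]
results = [
    {"index": 0, "relevance_score": 0.91},
    {"index": 1, "relevance_score": 0.22},
    {"index": 2, "relevance_score": 0.05},
]
print(apply_rerank(docs, results, top_n=2))  # ['refund policy', 'shipping times']
```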


3. Dedicated AI Clusters

Dedicated AI clusters are single-tenant GPU compute resources for fine-tuning custom models or hosting endpoints. They are not shared with other tenancies. (Managing Dedicated AI Clusters)

3.1 Cluster Types

Type | Purpose | GPU Requirement
Fine-tuning | Train custom models from base models | Higher GPU count than hosting
Hosting | Serve endpoints for pretrained, custom, or imported models | Lower GPU count

Exam trap: Fine-tuning clusters require significantly more GPU resources than hosting clusters. You cannot use a hosting cluster for fine-tuning.

3.2 GPU Unit Shapes

Unit shape names follow the format: <Instance Type>_<Number of Cards>. Examples: H100_X1 = H100 with 1 card. For A100 shapes, the memory size distinguishes variants: A100-80G vs A100-40G. The unit shape cannot be changed after cluster creation. (Creating Hosting Clusters)

3.3 Cluster Unit Types by Model

Each model requires a specific cluster unit type. These are critical for the exam:

Model | Hosting Unit Type | Units for Hosting | Fine-Tuning Units | Fine-Tuning Method
cohere.command-r-08-2024 | Small Cohere V2 | 1 | 8 | T-Few or LoRA
cohere.command-r-plus-08-2024 | Large Cohere V2_2 | 1 | N/A | Not supported
meta.llama-3.3-70b-instruct | Large Generic | 1 | LoRA units | LoRA
meta.llama-3.1-405b-instruct | Large Generic 2 | 1 (x4 multiplier) | N/A | Not supported
cohere.embed-english-v3.0 | Embed Cohere | 1 | N/A | Not supported

Key facts:

  • Maximum 50 endpoints per cluster (increase requestable)
  • Multiple endpoints on the same cluster must use the same base model -- you cannot mix base models and custom models on one cluster
  • Model replicas: Increase throughput by adding units (each replica = 1 additional unit)
  • Cluster creation requires accepting commitment unit hours terms

3.4 Capacity and Scaling

  • Default: 1 unit created per cluster
  • Scale up by editing the cluster to add model replicas
  • Each replica increases throughput proportionally
  • Service limits control maximum units per shape (e.g., dedicated-unit-small-cohere-count, dedicated-unit-llama2-70-count)

Exam trap: To increase the Llama 3.1 405B hosting limit, you must request an increase of 4 units (not 1) because the multiplier is x4.
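
The multiplier arithmetic is simple but easy to get wrong under exam pressure. As a sketch (the function is an illustrative helper, not an OCI API):

```python
# Illustrative calculation of the service-limit increase needed to host
# a model whose cluster unit carries a multiplier, e.g. Llama 3.1 405B
# on Large Generic 2 with an x4 multiplier per hosting unit.
def required_limit_increase(hosting_units: int, unit_multiplier: int) -> int:
    """Units to request in a service-limit increase for one hosting cluster."""
    return hosting_units * unit_multiplier

# One Large Generic 2 unit for 405B hosting -> request 4, not 1.
print(required_limit_increase(hosting_units=1, unit_multiplier=4))  # 4
```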


4. Fine-Tuning Base Models

Fine-tuning creates a custom model by training a copy of a pretrained base model on your own dataset. (Fine-Tune Models)

4.1 Fine-Tuning Methods

OCI GenAI supports two fine-tuning methods. The system automatically selects the method based on the chosen base model -- you do not manually choose. (Selecting a Fine-Tuning Method)

Method | Supported Models | Description
T-Few | cohere.command-r-08-2024 | Adds learned vectors to transformer attention; trains only a few additional parameters. Oracle's efficient approach for Cohere models.
LoRA (Low-Rank Adaptation) | cohere.command-r-08-2024, meta.llama-3.3-70b-instruct, meta.llama-3.1-70b-instruct | Adds low-rank update matrices to attention layers; widely used parameter-efficient method.

Exam trap: cohere.command-r-08-2024 supports both T-Few and LoRA. The Llama models support only LoRA. cohere.command-r-plus-08-2024 does not support fine-tuning at all.

4.2 Training Dataset Requirements

(Training Data Requirements)

Requirement | Specification
File format | JSONL (JSON Lines)
Encoding | UTF-8
Line format | {"prompt": "<prompt>", "completion": "<response>"}
Minimum samples | 32 prompt/completion pairs
Maximum datasets per model | 1
Data split | Automatic: 80% training / 20% validation
Storage | OCI Object Storage bucket
Fine-tuning token limits (Command R) | Prompt up to 16,000 tokens; completion up to 4,000 tokens

Exam trap: The dataset must have exactly two fields: prompt and completion. Any other field structure will fail. The minimum is 32 pairs -- not 10, not 100.
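
Those rules (JSONL lines, exactly the prompt/completion fields, at least 32 pairs) can be checked locally before uploading. A minimal validator sketch -- the function name is illustrative, and this checks only the structural rules stated above, not token limits:

```python
import json

# Minimal validator for a fine-tuning dataset, assuming the documented
# rules: JSONL, exactly {"prompt", "completion"} per line, >= 32 pairs.
MIN_SAMPLES = 32

def validate_finetune_jsonl(raw: str) -> list:
    """Return a list of validation errors for a JSONL dataset string."""
    errors = []
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if set(record) != {"prompt", "completion"}:
            errors.append(f"line {i}: fields must be exactly prompt/completion")
    if len(lines) < MIN_SAMPLES:
        errors.append(f"only {len(lines)} samples; minimum is {MIN_SAMPLES}")
    return errors

good = "\n".join('{"prompt": "q%d", "completion": "a%d"}' % (i, i) for i in range(32))
print(validate_finetune_jsonl(good))                             # []
print(validate_finetune_jsonl('{"prompt": "q", "extra": "x"}'))  # 2 errors
```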

4.3 Fine-Tuning Workflow

  1. Create the training dataset in JSONL format
  2. Upload the dataset to an OCI Object Storage bucket
  3. Create a fine-tuning dedicated AI cluster (select the base model)
  4. Create a new custom model (or new version of existing model)
  5. Create a hosting dedicated AI cluster
  6. Create an endpoint for the custom model on the hosting cluster
  7. Test in the Playground or call via API

4.4 Hyperparameters

LoRA Hyperparameters (Meta Llama Models)

(Fine-Tuning Hyperparameters)

Parameter | Range | Default | Description
Total training epochs | 1+ (integer) | 3 | Iterations through the entire dataset
Learning rate | 0 to 1.0 | 0.0002 | Speed of weight updates
Training batch size | 8 to 16 | 8 | Samples per mini-batch
Early stopping patience | 0 or 1+ | 15 | Grace periods after threshold; 0 disables
Early stopping threshold | 0 or positive | 0.0001 | Minimum loss improvement
LoRA r (rank) | 1 to 64 | 8 | Attention dimension of the update matrices
LoRA alpha | 1 to 128 | 8 | Scaling parameter (weight = alpha / r)
LoRA dropout | 0 to < 1 | 0.1 | Dropout probability for LoRA layers
Log interval | Fixed | 10 steps | Not tunable

T-Few Hyperparameters (Cohere Models)

Parameter | Range | Default | Description
Total training epochs | 1 to 10 | 1 | Iterations through the entire dataset
Learning rate | 0.000005 to 0.1 | 0.01 | Speed of weight updates
Training batch size | 8 to 32 | 16 | Samples per mini-batch
Early stopping patience | 0 or 1 to 16 | 10 | Grace periods after threshold; 0 disables
Early stopping threshold | 0.001 to 0.1 | 0.001 | Minimum loss improvement
Log interval | Fixed | 1 step | Not tunable

Total training steps formula:

totalTrainingSteps = (totalTrainingEpochs * datasetSize) / trainingBatchSize

Exam trap: T-Few defaults to 1 epoch with learning rate 0.01. LoRA defaults to 3 epochs with learning rate 0.0002. These are very different -- know which is which. Also note T-Few batch size range is 8-32 (default 16) while LoRA is 8-16 (default 8).
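
To make the steps formula concrete, here it is as a function, evaluated with the default hyperparameters of each method on the same hypothetical 320-sample dataset:

```python
# The total-training-steps formula from the docs, expressed as code.
def total_training_steps(epochs: int, dataset_size: int, batch_size: int) -> float:
    return (epochs * dataset_size) / batch_size

# T-Few defaults: 1 epoch, batch size 16 -> 320 samples give 20 steps.
print(total_training_steps(epochs=1, dataset_size=320, batch_size=16))  # 20.0
# LoRA defaults: 3 epochs, batch size 8 -> the same dataset gives 120 steps.
print(total_training_steps(epochs=3, dataset_size=320, batch_size=8))   # 120.0
```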

4.5 Fine-Tuning vs. Prompt Engineering

Criteria | Prompt Engineering | Fine-Tuning
Cost | Low (no training) | High (dedicated cluster + training time)
Data required | None (examples in prompt) | Minimum 32 labeled samples
Setup time | Immediate | Hours to train
Best for | General tasks, format control | Domain-specific knowledge, specialized outputs
Model change | None | Creates a new custom model
Maintenance | Update prompts as needed | Retrain when data changes

5. Creating Model Endpoints

An endpoint makes a model available for inference. Every model (pretrained, custom, or imported) requires an endpoint on a dedicated cluster for dedicated mode. On-demand models do not require explicit endpoint creation. (Creating Endpoints)

5.1 Endpoint Types

Type | Availability | Description
Public endpoint | All model types | Default; accessible over the internet
Private endpoint | Pretrained and custom models only | Runs inside a VCN private subnet; requires a pre-created private endpoint resource

Exam trap: Imported models support public endpoints only -- private endpoints are not available for imported models.

5.2 Endpoint Configuration

Key settings during endpoint creation:

  • Compartment: Where the endpoint lives (recommended: same as model)
  • Model selection: Choose pretrained, custom, or imported model
  • Dedicated AI cluster: Select existing active cluster or create new one
  • Networking: Public (default) or Private endpoint
  • Guardrails: Content moderation, prompt injection defense, PII handling (pretrained and custom only)
  • Name: Auto-generated as generativeaiendpoint<timestamp> if not specified

5.3 Guardrails

Guardrails are optional safety controls configurable at the endpoint level for pretrained and custom models. They are not available for imported models. (Guardrails)

Guardrail | Function | Scoring
Content Moderation (CM) | Detects hate, harassment, sexual content, violence, self-harm | Binary: 0.0 (safe) / 1.0 (unsafe) + BLOCKLIST check
Prompt Injection (PI) | Detects "ignore previous instructions", system prompt exfiltration, hidden instructions | Binary: 0.0 / 1.0
PII Detection | Identifies names, emails, phone numbers, IDs, financial data | Confidence score 0.0-1.0 per detected entity

Key facts:

  • Guardrails are disabled by default -- must be explicitly enabled
  • Enabled via the ApplyGuardrails API or during endpoint creation in Console
  • PII detection returns specific fields: text, label, offset, length, score
  • Content Moderation tested on RTPLX dataset (38+ languages)
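
The five documented PII fields (text, label, offset, length, score) are enough to implement simple redaction on the caller's side. A sketch -- the field names are real, but the surrounding entity-list shape and the redact helper are assumptions for illustration:

```python
# Illustrative redaction using PII-detection results. Each entity dict
# carries the five documented fields: text, label, offset, length, score.
# The redact_pii helper itself is hypothetical, not an OCI API.
def redact_pii(text: str, entities: list, threshold: float = 0.5) -> str:
    """Replace detected PII spans above a confidence threshold with [LABEL]."""
    # Apply from the end of the string so earlier offsets remain valid.
    for ent in sorted(entities, key=lambda e: e["offset"], reverse=True):
        if ent["score"] >= threshold:
            start, end = ent["offset"], ent["offset"] + ent["length"]
            text = text[:start] + f"[{ent['label']}]" + text[end:]
    return text

entities = [{"text": "alice@example.com", "label": "EMAIL",
             "offset": 9, "length": 17, "score": 0.98}]
print(redact_pii("Contact: alice@example.com today", entities))
# Contact: [EMAIL] today
```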

6. Inference API

The OCI Generative AI Inference API provides three primary operations: Chat, EmbedText, and ApplyGuardrails.

6.1 Chat API Parameters

These parameters control model output behavior. Know the ranges and defaults for each. (Cohere Command R+ (08-2024), Meta Llama 3.1 (405B))

Parameter | Description | Cohere Default | Llama Default
Temperature | Controls randomness; 0 = deterministic, higher = more creative | Start at 0 or < 1 | Start at 0 or < 1
Top P | Nucleus sampling; cumulative probability threshold (0-1) | Model-specific | Model-specific
Top K | Samples from the top K most likely tokens | 0 (disabled; consider all) | -1 (consider all)
Frequency Penalty | Penalizes frequently appearing tokens; reduces repetition | 0 | 0
Presence Penalty | Penalizes tokens already used; encourages diversity | 0 | 0
Max Output Tokens | Maximum tokens generated per response | Up to 4,000 (on-demand) / 128K (dedicated) | Up to 4,000 (on-demand) / 128K (dedicated)
Seed | Deterministic output; Console max 9,999; API unlimited | null | null
Stop Sequences | Token sequences that halt generation | None | None

Exam trap: Top K default for Cohere is 0 (disabled). Top K default for Llama is -1 (consider all tokens). Both effectively consider all tokens, but the numeric defaults differ.
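
The differing numeric defaults are a classic memorization point, so here they are as code. The dict and helper are illustrative, not SDK objects; only the 0 vs. -1 values come from the table above:

```python
# Provider-specific top_k defaults per the table above. Both values mean
# "consider all tokens", but the numeric defaults differ per provider.
TOP_K_DEFAULTS = {"cohere": 0, "meta": -1}

def effective_top_k(provider, top_k=None):
    """Return the top_k to send, falling back to the provider default."""
    return TOP_K_DEFAULTS[provider] if top_k is None else top_k

print(effective_top_k("cohere"))     # 0  (disabled; consider all)
print(effective_top_k("meta"))       # -1 (consider all)
print(effective_top_k("meta", 40))   # 40 (explicit override)
```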

6.2 Cohere-Specific Parameters

Parameter | Values | Description
Preamble Override | Free text | System prompt; defaults to "You are Command..."
Safety Mode | CONTEXTUAL (default), STRICT, OFF | Controls content safety filtering
  • Contextual: Fewer constraints; allows profanity and explicit content (entertainment/academic)
  • Strict: Avoids sensitive topics (corporate/customer-facing)
  • Off: No safety filtering

6.3 Embedding API (EmbedText)

(EmbedTextDetails API)

Required parameters:

  • inputs (List[str]): Texts to embed (max 512 tokens each for v3 text-only)
  • compartment_id: Target compartment OCID
  • serving_mode: On-demand or dedicated

Optional parameters:

Parameter | Values | Description
input_type | SEARCH_DOCUMENT, SEARCH_QUERY, CLASSIFICATION, CLUSTERING, IMAGE | Optimizes the embedding for its intended use
truncate | NONE (default), START, END | Behavior when input exceeds the token limit
embedding_types | float, int8, uint8, binary, ubinary, base64 | Output format
output_dimensions | 256, 512, 1024, 1536 | Only for Embed v4+ models
is_echo | Boolean | Include the original inputs in the response

Exam trap: The input_type parameter is critical for RAG. Use SEARCH_DOCUMENT when embedding documents for storage. Use SEARCH_QUERY when embedding the user's search query. Using the wrong type degrades retrieval quality.
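
That document-versus-query convention can be pinned down in a tiny helper so a RAG pipeline never mixes them up. The role names here are hypothetical; the two input_type values are the real API constants:

```python
# Hypothetical helper encoding the RAG convention: documents are embedded
# with SEARCH_DOCUMENT at index time, user queries with SEARCH_QUERY.
def embed_input_type(role: str) -> str:
    mapping = {
        "index_document": "SEARCH_DOCUMENT",  # embedding documents for storage
        "user_query": "SEARCH_QUERY",         # embedding the user's search query
    }
    return mapping[role]

print(embed_input_type("index_document"))  # SEARCH_DOCUMENT
print(embed_input_type("user_query"))      # SEARCH_QUERY
```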

6.4 Token Estimation

Approximately 4 characters per token. This is a rough estimate used consistently across OCI GenAI documentation.
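
As a one-liner, with integer division as an assumption about rounding (the docs state only the rough ratio):

```python
# The ~4 characters/token rule of thumb as a helper. Integer division
# and the floor-of-1 are assumptions; the docs give only the rough ratio.
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Estimate token count using the ~4 chars/token heuristic."""
    return max(1, len(text) // chars_per_token)

print(estimate_tokens("OCI Generative AI is a fully managed service"))  # 11
```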

6.5 On-Demand vs. Dedicated Response Limits

Mode | Max Response Tokens | Context Window
On-demand | 4,000 tokens | Model's full context (e.g., 128K)
Dedicated | Uncapped (up to context window) | Model's full context

7. Security Architecture

7.1 IAM Policies

OCI GenAI uses standard OCI IAM for access control. Only the Administrators group has access by default; all other users require explicit policies. (IAM Policies)

Aggregate resource type: generative-ai-family (covers all 11 individual resource types)

Individual Resource Types

Resource Type | Controls
generative-ai-chat | Chat inference
generative-ai-text-generation | Text generation inference
generative-ai-text-summarization | Summarization inference
generative-ai-text-embedding | Embedding inference
generative-ai-model | Custom models
generative-ai-imported-model | Imported models
generative-ai-dedicated-ai-cluster | Dedicated AI clusters
generative-ai-endpoint | Model endpoints
generative-ai-private-endpoint | Private endpoints
generative-ai-apikey | API keys
generative-ai-work-request | Work requests

Permission Verbs (Cumulative)

Verb | Includes | Operations
inspect | -- | List resources
read | inspect | View details
use | read | Update, invoke inference
manage | use | Create, delete, move
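
The cumulative hierarchy can be expressed as an ordered list, which makes the "is this verb sufficient?" exam question mechanical. A sketch (the helper is illustrative; the verb order is the real IAM hierarchy):

```python
# The cumulative IAM verb hierarchy: each verb grants everything the
# verbs before it grant. The helper function is an illustrative check.
VERB_ORDER = ["inspect", "read", "use", "manage"]

def verb_includes(granted: str, needed: str) -> bool:
    """True if the granted verb is at least as powerful as the needed one."""
    return VERB_ORDER.index(granted) >= VERB_ORDER.index(needed)

print(verb_includes("use", "read"))     # True  -- use includes read
print(verb_includes("use", "manage"))   # False -- inference needs only use
print(verb_includes("manage", "use"))   # True
```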

Common policy examples:

-- Full access at tenancy level
allow group GenAI-Admins to manage generative-ai-family in tenancy

-- Compartment-scoped inference only
allow group GenAI-Users to use generative-ai-chat in compartment AI-Prod

-- Embedding inference only
allow group GenAI-Users to use generative-ai-text-embedding in compartment AI-Prod

-- Manage clusters and endpoints
allow group GenAI-Ops to manage generative-ai-dedicated-ai-cluster in compartment AI-Prod
allow group GenAI-Ops to manage generative-ai-endpoint in compartment AI-Prod

Exam trap: To use the chat API, the verb is use on resource generative-ai-chat, not manage. The use verb is sufficient for inference. manage is only needed for creating/deleting resources.

Fine-Tuning Data Access

Training datasets in Object Storage require separate policies:

-- Upload datasets
allow group GenAI-Admins to manage object-family in compartment Data-Bucket

-- Read datasets during model creation
allow group GenAI-Admins to use object-family in compartment Data-Bucket

If training data and custom models are in different compartments, the user creating the model needs use object-family in the compartment containing the bucket.

7.2 Network Security

Private Endpoints: Restrict model access to traffic from within a VCN. (Prerequisites for Private Endpoints)

Prerequisites:

  1. Create a VCN in the tenancy
  2. Create a private subnet in the VCN
  3. IAM policies: manage generative-ai-private-endpoint and manage virtual-network-family

The private endpoint is deployed as a VNIC in the private subnet. You manage the subnet's security rules and can optionally add network security groups (NSGs).

Service Gateway: For private network access to OCI services without internet traversal, use a VCN service gateway to reach OCI Generative AI through the Oracle Services Network.

7.3 Data Privacy and Model Isolation

  • Dedicated AI clusters are single-tenant -- your data and models are not shared
  • Custom model training data stays in your Object Storage bucket
  • Fine-tuned models are private to your tenancy
  • Guardrails provide additional content and PII controls

8. OCI GenAI Playground

The Playground is the Console-based no-code interface for testing models. It supports three modes:

Mode | Function | Available Models
Chat | Conversational interaction with chat models | All chat models (Cohere, Llama, etc.)
Embedding | Generate text/image embeddings | All Cohere Embed models
Generation | Text generation (legacy) | Retired from on-demand; dedicated clusters only

Playground Features

  • Parameter tuning: Adjust temperature, top-p, top-k, penalties, max tokens in the UI
  • Token display: Shows input and output token counts after each generation
  • Code export: "View code" generates code in multiple languages (Python, Java, etc.) with authentication pre-configured
  • Vision support: Upload .png or .jpg images (max 5 MB) for multimodal models
  • Example prompts: Pre-built prompts for quick testing
  • Embedding visualization: 2D projection of embedding vectors showing semantic similarity

Embedding Playground specifics:

  • Maximum 96 inputs per run
  • File upload: .txt files only, newline-separated entries
  • Truncate parameter defaults to NONE (returns error if exceeded)
  • Export embeddings as JSON

Exam trap: In the embedding Playground, the Truncate parameter resets to NONE every time you click Clear. You must re-set it before each run if you want truncation.


9. Regional Availability

Model availability varies significantly by region. Key patterns to know: (Models by Region)

Region | On-Demand | Dedicated | Notes
US Midwest (Chicago) | Broadest on-demand availability | Full dedicated support | Primary region for all models
US East (Ashburn) | Limited | Dedicated for most models | No on-demand for many models
Germany Central (Frankfurt) | Google Gemini (via interconnect) | Cohere, Llama, OpenAI | EU data residency
UK South (London) | Limited | Cohere, Llama | --
Japan Central (Osaka) | Limited | Cohere, Llama | APAC

Availability symbols in Oracle docs:

  • Check mark: Available (on-demand and dedicated)
  • Check mark + o: On-demand only
  • Check mark + d: Dedicated AI clusters only
  • Check mark + G: Available through Oracle Interconnect for Google Cloud only

Exam trap: Google Gemini and xAI Grok models make external calls -- they route through Google Cloud or xAI infrastructure respectively. If data sovereignty is a concern, these models may not be appropriate. Cohere and Meta models run on OCI infrastructure.


10. Exam Focus: Common Traps and Pitfalls

  1. On-demand response cap: Always 4,000 tokens. Dedicated mode is uncapped. If a question asks about maximum response length, the answer depends on the deployment mode.

  2. Fine-tuning model support: Only cohere.command-r-08-2024, meta.llama-3.3-70b-instruct, and meta.llama-3.1-70b-instruct support fine-tuning. Command R+ does not. The 405B model does not. Embedding models do not.

  3. T-Few vs. LoRA: The system auto-selects. T-Few is for Cohere; LoRA is for Llama (and also available for Cohere Command R 08-2024). You do not manually pick during model creation.

  4. JSONL format: The only accepted format for training data is JSONL with exactly {"prompt": "...", "completion": "..."}. Minimum 32 pairs. One dataset per model. Auto-split 80/20.

  5. Cluster unit types: Each model has a specific required unit type. You cannot mix models requiring different unit types on the same cluster. Know Small Cohere V2, Large Cohere V2_2, Large Generic 2, Embed Cohere.

  6. IAM resource names: generative-ai-family is the aggregate type. For chat inference, the resource is generative-ai-chat with the use verb. For managing clusters, you need manage on generative-ai-dedicated-ai-cluster.

  7. Private endpoints: Only for pretrained and custom models. Imported models are public only. Requires VCN + private subnet + IAM policies for both GenAI and virtual-network-family.

  8. Embedding input_type: SEARCH_DOCUMENT for indexing documents, SEARCH_QUERY for queries. Using the correct type optimizes retrieval. Also available: CLASSIFICATION, CLUSTERING, IMAGE.

  9. Embed dimensions: Standard = 1024, Light = 384. Embed v4 supports configurable dimensions (256, 512, 1024, 1536).

  10. Guardrails disabled by default: You must explicitly enable content moderation, prompt injection defense, and PII detection. They do not run automatically.


References