Domain 1: Fundamentals of Large Language Models (20%)

Domain 1 of the 1Z0-1127-25 Oracle Cloud Infrastructure 2025 Generative AI Professional exam covers the theoretical and practical foundations of large language models. This domain represents approximately 10 questions on the exam. It is the conceptual bedrock for the remaining three domains -- every question about the OCI Generative AI service, RAG, and agents assumes you understand these fundamentals.

The exam tests six areas within this domain:

  1. Transformer architecture and LLM types
  2. Tokenization and embeddings
  3. Prompt engineering techniques
  4. Decoding strategies and inference parameters
  5. Fine-tuning methods (with emphasis on T-Few and LoRA)
  6. Emerging LLM topics (code models, multi-modal models, language agents, hallucination)

1. Transformer Architecture and LLM Types

The Transformer

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is the foundation of every modern LLM. The exam expects you to understand its core mechanisms, not implement them.

Key components:

| Component | Function | Exam Relevance |
|---|---|---|
| Self-Attention | Allows each token to attend to every other token in the sequence, computing relevance scores | Core mechanism -- understand that it enables context-awareness across the full input |
| Multi-Head Attention | Runs multiple self-attention operations in parallel, each learning different relationship patterns | Enables the model to capture different types of dependencies (syntactic, semantic, positional) simultaneously |
| Positional Encoding | Injects sequence order information since Transformers process all tokens in parallel, not sequentially | Without it, "dog bites man" and "man bites dog" would be identical to the model |
| Feed-Forward Network | Applied to each position independently after attention; adds non-linear transformation capacity | Present in every transformer layer |
| Layer Normalization | Stabilizes training by normalizing activations within each layer | Mentioned in context of training stability |
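The core of self-attention can be sketched in a few lines. This is a deliberately simplified single-head version that omits the learned query/key/value projection matrices a real Transformer uses -- it only shows how every token's output becomes a relevance-weighted mix of every other token:

```python
import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention for one head.
    Each row of X is a token embedding; every token attends to every other."""
    d_k = X.shape[1]
    scores = X @ X.T / np.sqrt(d_k)                  # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row
    return weights @ X                               # context-aware representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
out = self_attention(X)
assert out.shape == X.shape                          # same shape, now context-mixed
```

Because `scores` is computed for every token pair, the cost grows quadratically with sequence length -- the same fact the exam tests about context window size.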

Three Model Architectures

The Transformer spawned three architectural patterns. This distinction is heavily tested:

Architecture Examples What It Does Use Cases
Encoder-only BERT, MiniLM Produces embeddings (vector representations of input text); bidirectional context Semantic search, text classification, sentiment analysis, named entity recognition
Decoder-only GPT-4, Llama, Cohere Command Generates text token-by-token autoregressively (left-to-right) Text generation, chat, summarization, code generation
Encoder-Decoder T5, BART Encodes full input, then decodes output; bidirectional encoding with autoregressive decoding Translation, summarization, question answering

Exam trap: BERT is an encoder -- it does not generate text. GPT and Llama are decoders -- they generate text but are not designed to produce embeddings natively (though embeddings can be extracted). If a question asks which model type produces embeddings for semantic search, the answer is encoder models.

What Makes a Model "Large"

A large language model (LLM) is a probabilistic model of text with a very large number of parameters. There is no agreed-upon threshold for "large." The exam focuses on the practical implication: LLMs can perform tasks they were not explicitly trained on through in-context learning. (OCI GenAI Concepts)

2. Tokenization and Embeddings

Tokenization

Tokens are the fundamental input units for LLMs -- words, subwords, or individual characters/punctuation. The model never sees raw text; it sees token IDs. (OCI GenAI Concepts)

Examples from Oracle documentation:

  • "apple" = 1 token
  • "friendship" = 2 tokens ("friend" + "ship")
  • "don't" = 2 tokens ("don" + "'t")
  • Rule of thumb: ~4 characters per token

Tokenization algorithms (know the names and distinctions):

| Algorithm | Key Characteristic | Used By |
|---|---|---|
| Byte Pair Encoding (BPE) | Iteratively merges the most frequent character pairs into subword tokens | GPT family, Llama |
| WordPiece | Similar to BPE but uses likelihood-based merging rather than frequency | BERT |
| SentencePiece | Language-agnostic; treats the input as a raw character stream (whitespace included), so no language-specific pre-tokenization is needed | T5, multilingual models |
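BPE's merge loop is simple enough to sketch directly. This toy version (not a production tokenizer -- real implementations work over a weighted corpus vocabulary and byte-level inputs) shows how frequent adjacent pairs become subword tokens like "friend" + "ship":

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = [list(w) for w in words]      # start with character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in vocab:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = []
        for symbols in vocab:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)    # replace the pair with one new symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab.append(out)
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["friend", "friendly", "friendship"], 4)
# After 4 merges, "friend" is tokenized as ["frien", "d"]
```

With more merges (and a real corpus), common words collapse to single tokens while rare words stay split into subwords -- which is why "apple" is 1 token but "friendship" is 2.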

Exam trap: The exam may ask about the relationship between context window size (measured in tokens) and model capability. Larger context windows allow more input, but self-attention's computational cost grows quadratically with sequence length.

Embeddings

Embeddings are numerical vector representations that capture the semantic meaning of text. They transform words, phrases, or entire documents into arrays of numbers (typically 384 or 1024 dimensions in OCI). (OCI GenAI Concepts)

Similarity measurement methods:

| Method | What It Measures | Key Detail |
|---|---|---|
| Cosine Similarity | Directional similarity between vectors | A cosine similarity of 1 (cosine distance of 0) means the vectors point in the same direction (similar); magnitude is ignored |
| Dot Product | Combined magnitude and direction | Higher values indicate vectors pointing in the same direction; equals cosine similarity when vectors are normalized to unit length |
| K-Nearest Neighbors / ANN | Approximate nearest neighbor search | Algorithms such as HNSW, implemented in libraries like FAISS and Annoy, enable efficient similarity retrieval at scale |
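Cosine similarity and the dot product are one line each. This sketch uses tiny 3-dimensional vectors purely for illustration (real OCI embeddings are 384- or 1024-dimensional, and the numbers here are invented):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: direction only, invariant to vector magnitude."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented toy embeddings: "car" and "automobile" should land close together.
car = np.array([0.9, 0.1, 0.2])
automobile = np.array([0.85, 0.15, 0.25])
banana = np.array([0.1, 0.9, 0.3])

assert cosine_similarity(car, automobile) > cosine_similarity(car, banana)

# For unit-normalized vectors, the dot product IS the cosine similarity.
u = car / np.linalg.norm(car)
v = automobile / np.linalg.norm(automobile)
assert abs(np.dot(u, v) - cosine_similarity(car, automobile)) < 1e-9
```

This is exactly why embedding search matches "automobile" for a "car" query: the comparison happens in vector space, not on keywords.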

OCI embedding models: Cohere Embed 4 (multimodal -- text + images), Cohere Embed English 3, Cohere Embed Multilingual 3, plus Light variants for lower latency. (OCI Pretrained Models)

Exam trap: Embedding-based search is semantic (meaning-based), not keyword-based. A keyword search for "car" will not match "automobile." Embedding search will, because their vectors are close in semantic space.

3. Prompt Engineering

Prompt engineering is the iterative process of crafting input text to optimize LLM outputs. It modifies the model's output probability distribution without changing any model parameters. This distinction between prompting (no parameter changes) and fine-tuning (parameter changes) is fundamental to the exam. (OCI GenAI Concepts)

In-Context Learning Techniques

| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | No examples provided; task description only | Model already understands the task well |
| Few-shot (k-shot) | Task description plus k examples of input-output pairs | Model needs demonstrations to understand the desired format or behavior |
| Chain-of-Thought (CoT) | Instructs the model to emit intermediate reasoning steps before the final answer | Complex multi-step reasoning tasks (math, logic, multi-hop questions) |
| Zero-shot CoT | Adds "Let's think step by step" without providing reasoning examples | Simpler than full CoT but still improves reasoning |
| Least-to-Most | Decomposes a complex problem into subproblems, solves easiest first | Problems that build on intermediate results |
| Step-Back | Identifies high-level concepts or principles before answering the specific question | Requires abstraction from specifics to general principles |
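Few-shot and zero-shot CoT prompts are just assembled text. This sketch (the task and examples are invented, and the `Input:`/`Output:` labels are one common convention, not an OCI requirement) shows the mechanical difference between the two techniques:

```python
def few_shot_prompt(task, examples, query):
    """Assemble a k-shot prompt: task description, then k input/output pairs."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]   # model completes from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved this product!", "positive"),
     ("Terrible support experience.", "negative")],
    "Shipping was fast and the item works great.",
)

# Zero-shot CoT needs no examples -- only the reasoning trigger phrase.
cot_prompt = ("If a train travels 60 km in 40 minutes, what is its speed "
              "in km/h? Let's think step by step.")
```

Note that neither technique touches model parameters -- both only reshape the input text, which is the prompting-versus-fine-tuning distinction the exam tests.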

Soft Prompting

Soft prompting (also called prompt tuning) is a hybrid between prompting and fine-tuning. It adds a small number of trainable parameters (soft prompts) to the model's input layer while keeping all original model weights frozen. (Jvikraman GenAI Notes)

Exam trap: Soft prompting is NOT the same as prompt engineering. Prompt engineering uses natural language text as input. Soft prompting uses learned continuous vectors prepended to the input. Soft prompting modifies parameters (the soft prompt vectors); prompt engineering does not.
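The distinction is easiest to see at the input layer. In this toy sketch (sizes are invented; a real implementation would backpropagate through the frozen model), the soft prompt is a small matrix of continuous vectors prepended to the embedded input -- there is no natural-language text involved:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_soft = 16, 5, 4     # toy dimensions

token_embeddings = rng.normal(size=(seq_len, d_model))  # from the frozen model
soft_prompt = rng.normal(size=(n_soft, d_model))        # the ONLY trainable params

# Soft prompting: prepend learned continuous vectors to the embedded input.
# All base-model weights stay frozen; gradients flow only into soft_prompt.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
assert model_input.shape == (n_soft + seq_len, d_model)
```

Because only `soft_prompt` is trained, the technique sits between prompt engineering (zero new parameters) and fine-tuning (many updated parameters).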

Preamble (System Prompt)

In OCI Generative AI, the preamble is the initial context or guiding message sent to a chat model. It sets the model's role, tone, and behavioral constraints. The default preamble for Cohere Command R+ is: "You are Command. You are an extremely capable large language model built by Cohere." This is customizable per API call. (OCI GenAI Concepts)

Prompt Security

Two attack vectors the exam covers:

  • Prompt Injection / Jailbreaking: Malicious input designed to override system instructions or produce harmful outputs. OCI Guardrails includes a dedicated prompt injection defense. (OCI GenAI Concepts)
  • Memorization Attacks: Attempts to coerce the model into repeating its training data or system prompt verbatim.

4. Decoding Strategies and Inference Parameters

Decoding is the process of generating output text from an LLM. It happens iteratively, one token at a time: the model predicts a probability distribution over the vocabulary, selects a token, appends it, and repeats. (Jvikraman GenAI Notes)

Decoding Methods

| Method | How It Works | Characteristics |
|---|---|---|
| Greedy Decoding | Selects the highest-probability token at each step | Deterministic, fast, but can produce repetitive or suboptimal text |
| Beam Search | Maintains top-N candidate sequences at each step, selecting the overall highest-probability sequence | Better than greedy but more expensive; not commonly used in chat models |
| Sampling | Randomly selects from the probability distribution, controlled by temperature/top-k/top-p | Non-deterministic; produces more varied, natural-sounding output |

Inference Parameters (Critical for Exam)

These parameters are directly configurable in the OCI Generative AI Playground and API:

| Parameter | What It Controls | Effect of Higher Values | Effect of Lower Values | OCI Default |
|---|---|---|---|---|
| Temperature | Sharpness of the probability distribution | Flattens distribution, more random/creative output, higher hallucination risk | Peaks distribution around most likely token, more deterministic | Model-dependent |
| Top-k | Number of candidate tokens considered | More candidates = more randomness | Fewer candidates = more focused | 0 or -1 (all tokens) |
| Top-p (Nucleus Sampling) | Cumulative probability threshold for candidate tokens | Higher p = more tokens eligible = more variety | Lower p = fewer tokens = more focused | Model-dependent |
| Frequency Penalty | Penalizes tokens based on how many times they have appeared | Reduces repetition proportional to frequency | No penalty (0); negative values encourage repetition | 0 |
| Presence Penalty | Penalizes tokens that have appeared at all, regardless of frequency | Encourages novel tokens | No penalty (0); negative values encourage repetition | 0 |
| Max Output Tokens | Hard limit on generated sequence length | Longer responses allowed | Shorter responses | Model-dependent |
| Stop Sequences | Strings that terminate generation when produced | N/A | N/A | None |

(OCI GenAI Concepts)

Exam trap: Temperature = 0 produces deterministic output (combine with a seed parameter for reproducible results). High temperature (approaching 1.0 and above) increases hallucination risk. The exam frequently tests which parameter combination is most likely to cause hallucinations -- the answer is high temperature + high top-p + low penalties.

Exam trap: Frequency penalty and presence penalty are different. Frequency penalty scales with how many times a token appeared. Presence penalty applies equally to all tokens that appeared at least once, regardless of count. Know this distinction.
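The two penalties differ by exactly one term, which a short sketch makes concrete (this is a generic illustration of the standard scheme, not the exact formula any particular OCI model implements):

```python
from collections import Counter
import numpy as np

def apply_penalties(logits, generated_ids, frequency_penalty, presence_penalty):
    """Adjust next-token logits based on tokens already generated."""
    logits = logits.copy()
    for token_id, count in Counter(generated_ids).items():
        logits[token_id] -= frequency_penalty * count   # scales with repetitions
        logits[token_id] -= presence_penalty            # flat, once per seen token
    return logits

logits = np.zeros(5)
generated = [2, 2, 2, 4]   # token 2 appeared 3 times, token 4 once
out = apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=1.0)
# token 2: -0.5*3 - 1.0 = -2.5 | token 4: -0.5*1 - 1.0 = -1.5 | unseen: 0.0
```

The frequency penalty hits token 2 three times as hard as token 4; the presence penalty hits both equally.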

How top-k and top-p interact: When both are set, the model first restricts candidates to the top-k tokens, then keeps only the most probable of those until their cumulative probability reaches p. For example, with k=20 and p=0.75, if the top 10 tokens already sum to 0.75, only those 10 are considered.
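The whole pipeline -- temperature, then top-k, then top-p -- fits in one function. This is a generic sketch of standard sampling logic, not OCI's implementation; the `top_k=0` convention for "consider all tokens" mirrors the default in the table above:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature + top-k + nucleus (top-p) sampling over one logits vector.
    top_k=0 means all tokens are eligible; temperature -> 0 approaches greedy."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())    # softmax (numerically stable)
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # token ids, most probable first
    if top_k > 0:
        order = order[:top_k]                # step 1: keep the top-k tokens
    keep = np.cumsum(probs[order]) <= top_p  # step 2: nucleus cut on survivors
    keep[0] = True                           # always keep the single best token
    candidates = order[keep]

    p = probs[candidates] / probs[candidates].sum()  # renormalize
    return int(rng.choice(candidates, p=p))

logits = [2.0, 1.0, 0.5, -1.0]
greedy = sample_next_token(logits, temperature=1e-6)  # near-deterministic: picks 0
```

Lowering temperature sharpens `probs` toward the argmax (greedy behavior); lowering `top_k` or `top_p` shrinks `candidates` -- matching the "more focused" column in the parameter table.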

5. Fine-Tuning Methods

Fine-tuning modifies model parameters using task-specific data. This is the key distinction from prompting, which leaves parameters unchanged. The exam heavily tests the differences between fine-tuning approaches. (Jvikraman GenAI Notes)

Fine-Tuning Methods Comparison

| Method | What It Modifies | Data Requirement | Compute Cost | Overfitting Risk |
|---|---|---|---|---|
| Vanilla (Full) Fine-Tuning | All or most model parameters | Large labeled dataset | Very high | High (especially with small datasets) |
| T-Few | Only a fraction of weights in specific transformer layers (~0.01% of parameters) | Small labeled dataset | Low | Low |
| LoRA (Low-Rank Adaptation) | Adds small trainable rank-decomposition matrices alongside frozen original weights | Moderate labeled dataset | Moderate | Moderate |
| QLoRA | LoRA applied to a 4-bit quantized base model | Moderate labeled dataset | Lower than LoRA | Moderate |
| Soft Prompting | Only the prepended soft prompt vectors; all model weights frozen | Small labeled dataset | Very low | Low |

T-Few in Depth (Heavily Tested)

T-Few is an additive Parameter-Efficient Fine-Tuning (PEFT) technique based on the (IA)^3 method from the paper "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning". Key facts:

  • Mechanism: Inserts learned vectors that rescale inner activations within attention and feed-forward modules. The original pre-trained weights remain frozen.
  • Parameter efficiency: Adds approximately 0.01% of the baseline model's size in new parameters.
  • Performance: Outperforms GPT-3 in-context learning on benchmarks while using over 1,000x fewer FLOPs during inference.
  • OCI usage: OCI Generative AI uses T-Few for Cohere Command R (08-2024) models. (OCI Fine-Tuning Methods)
  • Training: Requires a dedicated AI cluster with 2 model units. Minimum 1 unit-hour per job.
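The (IA)^3 mechanism behind T-Few reduces to elementwise rescaling. In this toy sketch (dimensions invented; a real model applies such vectors to the attention keys/values and the FFN inner activations), note that the scaling vector starts at ones, so the adapted model is initially identical to the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Frozen pretrained projection (stand-in for an attention or FFN weight).
W = rng.normal(size=(d_model, d_model))
x = rng.normal(size=d_model)

# (IA)^3-style learned rescaling vector: the ONLY trainable parameters here.
# Initialized to ones so training starts from the unmodified base model.
l_scale = np.ones(d_model)

activation = l_scale * (W @ x)            # elementwise rescale of the activation
assert np.allclose(activation, W @ x)     # identity at initialization
```

One d-dimensional vector per rescaled activation is how T-Few stays near 0.01% of the base model's parameter count.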

T-Few hyperparameters in OCI:

| Parameter | Default | Range |
|---|---|---|
| Total training epochs | 1 | 1-10 |
| Learning rate | 0.01 | 0.000005-0.1 |
| Training batch size | 16 | 8-32 |
| Early stopping patience | 10 | 0 (disabled) or 1-16 |
| Early stopping threshold | 0.001 | 0.001-0.1 |

(OCI Fine-Tuning Parameters)

LoRA in Depth

LoRA adds small trainable rank-decomposition (low-rank) matrices that transform inputs and outputs, rather than updating the original parameters, which stay frozen. The low-rank update is scaled by LoRA alpha / LoRA r before being added to the frozen path.

  • OCI usage: OCI uses LoRA for Meta Llama 3.3 70B, Llama 3.1 70B, and Cohere Command R (08-2024). (OCI Fine-Tuning Methods)
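The LoRA forward pass can be sketched directly. Dimensions here are toy values (though r and alpha match the OCI defaults below), and the zero-initialization of B is the standard trick that makes the adapted model start out identical to the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 8               # LoRA r=8, alpha=8 (the OCI defaults)

W = rng.normal(size=(d, d))          # frozen pretrained weight, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable, zero-init => no effect at start

def lora_forward(x):
    # Frozen path plus low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
assert np.allclose(lora_forward(x), W @ x)   # identical to base model at init

# Trainable params: 2*d*r = 1,024 vs d*d = 4,096 frozen; savings grow with d.
```

Raising LoRA r buys the update more expressive capacity at the cost of more trainable parameters; the alpha / r scaling keeps the update's magnitude comparable as r changes.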

LoRA hyperparameters in OCI:

| Parameter | Default | Range |
|---|---|---|
| Total training epochs | 3 | 1+ |
| Learning rate | 0.0002 | 0-1.0 |
| Training batch size | 8 | 8-16 |
| LoRA r (rank) | 8 | 1-64 |
| LoRA alpha | 8 | 1-128 |
| LoRA dropout | 0.1 | Decimal < 1 |
| Early stopping patience | 15 | 0 (disabled) or 1+ |
| Early stopping threshold | 0.0001 | 0 or positive |

(OCI Fine-Tuning Parameters)

Training steps formula: totalTrainingSteps = (totalTrainingEpochs x size(trainingDataset)) / trainingBatchSize
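Applied to the defaults above, the formula works out as follows (the dataset size of 800 is an invented example, and integer division is an assumption for the case where the dataset does not divide evenly):

```python
def total_training_steps(epochs, dataset_size, batch_size):
    """totalTrainingSteps = (totalTrainingEpochs * size(trainingDataset)) / trainingBatchSize"""
    return (epochs * dataset_size) // batch_size

# Example: 3 epochs (LoRA default) over 800 samples, batch size 8 (LoRA default).
steps = total_training_steps(3, 800, 8)   # (3 * 800) / 8 = 300 steps
```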

OCI Model-to-Method Mapping (Memorize This)

| Base Model | Supported Fine-Tuning Method(s) |
|---|---|
| meta.llama-3.3-70b-instruct | LoRA |
| meta.llama-3.1-70b-instruct | LoRA |
| cohere.command-r-08-2024 | T-Few, LoRA |

(OCI Fine-Tuning Methods)

Exam trap: Cohere Command R supports both T-Few and LoRA. Meta Llama models support only LoRA. If a question asks which method is available for Llama fine-tuning, T-Few is wrong.

Fine-Tuning Evaluation Metrics

| Metric | What It Measures | Direction |
|---|---|---|
| Accuracy | Proportion of correct predictions out of total predictions on validation data | Higher is better |
| Loss | Magnitude of prediction error | Lower is better (should decrease during training) |

Exam trap: The exam tests the difference. Accuracy is "how many right." Loss is "how wrong." A model can have improving loss but fluctuating accuracy during early training. Both are reported on the OCI model details page under "Model Performance."

6. Emerging LLM Topics

Code Models

Code models are LLMs trained on code, comments, and documentation. They assist with code generation, completion, debugging, and explanation. Key examples: Codex (OpenAI), Code Llama (Meta), StarCoder (BigCode). In OCI, xAI Grok Code Fast 1 is available as a coding-focused model supporting TypeScript, Python, Java, Rust, C++, and Go. (OCI Pretrained Models)

Multi-Modal Models

Multi-modal LLMs process and generate across multiple data types (text, images, audio, video). In OCI, several multi-modal models are available:

| Model | Modalities | Notes |
|---|---|---|
| Cohere Command A Vision | Text + Images | Visual data understanding (images, charts, documents) |
| Llama 3.2 90B Vision | Text + Images | Largest vision-capable Llama model |
| Llama 3.2 11B Vision | Text + Images | Compact multimodal option |
| Llama 4 Maverick / Scout | Text + Images | Mixture of Experts (MoE) architecture |
| Gemini 2.5 Pro/Flash | Text + Images | Google multimodal models |
| Cohere Embed 4 | Text + Images (embeddings) | Multimodal embeddings for search |

(OCI Pretrained Models)

Diffusion models (DALL-E, Stable Diffusion) produce complex outputs simultaneously rather than token-by-token. They work well for images because image data is continuous. They are difficult to apply to text generation because text is categorical/discrete. (DBExam Sample Questions)

Exam trap: The exam asks why diffusion models are hard to apply to text. The answer is that text representation is categorical (discrete tokens), unlike continuous image pixel data.

Language Agents

Language agents use LLMs as reasoning engines that can take actions in the real world. Key concepts:

| Concept | Description |
|---|---|
| ReAct | Combines Reasoning and Acting -- the LLM alternates between thinking about what to do and taking actions (tool calls) based on that reasoning |
| Toolformer | LLM that learns to call external APIs/tools autonomously |
| Function Calling | Structured mechanism where the LLM outputs a function name and arguments, which external code executes |
| Bootstrapped Reasoning | Agent improves its reasoning through iterative self-generated examples |
In OCI, agentic capabilities are supported by models like Cohere Command A Reasoning, Grok 4.1 Fast, and Llama 4 Maverick/Scout. (OCI Pretrained Models)

Hallucination

Hallucination is the phenomenon where an LLM generates factually incorrect information or unrelated content as if it were true. There is no known methodology to reliably eliminate hallucination entirely. (Jvikraman GenAI Notes)

Causes:

  • Training data gaps, biases, or errors
  • High temperature during decoding (flattened probability distribution)
  • Lack of grounding in source documents
  • Ambiguous or poorly constructed prompts

Mitigation strategies:

| Strategy | How It Helps |
|---|---|
| Lower temperature | Sharpens probability distribution toward most likely (factual) tokens |
| RAG (Retrieval-Augmented Generation) | Grounds responses in retrieved source documents |
| Attribution/Grounding | Trains models to output citations; generated text is "grounded" if a document supports it |
| OCI Guardrails | Content moderation, prompt injection defense, and PII handling -- configurable safety controls on OCI endpoints |
| Fine-tuning on domain data | Teaches the model domain-specific facts |
| Human evaluation | Review outputs for factual correctness |

Exam trap: RAG does not eliminate hallucination. It reduces it by providing context. The model can still hallucinate information not present in the retrieved documents. The RAG Triad evaluation framework (Context Relevance, Groundedness, Answer Relevance) tests whether the full pipeline is working correctly.

Model Evaluation Metrics

| Metric | What It Measures | Used For |
|---|---|---|
| Perplexity | How well the model predicts a sample; lower is better | General language model quality |
| BLEU | N-gram overlap between generated and reference text | Machine translation |
| ROUGE | Recall-oriented overlap between generated and reference text | Summarization |
| Human Evaluation | Human judges rate quality, factual accuracy, coherence | Gold standard but expensive and slow |
| Accuracy | Correct predictions / total predictions | Classification tasks, fine-tuning evaluation |
| Loss | Prediction error magnitude (cross-entropy loss) | Training monitoring |
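Perplexity and cross-entropy loss are two views of the same quantity: perplexity is the exponential of the mean negative log-likelihood. The token probabilities below are invented to show the direction of the metric:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens).
    A perplexity of 1 means the model predicted every token with certainty."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95])   # model predicts well  -> low value
uncertain = perplexity([0.2, 0.1, 0.25])   # model predicts badly -> high value
assert confident < uncertain
```

This is why "lower is better" for both loss and perplexity: both shrink as the model assigns more probability to the tokens that actually occur.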

7. OCI Generative AI Service -- Domain 1 Context

While OCI service details are primarily tested in Domain 2, Domain 1 questions may reference these service concepts:

| Feature | Description |
|---|---|
| On-Demand Inference | Pay-per-call, no upfront commitment, suitable for experimentation |
| Dedicated AI Clusters | Reserved GPU resources for hosting and fine-tuning; required for custom models |
| Model Endpoints | Designated points on dedicated clusters that accept inference requests |
| Guardrails | Content moderation, prompt injection defense, PII handling -- must be explicitly enabled on endpoints |
| Playground | Interactive testing interface for pretrained and custom model endpoints |

(OCI GenAI Overview)

Quick-Reference: Decision Tree for Customization

Use this to answer "which technique should you use" questions:

Is the model performing well enough with standard prompting?
  YES --> Use zero-shot or few-shot prompting (no cost, no training)
  NO --> Do you have labeled training data?
    NO --> Use few-shot prompting with more examples, or RAG for grounding
    YES --> Is the dataset small (<100K samples)?
      YES --> Use T-Few (Cohere) or LoRA (Llama) for parameter-efficient fine-tuning
      NO --> Consider vanilla fine-tuning (if compute budget allows)
Does the data change frequently?
  YES --> Use RAG (retrieves current information, no retraining needed)
  NO --> Fine-tuning is acceptable
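For drilling, the tree above can be encoded as a function. This is a simplification -- it folds the guide's separate data-freshness question in as the first check, since frequently changing data points to RAG regardless of the other answers:

```python
def pick_customization(prompting_good_enough, has_labeled_data,
                       small_dataset, data_changes_often):
    """Encode the decision tree above; return values are the guide's answers."""
    if data_changes_often:
        return "RAG"                              # current info, no retraining
    if prompting_good_enough:
        return "zero/few-shot prompting"          # no cost, no training
    if not has_labeled_data:
        return "few-shot prompting or RAG"        # nothing to fine-tune on
    if small_dataset:
        return "T-Few (Cohere) or LoRA (Llama)"   # parameter-efficient tuning
    return "vanilla fine-tuning"                  # if compute budget allows

answer = pick_customization(prompting_good_enough=False, has_labeled_data=True,
                            small_dataset=True, data_changes_often=False)
```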

References