Domain 1: Fundamentals of Large Language Models (20%)

Domain 1 of the 1Z0-1127-25 Oracle Cloud Infrastructure 2025 Generative AI Professional exam covers the theoretical and practical foundations of large language models. This domain represents approximately 10 questions on the exam. It is the conceptual bedrock for the remaining three domains -- every question about the OCI Generative AI service, RAG, and agents assumes you understand these fundamentals.

The exam tests six areas within this domain:

  1. Transformer architecture and LLM types
  2. Tokenization and embeddings
  3. Prompt engineering techniques
  4. Decoding strategies and inference parameters
  5. Fine-tuning methods (with emphasis on T-Few and LoRA)
  6. Emerging LLM topics (code models, multi-modal models, language agents, hallucination)

1. Transformer Architecture and LLM Types

The Transformer

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is the foundation of every modern LLM. The exam expects you to understand its core mechanisms, not implement them.

Key components:

| Component | Function | Exam Relevance |
|---|---|---|
| Self-Attention | Allows each token to attend to every other token in the sequence, computing relevance scores | Core mechanism -- understand that it enables context-awareness across the full input |
| Multi-Head Attention | Runs multiple self-attention operations in parallel, each learning different relationship patterns | Enables the model to capture different types of dependencies (syntactic, semantic, positional) simultaneously |
| Positional Encoding | Injects sequence order information since Transformers process all tokens in parallel, not sequentially | Without it, "dog bites man" and "man bites dog" would be identical to the model |
| Feed-Forward Network | Applied to each position independently after attention; adds non-linear transformation capacity | Present in every transformer layer |
| Layer Normalization | Stabilizes training by normalizing activations within each layer | Mentioned in context of training stability |
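The core of self-attention can be sketched in a few lines. This is a deliberately simplified single-head version that omits the learned query/key/value projection matrices a real Transformer uses -- it only shows how every token's output becomes a relevance-weighted mix of every other token:

```python
import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention for one head.
    Each row of X is a token embedding; every token attends to every other."""
    d_k = X.shape[1]
    scores = X @ X.T / np.sqrt(d_k)                  # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row
    return weights @ X                               # context-aware representations

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
out = self_attention(X)
assert out.shape == X.shape                          # same shape, now context-mixed
```

Because `scores` is computed for every token pair, the cost grows quadratically with sequence length -- the same fact the exam tests about context window size.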

Three Model Architectures

The Transformer spawned three architectural patterns. This distinction is heavily tested:

Architecture Examples What It Does Use Cases
Encoder-only BERT, MiniLM Produces embeddings (vector representations of input text); bidirectional context Semantic search, text classification, sentiment analysis, named entity recognition
Decoder-only GPT-4, Llama, Cohere Command Generates text token-by-token autoregressively (left-to-right) Text generation, chat, summarization, code generation
Encoder-Decoder T5, BART Encodes full input, then decodes output; bidirectional encoding with autoregressive decoding Translation, summarization, question answering

Exam trap: BERT is an encoder -- it does not generate text. GPT and Llama are decoders -- they generate text but are not designed to produce embeddings natively (though embeddings can be extracted). If a question asks which model type produces embeddings for semantic search, the answer is encoder models.

What Makes a Model "Large"

A large language model (LLM) is a probabilistic model of text with a very large number of parameters. There is no agreed-upon threshold for "large." The exam focuses on the practical implication: LLMs can perform tasks they were not explicitly trained on through in-context learning. (OCI GenAI Concepts)

2. Tokenization and Embeddings

Tokenization

Tokens are the fundamental input units for LLMs -- words, subwords, or individual characters/punctuation. The model never sees raw text; it sees token IDs. (OCI GenAI Concepts)

Examples from Oracle documentation:

  • "apple" = 1 token
  • "friendship" = 2 tokens ("friend" + "ship")
  • "don't" = 2 tokens ("don" + "'t")
  • Rule of thumb: ~4 characters per token

Tokenization algorithms (know the names and distinctions):

| Algorithm | Key Characteristic | Used By |
|---|---|---|
| Byte Pair Encoding (BPE) | Iteratively merges the most frequent character pairs into subword tokens | GPT family, Llama |
| WordPiece | Similar to BPE but uses likelihood-based merging rather than frequency | BERT |
| SentencePiece | Language-agnostic; treats the input as a raw character stream (whitespace included), so no language-specific pre-tokenization is needed | T5, multilingual models |
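BPE's merge loop is simple enough to sketch directly. This toy version (not a production tokenizer -- real implementations work over a weighted corpus vocabulary and byte-level inputs) shows how frequent adjacent pairs become subword tokens like "friend" + "ship":

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = [list(w) for w in words]      # start with character-level symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in vocab:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = []
        for symbols in vocab:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)    # replace the pair with one new symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab.append(out)
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["friend", "friendly", "friendship"], 4)
# After 4 merges, "friend" is tokenized as ["frien", "d"]
```

With more merges (and a real corpus), common words collapse to single tokens while rare words stay split into subwords -- which is why "apple" is 1 token but "friendship" is 2.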

Exam trap: The exam may ask about the relationship between context window size (measured in tokens) and model capability. Larger context windows allow more input, but self-attention's computational cost grows quadratically with sequence length.

Embeddings

Embeddings are numerical vector representations that capture the semantic meaning of text. They transform words, phrases, or entire documents into arrays of numbers (typically 384 or 1024 dimensions in OCI). (OCI GenAI Concepts)

Similarity measurement methods:

| Method | What It Measures | Key Detail |
|---|---|---|
| Cosine Similarity | Directional similarity between vectors | A cosine similarity of 1 (cosine distance of 0) means the vectors point in the same direction (similar); magnitude is ignored |
| Dot Product | Combined magnitude and direction | Higher values indicate vectors pointing in the same direction; equals cosine similarity when vectors are normalized to unit length |
| K-Nearest Neighbors / ANN | Approximate nearest neighbor search | Algorithms such as HNSW, implemented in libraries like FAISS and Annoy, enable efficient similarity retrieval at scale |
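Cosine similarity and the dot product are one line each. This sketch uses tiny 3-dimensional vectors purely for illustration (real OCI embeddings are 384- or 1024-dimensional, and the numbers here are invented):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: direction only, invariant to vector magnitude."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented toy embeddings: "car" and "automobile" should land close together.
car = np.array([0.9, 0.1, 0.2])
automobile = np.array([0.85, 0.15, 0.25])
banana = np.array([0.1, 0.9, 0.3])

assert cosine_similarity(car, automobile) > cosine_similarity(car, banana)

# For unit-normalized vectors, the dot product IS the cosine similarity.
u = car / np.linalg.norm(car)
v = automobile / np.linalg.norm(automobile)
assert abs(np.dot(u, v) - cosine_similarity(car, automobile)) < 1e-9
```

This is exactly why embedding search matches "automobile" for a "car" query: the comparison happens in vector space, not on keywords.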

OCI embedding models: Cohere Embed 4 (multimodal -- text + images), Cohere Embed English 3, Cohere Embed Multilingual 3, plus Light variants for lower latency. (OCI Pretrained Models)

Exam trap: Embedding-based search is semantic (meaning-based), not keyword-based. A keyword search for "car" will not match "automobile." Embedding search will, because their vectors are close in semantic space.

3. Prompt Engineering

Prompt engineering is the iterative process of crafting input text to optimize LLM outputs. It modifies the model's output probability distribution without changing any model parameters. This distinction between prompting (no parameter changes) and fine-tuning (parameter changes) is fundamental to the exam. (OCI GenAI Concepts)

In-Context Learning Techniques

| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | No examples provided; task description only | Model already understands the task well |
| Few-shot (k-shot) | Task description plus k examples of input-output pairs | Model needs demonstrations to understand the desired format or behavior |
| Chain-of-Thought (CoT) | Instructs the model to emit intermediate reasoning steps before the final answer | Complex multi-step reasoning tasks (math, logic, multi-hop questions) |
| Zero-shot CoT | Adds "Let's think step by step" without providing reasoning examples | Simpler than full CoT but still improves reasoning |
| Least-to-Most | Decomposes a complex problem into subproblems, solves easiest first | Problems that build on intermediate results |
| Step-Back | Identifies high-level concepts or principles before answering the specific question | Requires abstraction from specifics to general principles |
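Few-shot and zero-shot CoT prompts are just assembled text. This sketch (the task and examples are invented, and the `Input:`/`Output:` labels are one common convention, not an OCI requirement) shows the mechanical difference between the two techniques:

```python
def few_shot_prompt(task, examples, query):
    """Assemble a k-shot prompt: task description, then k input/output pairs."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]   # model completes from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved this product!", "positive"),
     ("Terrible support experience.", "negative")],
    "Shipping was fast and the item works great.",
)

# Zero-shot CoT needs no examples -- only the reasoning trigger phrase.
cot_prompt = ("If a train travels 60 km in 40 minutes, what is its speed "
              "in km/h? Let's think step by step.")
```

Note that neither technique touches model parameters -- both only reshape the input text, which is the prompting-versus-fine-tuning distinction the exam tests.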

Soft Prompting

Soft prompting (also called prompt tuning) is a hybrid between prompting and fine-tuning. It adds a small number of trainable parameters (soft prompts) to the model's input layer while keeping all original model weights frozen. (Jvikraman GenAI Notes)

Exam trap: Soft prompting is NOT the same as prompt engineering. Prompt engineering uses natural language text as input. Soft prompting uses learned continuous vectors prepended to the input. Soft prompting modifies parameters (the soft prompt vectors); prompt engineering does not.
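The distinction is easiest to see at the input layer. In this toy sketch (sizes are invented; a real implementation would backpropagate through the frozen model), the soft prompt is a small matrix of continuous vectors prepended to the embedded input -- there is no natural-language text involved:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_soft = 16, 5, 4     # toy dimensions

token_embeddings = rng.normal(size=(seq_len, d_model))  # from the frozen model
soft_prompt = rng.normal(size=(n_soft, d_model))        # the ONLY trainable params

# Soft prompting: prepend learned continuous vectors to the embedded input.
# All base-model weights stay frozen; gradients flow only into soft_prompt.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
assert model_input.shape == (n_soft + seq_len, d_model)
```

Because only `soft_prompt` is trained, the technique sits between prompt engineering (zero new parameters) and fine-tuning (many updated parameters).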

Preamble (System Prompt)

In OCI Generative AI, the preamble is the initial context or guiding message sent to a chat model. It sets the model's role, tone, and behavioral constraints. The default preamble for Cohere Command R+ is: "You are Command. You are an extremely capable large language model built by Cohere." This is customizable per API call. (OCI GenAI Concepts)

Prompt Security

Two attack vectors the exam covers:

  • Prompt Injection / Jailbreaking: Malicious input designed to override system instructions or produce harmful outputs. OCI Guardrails includes a dedicated prompt injection defense. (OCI GenAI Concepts)
  • Memorization Attacks: Attempts to coerce the model into repeating its training data or system prompt verbatim.

4. Decoding Strategies and Inference Parameters

Decoding is the process of generating output text from an LLM. It happens iteratively, one token at a time: the model predicts a probability distribution over the vocabulary, selects a token, appends it, and repeats. (Jvikraman GenAI Notes)

Decoding Methods

| Method | How It Works | Characteristics |
|---|---|---|
| Greedy Decoding | Selects the highest-probability token at each step | Deterministic, fast, but can produce repetitive or suboptimal text |
| Beam Search | Maintains top-N candidate sequences at each step, selecting the overall highest-probability sequence | Better than greedy but more expensive; not commonly used in chat models |
| Sampling | Randomly selects from the probability distribution, controlled by temperature/top-k/top-p | Non-deterministic; produces more varied, natural-sounding output |

Inference Parameters (Critical for Exam)

These parameters are directly configurable in the OCI Generative AI Playground and API:

| Parameter | What It Controls | Effect of Higher Values | Effect of Lower Values | OCI Default |
|---|---|---|---|---|
| Temperature | Sharpness of the probability distribution | Flattens distribution, more random/creative output, higher hallucination risk | Peaks distribution around most likely token, more deterministic | Model-dependent |
| Top-k | Number of candidate tokens considered | More candidates = more randomness | Fewer candidates = more focused | 0 or -1 (all tokens) |
| Top-p (Nucleus Sampling) | Cumulative probability threshold for candidate tokens | Higher p = more tokens eligible = more variety | Lower p = fewer tokens = more focused | Model-dependent |
| Frequency Penalty | Penalizes tokens based on how many times they have appeared | Reduces repetition proportional to frequency | No penalty (0); negative values encourage repetition | 0 |
| Presence Penalty | Penalizes tokens that have appeared at all, regardless of frequency | Encourages novel tokens | No penalty (0); negative values encourage repetition | 0 |
| Max Output Tokens | Hard limit on generated sequence length | Longer responses allowed | Shorter responses | Model-dependent |
| Stop Sequences | Strings that terminate generation when produced | N/A | N/A | None |

(OCI GenAI Concepts)

Exam trap: Temperature = 0 produces deterministic output (combine with a seed parameter for reproducible results). High temperature (approaching 1.0 and above) increases hallucination risk. The exam frequently tests which parameter combination is most likely to cause hallucinations -- the answer is high temperature + high top-p + low penalties.

Exam trap: Frequency penalty and presence penalty are different. Frequency penalty scales with how many times a token appeared. Presence penalty applies equally to all tokens that appeared at least once, regardless of count. Know this distinction.
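The two penalties differ by exactly one term, which a short sketch makes concrete (this is a generic illustration of the standard scheme, not the exact formula any particular OCI model implements):

```python
from collections import Counter
import numpy as np

def apply_penalties(logits, generated_ids, frequency_penalty, presence_penalty):
    """Adjust next-token logits based on tokens already generated."""
    logits = logits.copy()
    for token_id, count in Counter(generated_ids).items():
        logits[token_id] -= frequency_penalty * count   # scales with repetitions
        logits[token_id] -= presence_penalty            # flat, once per seen token
    return logits

logits = np.zeros(5)
generated = [2, 2, 2, 4]   # token 2 appeared 3 times, token 4 once
out = apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=1.0)
# token 2: -0.5*3 - 1.0 = -2.5 | token 4: -0.5*1 - 1.0 = -1.5 | unseen: 0.0
```

The frequency penalty hits token 2 three times as hard as token 4; the presence penalty hits both equally.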

How top-k and top-p interact: When both are set, the model first restricts candidates to the top-k tokens, then keeps only the most probable of those until their cumulative probability reaches p. For example, with k=20 and p=0.75, if the top 10 tokens already sum to 0.75, only those 10 are considered.
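The whole pipeline -- temperature, then top-k, then top-p -- fits in one function. This is a generic sketch of standard sampling logic, not OCI's implementation; the `top_k=0` convention for "consider all tokens" mirrors the default in the table above:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature + top-k + nucleus (top-p) sampling over one logits vector.
    top_k=0 means all tokens are eligible; temperature -> 0 approaches greedy."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())    # softmax (numerically stable)
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # token ids, most probable first
    if top_k > 0:
        order = order[:top_k]                # step 1: keep the top-k tokens
    keep = np.cumsum(probs[order]) <= top_p  # step 2: nucleus cut on survivors
    keep[0] = True                           # always keep the single best token
    candidates = order[keep]

    p = probs[candidates] / probs[candidates].sum()  # renormalize
    return int(rng.choice(candidates, p=p))

logits = [2.0, 1.0, 0.5, -1.0]
greedy = sample_next_token(logits, temperature=1e-6)  # near-deterministic: picks 0
```

Lowering temperature sharpens `probs` toward the argmax (greedy behavior); lowering `top_k` or `top_p` shrinks `candidates` -- matching the "more focused" column in the parameter table.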

5. Fine-Tuning Methods

Fine-tuning modifies model parameters using task-specific data. This is the key distinction from prompting, which leaves parameters unchanged. The exam heavily tests the differences between fine-tuning approaches. (Jvikraman GenAI Notes)

Fine-Tuning Methods Comparison

| Method | What It Modifies | Data Requirement | Compute Cost | Overfitting Risk |
|---|---|---|---|---|
| Vanilla (Full) Fine-Tuning | All or most model parameters | Large labeled dataset | Very high | High (especially with small datasets) |
| T-Few | Only a fraction of weights in specific transformer layers (~0.01% of parameters) | Small labeled dataset | Low | Low |
| LoRA (Low-Rank Adaptation) | Adds small trainable rank-decomposition matrices alongside frozen original weights | Moderate labeled dataset | Moderate | Moderate |
| QLoRA | LoRA applied to a 4-bit quantized base model | Moderate labeled dataset | Lower than LoRA | Moderate |
| Soft Prompting | Only the prepended soft prompt vectors; all model weights frozen | Small labeled dataset | Very low | Low |

T-Few in Depth (Heavily Tested)

T-Few is an additive Parameter-Efficient Fine-Tuning (PEFT) technique based on the (IA)^3 method from the paper "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning". Key facts:

  • Mechanism: Inserts learned vectors that rescale inner activations within attention and feed-forward modules. The original pre-trained weights remain frozen.
  • Parameter efficiency: Adds approximately 0.01% of the baseline model's size in new parameters.
  • Performance: Outperforms GPT-3 in-context learning on benchmarks while using over 1,000x fewer FLOPs during inference.
  • OCI usage: OCI Generative AI uses T-Few for Cohere Command R (08-2024) models. (OCI Fine-Tuning Methods)
  • Training: Requires a dedicated AI cluster with 2 model units. Minimum 1 unit-hour per job.
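The (IA)^3 mechanism behind T-Few reduces to elementwise rescaling. In this toy sketch (dimensions invented; a real model applies such vectors to the attention keys/values and the FFN inner activations), note that the scaling vector starts at ones, so the adapted model is initially identical to the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Frozen pretrained projection (stand-in for an attention or FFN weight).
W = rng.normal(size=(d_model, d_model))
x = rng.normal(size=d_model)

# (IA)^3-style learned rescaling vector: the ONLY trainable parameters here.
# Initialized to ones so training starts from the unmodified base model.
l_scale = np.ones(d_model)

activation = l_scale * (W @ x)            # elementwise rescale of the activation
assert np.allclose(activation, W @ x)     # identity at initialization
```

One d-dimensional vector per rescaled activation is how T-Few stays near 0.01% of the base model's parameter count.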

T-Few hyperparameters in OCI:

| Parameter | Default | Range |
|---|---|---|
| Total training epochs | 1 | 1-10 |
| Learning rate | 0.01 | 0.000005-0.1 |
| Training batch size | 16 | 8-32 |
| Early stopping patience | 10 | 0 (disabled) or 1-16 |
| Early stopping threshold | 0.001 | 0.001-0.1 |

(OCI Fine-Tuning Parameters)

LoRA in Depth

LoRA adds small trainable rank-decomposition (low-rank) matrices that transform inputs and outputs, rather than updating the original parameters, which stay frozen. The low-rank update is scaled by LoRA alpha / LoRA r before being added to the frozen path.

  • OCI usage: OCI uses LoRA for Meta Llama 3.3 70B, Llama 3.1 70B, and Cohere Command R (08-2024). (OCI Fine-Tuning Methods)
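The LoRA forward pass can be sketched directly. Dimensions here are toy values (though r and alpha match the OCI defaults below), and the zero-initialization of B is the standard trick that makes the adapted model start out identical to the base model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 8               # LoRA r=8, alpha=8 (the OCI defaults)

W = rng.normal(size=(d, d))          # frozen pretrained weight, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable, zero-init => no effect at start

def lora_forward(x):
    # Frozen path plus low-rank update, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
assert np.allclose(lora_forward(x), W @ x)   # identical to base model at init

# Trainable params: 2*d*r = 1,024 vs d*d = 4,096 frozen; savings grow with d.
```

Raising LoRA r buys the update more expressive capacity at the cost of more trainable parameters; the alpha / r scaling keeps the update's magnitude comparable as r changes.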

LoRA hyperparameters in OCI:

| Parameter | Default | Range |
|---|---|---|
| Total training epochs | 3 | 1+ |
| Learning rate | 0.0002 | 0-1.0 |
| Training batch size | 8 | 8-16 |
| LoRA r (rank) | 8 | 1-64 |
| LoRA alpha | 8 | 1-128 |
| LoRA dropout | 0.1 | Decimal < 1 |
| Early stopping patience | 15 | 0 (disabled) or 1+ |
| Early stopping threshold | 0.0001 | 0 or positive |

(OCI Fine-Tuning Parameters)

Training steps formula: totalTrainingSteps = (totalTrainingEpochs x size(trainingDataset)) / trainingBatchSize
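Applied to the defaults above, the formula works out as follows (the dataset size of 800 is an invented example, and integer division is an assumption for the case where the dataset does not divide evenly):

```python
def total_training_steps(epochs, dataset_size, batch_size):
    """totalTrainingSteps = (totalTrainingEpochs * size(trainingDataset)) / trainingBatchSize"""
    return (epochs * dataset_size) // batch_size

# Example: 3 epochs (LoRA default) over 800 samples, batch size 8 (LoRA default).
steps = total_training_steps(3, 800, 8)   # (3 * 800) / 8 = 300 steps
```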

OCI Model-to-Method Mapping (Memorize This)

| Base Model | Supported Fine-Tuning Method(s) |
|---|---|
| meta.llama-3.3-70b-instruct | LoRA |
| meta.llama-3.1-70b-instruct | LoRA |
| cohere.command-r-08-2024 | T-Few, LoRA |

(OCI Fine-Tuning Methods)

Exam trap: Cohere Command R supports both T-Few and LoRA. Meta Llama models support only LoRA. If a question asks which method is available for Llama fine-tuning, T-Few is wrong.

Fine-Tuning Evaluation Metrics

| Metric | What It Measures | Direction |
|---|---|---|
| Accuracy | Proportion of correct predictions out of total predictions on validation data | Higher is better |
| Loss | Magnitude of prediction error | Lower is better (should decrease during training) |

Exam trap: The exam tests the difference. Accuracy is "how many right." Loss is "how wrong." A model can have improving loss but fluctuating accuracy during early training. Both are reported on the OCI model details page under "Model Performance."

6. Emerging LLM Topics

Code Models

Code models are LLMs trained on code, comments, and documentation. They assist with code generation, completion, debugging, and explanation. Key examples: Codex (OpenAI), Code Llama (Meta), StarCoder (BigCode). In OCI, xAI Grok Code Fast 1 is available as a coding-focused model supporting TypeScript, Python, Java, Rust, C++, and Go. (OCI Pretrained Models)

Multi-Modal Models

Multi-modal LLMs process and generate across multiple data types (text, images, audio, video). In OCI, several multi-modal models are available:

| Model | Modalities | Notes |
|---|---|---|
| Cohere Command A Vision | Text + Images | Visual data understanding (images, charts, documents) |
| Llama 3.2 90B Vision | Text + Images | Largest vision-capable Llama model |
| Llama 3.2 11B Vision | Text + Images | Compact multimodal option |
| Llama 4 Maverick / Scout | Text + Images | Mixture of Experts (MoE) architecture |
| Gemini 2.5 Pro/Flash | Text + Images | Google multimodal models |
| Cohere Embed 4 | Text + Images (embeddings) | Multimodal embeddings for search |

(OCI Pretrained Models)

Diffusion models (DALL-E, Stable Diffusion) produce complex outputs simultaneously rather than token-by-token. They work well for images because image data is continuous. They are difficult to apply to text generation because text is categorical/discrete. (DBExam Sample Questions)

Exam trap: The exam asks why diffusion models are hard to apply to text. The answer is that text representation is categorical (discrete tokens), unlike continuous image pixel data.

Language Agents

Language agents use LLMs as reasoning engines that can take actions in the real world. Key concepts:

| Concept | Description |
|---|---|
| ReAct | Combines Reasoning and Acting -- the LLM alternates between thinking about what to do and taking actions (tool calls) based on that reasoning |
| Toolformer | LLM that learns to call external APIs/tools autonomously |
| Function Calling | Structured mechanism where the LLM outputs a function name and arguments, which external code executes |
| Bootstrapped Reasoning | Agent improves its reasoning through iterative self-generated examples |
In OCI, agentic capabilities are supported by models like Cohere Command A Reasoning, Grok 4.1 Fast, and Llama 4 Maverick/Scout. (OCI Pretrained Models)

Hallucination

Hallucination is the phenomenon where an LLM generates factually incorrect information or unrelated content as if it were true. There is no known methodology to reliably eliminate hallucination entirely. (Jvikraman GenAI Notes)

Causes:

  • Training data gaps, biases, or errors
  • High temperature during decoding (flattened probability distribution)
  • Lack of grounding in source documents
  • Ambiguous or poorly constructed prompts

Mitigation strategies:

| Strategy | How It Helps |
|---|---|
| Lower temperature | Sharpens probability distribution toward most likely (factual) tokens |
| RAG (Retrieval-Augmented Generation) | Grounds responses in retrieved source documents |
| Attribution/Grounding | Trains models to output citations; generated text is "grounded" if a document supports it |
| OCI Guardrails | Content moderation, prompt injection defense, and PII handling -- configurable safety controls on OCI endpoints |
| Fine-tuning on domain data | Teaches the model domain-specific facts |
| Human evaluation | Review outputs for factual correctness |

Exam trap: RAG does not eliminate hallucination. It reduces it by providing context. The model can still hallucinate information not present in the retrieved documents. The RAG Triad evaluation framework (Context Relevance, Groundedness, Answer Relevance) tests whether the full pipeline is working correctly.

Model Evaluation Metrics

| Metric | What It Measures | Used For |
|---|---|---|
| Perplexity | How well the model predicts a sample; lower is better | General language model quality |
| BLEU | N-gram overlap between generated and reference text | Machine translation |
| ROUGE | Recall-oriented overlap between generated and reference text | Summarization |
| Human Evaluation | Human judges rate quality, factual accuracy, coherence | Gold standard but expensive and slow |
| Accuracy | Correct predictions / total predictions | Classification tasks, fine-tuning evaluation |
| Loss | Prediction error magnitude (cross-entropy loss) | Training monitoring |
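Perplexity and cross-entropy loss are two views of the same quantity: perplexity is the exponential of the mean negative log-likelihood. The token probabilities below are invented to show the direction of the metric:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens).
    A perplexity of 1 means the model predicted every token with certainty."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95])   # model predicts well  -> low value
uncertain = perplexity([0.2, 0.1, 0.25])   # model predicts badly -> high value
assert confident < uncertain
```

This is why "lower is better" for both loss and perplexity: both shrink as the model assigns more probability to the tokens that actually occur.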

7. OCI Generative AI Service -- Domain 1 Context

While OCI service details are primarily tested in Domain 2, Domain 1 questions may reference these service concepts:

| Feature | Description |
|---|---|
| On-Demand Inference | Pay-per-call, no upfront commitment, suitable for experimentation |
| Dedicated AI Clusters | Reserved GPU resources for hosting and fine-tuning; required for custom models |
| Model Endpoints | Designated points on dedicated clusters that accept inference requests |
| Guardrails | Content moderation, prompt injection defense, PII handling -- must be explicitly enabled on endpoints |
| Playground | Interactive testing interface for pretrained and custom model endpoints |

(OCI GenAI Overview)

Quick-Reference: Decision Tree for Customization

Use this to answer "which technique should you use" questions:

Is the model performing well enough with standard prompting?
  YES --> Use zero-shot or few-shot prompting (no cost, no training)
  NO --> Do you have labeled training data?
    NO --> Use few-shot prompting with more examples, or RAG for grounding
    YES --> Is the dataset small (<100K samples)?
      YES --> Use T-Few (Cohere) or LoRA (Llama) for parameter-efficient fine-tuning
      NO --> Consider vanilla fine-tuning (if compute budget allows)
Does the data change frequently?
  YES --> Use RAG (retrieves current information, no retraining needed)
  NO --> Fine-tuning is acceptable
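For drilling, the tree above can be encoded as a function. This is a simplification -- it folds the guide's separate data-freshness question in as the first check, since frequently changing data points to RAG regardless of the other answers:

```python
def pick_customization(prompting_good_enough, has_labeled_data,
                       small_dataset, data_changes_often):
    """Encode the decision tree above; return values are the guide's answers."""
    if data_changes_often:
        return "RAG"                              # current info, no retraining
    if prompting_good_enough:
        return "zero/few-shot prompting"          # no cost, no training
    if not has_labeled_data:
        return "few-shot prompting or RAG"        # nothing to fine-tune on
    if small_dataset:
        return "T-Few (Cohere) or LoRA (Llama)"   # parameter-efficient tuning
    return "vanilla fine-tuning"                  # if compute budget allows

answer = pick_customization(prompting_good_enough=False, has_labeled_data=True,
                            small_dataset=True, data_changes_often=False)
```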

References