Domain 1: Fundamentals of Large Language Models (20%)
Domain 1 of the 1Z0-1127-25 Oracle Cloud Infrastructure 2025 Generative AI Professional exam covers the theoretical and practical foundations of large language models. At roughly 20% of the exam, this domain accounts for approximately 10 questions. It is the conceptual bedrock for the remaining three domains -- every question about the OCI Generative AI service, RAG, and agents assumes you understand these fundamentals.
The exam tests six areas within this domain:
- Transformer architecture and LLM types
- Tokenization and embeddings
- Prompt engineering techniques
- Decoding strategies and inference parameters
- Fine-tuning methods (with emphasis on T-Few and LoRA)
- Emerging LLM topics (code models, multi-modal models, language agents, hallucination)
1. Transformer Architecture and LLM Types
The Transformer
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is the foundation of every modern LLM. The exam expects you to understand its core mechanisms, not implement them.
Key components:
| Component | Function | Exam Relevance |
|---|---|---|
| Self-Attention | Allows each token to attend to every other token in the sequence, computing relevance scores | Core mechanism -- understand that it enables context-awareness across the full input |
| Multi-Head Attention | Runs multiple self-attention operations in parallel, each learning different relationship patterns | Enables the model to capture different types of dependencies (syntactic, semantic, positional) simultaneously |
| Positional Encoding | Injects sequence order information since Transformers process all tokens in parallel, not sequentially | Without it, "dog bites man" and "man bites dog" would be identical to the model |
| Feed-Forward Network | Applied to each position independently after attention; adds non-linear transformation capacity | Present in every transformer layer |
| Layer Normalization | Stabilizes training by normalizing activations within each layer | Mentioned in context of training stability |
Three Model Architectures
The Transformer spawned three architectural patterns. This distinction is heavily tested:
| Architecture | Examples | What It Does | Use Cases |
|---|---|---|---|
| Encoder-only | BERT, MiniLM | Produces embeddings (vector representations of input text); bidirectional context | Semantic search, text classification, sentiment analysis, named entity recognition |
| Decoder-only | GPT-4, Llama, Cohere Command | Generates text token-by-token autoregressively (left-to-right) | Text generation, chat, summarization, code generation |
| Encoder-Decoder | T5, BART | Encodes full input, then decodes output; bidirectional encoding with autoregressive decoding | Translation, summarization, question answering |
Exam trap: BERT is an encoder -- it does not generate text. GPT and Llama are decoders -- they generate text but are not designed to produce embeddings natively (though embeddings can be extracted). If a question asks which model type produces embeddings for semantic search, the answer is encoder models.
What Makes a Model "Large"
A large language model (LLM) is a probabilistic model of text with a very large number of parameters. There is no agreed-upon threshold for "large." The exam focuses on the practical implication: LLMs can perform tasks they were not explicitly trained on through in-context learning. (OCI GenAI Concepts)
2. Tokenization and Embeddings
Tokenization
Tokens are the fundamental input units for LLMs -- words, subwords, or individual characters/punctuation. The model never sees raw text; it sees token IDs. (OCI GenAI Concepts)
Examples from Oracle documentation:
- "apple" = 1 token
- "friendship" = 2 tokens ("friend" + "ship")
- "don't" = 2 tokens ("don" + "'t")
- Rule of thumb: ~4 characters per token
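The ~4 characters/token rule of thumb can be turned into a quick estimator. This is a rough heuristic only -- real tokenizers split on learned subword boundaries, so actual counts will differ:

```python
def estimate_tokens(text: str) -> int:
    """Rough token-count estimate using the ~4 characters/token rule of thumb.
    Real tokenizers (BPE, WordPiece, SentencePiece) will produce different counts."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("apple"))        # 5 chars -> estimates 1 token
print(estimate_tokens("friendship"))   # 10 chars -> estimates 2 tokens
```

Useful for back-of-the-envelope cost and context-window sizing, not for billing-accurate counts.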
Tokenization algorithms (know the names and distinctions):
| Algorithm | Key Characteristic | Used By |
|---|---|---|
| Byte Pair Encoding (BPE) | Iteratively merges the most frequent character pairs into subword tokens | GPT family, Llama |
| WordPiece | Similar to BPE but uses likelihood-based merging rather than frequency | BERT |
| SentencePiece | Language-agnostic; treats the input as raw bytes, no pre-tokenization needed | T5, multilingual models |
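The core BPE idea -- iteratively merge the most frequent adjacent pair -- can be sketched on a toy corpus. The corpus and frequencies below are invented for illustration; production BPE implementations add end-of-word markers and run thousands of merges:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus (one BPE merge step).
    words: {tuple_of_symbols: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Merge every occurrence of `pair` into a single new symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
pair = most_frequent_pair(corpus)   # ('l', 'o') occurs 10 times
corpus = merge_pair(pair, corpus)   # 'l'+'o' fused into the subword 'lo'
```

Repeating this loop is what grows a subword vocabulary out of raw characters.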
Exam trap: The exam may ask about the relationship between context window size (measured in tokens) and model capability. Larger context windows allow more input but increase computational cost, because self-attention scales quadratically with sequence length.
Embeddings
Embeddings are numerical vector representations that capture the semantic meaning of text. They transform words, phrases, or entire documents into arrays of numbers (typically 384 or 1024 dimensions in OCI). (OCI GenAI Concepts)
Similarity measurement methods:
| Method | What It Measures | Key Detail |
|---|---|---|
| Cosine Similarity | Directional similarity between vectors (angle between them) | A similarity of 1 (cosine distance of 0) means the vectors point in the same direction (similar); 0 means orthogonal (unrelated). Equivalent to the dot product of normalized vectors. |
| Dot Product | Combined magnitude and direction | Higher values indicate vectors pointing in the same direction |
| K-Nearest Neighbors / ANN | Approximate nearest neighbor search | Algorithms such as HNSW, implemented in libraries like FAISS and Annoy, enable efficient similarity retrieval at scale |
OCI embedding models: Cohere Embed 4 (multimodal -- text + images), Cohere Embed English 3, Cohere Embed Multilingual 3, plus Light variants for lower latency. (OCI Pretrained Models)
Exam trap: Embedding-based search is semantic (meaning-based), not keyword-based. A keyword search for "car" will not match "automobile." Embedding search will, because their vectors are close in semantic space.
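Cosine similarity and dot product are both simple arithmetic over the vectors. A minimal sketch with invented 4-dimensional vectors (real OCI embeddings have 384 or 1024 dimensions, and the values here are made up to illustrate the "car vs. automobile" point):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product normalized by both magnitudes; range -1..1, 1 = same direction."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical embeddings -- values invented for illustration
car = [0.8, 0.1, 0.3, 0.5]
automobile = [0.79, 0.12, 0.28, 0.52]
banana = [0.1, 0.9, 0.7, 0.05]

print(cosine_similarity(car, automobile))  # close to 1.0: semantically similar
print(cosine_similarity(car, banana))      # much lower: unrelated
```

This is why an embedding search for "car" matches "automobile": nearby vectors, high cosine similarity, even with zero keyword overlap.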
3. Prompt Engineering
Prompt engineering is the iterative process of crafting input text to optimize LLM outputs. It modifies the model's output probability distribution without changing any model parameters. This distinction between prompting (no parameter changes) and fine-tuning (parameter changes) is fundamental to the exam. (OCI GenAI Concepts)
In-Context Learning Techniques
| Technique | Description | When to Use |
|---|---|---|
| Zero-shot | No examples provided; task description only | Model already understands the task well |
| Few-shot (k-shot) | Task description plus k examples of input-output pairs | Model needs demonstrations to understand the desired format or behavior |
| Chain-of-Thought (CoT) | Instructs the model to emit intermediate reasoning steps before the final answer | Complex multi-step reasoning tasks (math, logic, multi-hop questions) |
| Zero-shot CoT | Adds "Let's think step by step" without providing reasoning examples | Simpler than full CoT but still improves reasoning |
| Least-to-Most | Decomposes a complex problem into subproblems, solves easiest first | Problems that build on intermediate results |
| Step-Back | Identifies high-level concepts or principles before answering the specific question | Requires abstraction from specifics to general principles |
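A k-shot prompt is just structured text. A minimal sketch of assembling one -- the `Input:`/`Output:` labels and the sentiment task are arbitrary choices for illustration, not a required format:

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a k-shot prompt: task description, k input/output
    demonstrations, then the new input for the model to complete."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as Positive or Negative.",
    [("Great product, works perfectly!", "Positive"),
     ("Broke after two days.", "Negative")],
    "Exceeded my expectations.",
)
print(prompt)
```

With an empty `examples` list this degenerates to a zero-shot prompt, which is exactly the distinction the table above draws.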
Soft Prompting
Soft prompting (also called prompt tuning) is a hybrid between prompting and fine-tuning. It adds a small number of trainable parameters (soft prompts) to the model's input layer while keeping all original model weights frozen. (Jvikraman GenAI Notes)
Exam trap: Soft prompting is NOT the same as prompt engineering. Prompt engineering uses natural language text as input. Soft prompting uses learned continuous vectors prepended to the input. Soft prompting modifies parameters (the soft prompt vectors); prompt engineering does not.
Preamble (System Prompt)
In OCI Generative AI, the preamble is the initial context or guiding message sent to a chat model. It sets the model's role, tone, and behavioral constraints. The default preamble for Cohere Command R+ is: "You are Command. You are an extremely capable large language model built by Cohere." This is customizable per API call. (OCI GenAI Concepts)
Prompt Security
Two attack vectors the exam covers:
- Prompt Injection / Jailbreaking: Malicious input designed to override system instructions or produce harmful outputs. OCI Guardrails includes a dedicated prompt injection defense. (OCI GenAI Concepts)
- Memorization Attacks: Attempts to coerce the model into repeating its training data or system prompt verbatim.
4. Decoding Strategies and Inference Parameters
Decoding is the process of generating output text from an LLM. It happens iteratively, one token at a time: the model predicts a probability distribution over the vocabulary, selects a token, appends it, and repeats. (Jvikraman GenAI Notes)
Decoding Methods
| Method | How It Works | Characteristics |
|---|---|---|
| Greedy Decoding | Selects the highest-probability token at each step | Deterministic, fast, but can produce repetitive or suboptimal text |
| Beam Search | Maintains top-N candidate sequences at each step, selecting the overall highest-probability sequence | Better than greedy but more expensive; not commonly used in chat models |
| Sampling | Randomly selects from the probability distribution, controlled by temperature/top-k/top-p | Non-deterministic; produces more varied, natural-sounding output |
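The greedy-vs-sampling distinction, and the role of temperature, can be shown with a softmax over invented next-token logits (the logit values below are arbitrary):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales sharpness.
    Low temperature peaks the distribution, high temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]                    # hypothetical next-token scores
sharp = softmax(logits, temperature=0.2)         # peaked: near-deterministic
flat = softmax(logits, temperature=2.0)          # flattened: more random

greedy_choice = max(range(len(logits)), key=lambda i: logits[i])      # argmax
sampled_choice = random.choices(range(len(logits)), weights=flat)[0]  # sampling
```

Greedy decoding always picks index 0 here; sampling with the flattened distribution can pick any token, which is the source of both variety and hallucination risk at high temperature.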
Inference Parameters (Critical for Exam)
These parameters are directly configurable in the OCI Generative AI Playground and API:
| Parameter | What It Controls | Effect of Higher Values | Effect of Lower Values | OCI Default |
|---|---|---|---|---|
| Temperature | Sharpness of the probability distribution | Flattens distribution, more random/creative output, higher hallucination risk | Peaks distribution around most likely token, more deterministic | Model-dependent |
| Top-k | Number of candidate tokens considered | More candidates = more randomness | Fewer candidates = more focused | 0 or -1 (all tokens) |
| Top-p (Nucleus Sampling) | Cumulative probability threshold for candidate tokens | Higher p = more tokens eligible = more variety | Lower p = fewer tokens = more focused | Model-dependent |
| Frequency Penalty | Penalizes tokens based on how many times they have appeared | Reduces repetition proportional to frequency | No penalty (0), negative values encourage repetition | 0 |
| Presence Penalty | Penalizes tokens that have appeared at all, regardless of frequency | Encourages novel tokens | No penalty (0), negative values encourage repetition | 0 |
| Max Output Tokens | Hard limit on generated sequence length | Longer responses allowed | Shorter responses | Model-dependent |
| Stop Sequences | Strings that terminate generation when produced | N/A | N/A | None |
Exam trap: Temperature = 0 produces deterministic output (use with a seed parameter for identical results). High temperature (approaching 1.0 and above) increases the risk of hallucination. The exam frequently tests which parameter combination is most likely to cause hallucinations -- the answer is high temperature + high top-p + low penalties.
Exam trap: Frequency penalty and presence penalty are different. Frequency penalty scales with how many times a token appeared. Presence penalty applies equally to all tokens that appeared at least once, regardless of count. Know this distinction.
How top-k and top-p interact: When both are set, the model first selects the top-k tokens, then from those, keeps only tokens whose cumulative probability reaches p. For example, if k=20 but the top 10 tokens already sum to 0.75 (and p=0.75), only those 10 are considered.
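That two-stage filtering can be sketched directly. The token strings and probabilities below are invented; the function reproduces the top-k-then-top-p logic described above:

```python
def filter_candidates(probs, k, p):
    """Apply top-k filtering, then nucleus (top-p) filtering.
    probs: {token: probability}. Returns the surviving candidate tokens."""
    # Step 1: keep only the k highest-probability tokens
    top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Step 2: from those, keep the smallest prefix whose cumulative
    # probability reaches p
    kept, cumulative = [], 0.0
    for token, prob in top_k:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

probs = {"the": 0.4, "a": 0.25, "an": 0.15, "this": 0.1, "that": 0.1}
print(filter_candidates(probs, k=4, p=0.75))   # ['the', 'a', 'an']
```

With k=4, four tokens survive step 1, but the top three already sum to 0.80 >= 0.75, so only those three remain eligible for sampling -- the same interaction as the k=20/p=0.75 example in the text.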
5. Fine-Tuning Methods
Fine-tuning modifies model parameters using task-specific data. This is the key distinction from prompting, which leaves parameters unchanged. The exam heavily tests the differences between fine-tuning approaches. (Jvikraman GenAI Notes)
Fine-Tuning Methods Comparison
| Method | What It Modifies | Data Requirement | Compute Cost | Overfitting Risk |
|---|---|---|---|---|
| Vanilla (Full) Fine-Tuning | All or most model parameters | Large labeled dataset | Very high | High (especially with small datasets) |
| T-Few | Only a fraction of weights in specific transformer layers (~0.01% of parameters) | Small labeled dataset | Low | Low |
| LoRA (Low-Rank Adaptation) | Adds small trainable rank-decomposition matrices alongside frozen original weights | Moderate labeled dataset | Moderate | Moderate |
| QLoRA | LoRA applied to a 4-bit quantized base model | Moderate labeled dataset | Lower than LoRA | Moderate |
| Soft Prompting | Only the prepended soft prompt vectors; all model weights frozen | Small labeled dataset | Very low | Low |
T-Few in Depth (Heavily Tested)
T-Few is an additive Parameter-Efficient Fine-Tuning (PEFT) technique based on the (IA)^3 method from the paper "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning". Key facts:
- Mechanism: Inserts learned vectors that rescale inner activations within attention and feed-forward modules. The original pre-trained weights remain frozen.
- Parameter efficiency: Adds approximately 0.01% of the baseline model's size in new parameters.
- Performance: Outperforms GPT-3 in-context learning on benchmarks while using over 1,000x fewer FLOPs during inference.
- OCI usage: OCI Generative AI uses T-Few for Cohere Command R (08-2024) models. (OCI Fine-Tuning Methods)
- Training: Requires a dedicated AI cluster with 2 model units. Minimum 1 unit-hour per job.
T-Few hyperparameters in OCI:
| Parameter | Default | Range |
|---|---|---|
| Total training epochs | 1 | 1-10 |
| Learning rate | 0.01 | 0.000005-0.1 |
| Training batch size | 16 | 8-32 |
| Early stopping patience | 10 | 0 (disabled) or 1-16 |
| Early stopping threshold | 0.001 | 0.001-0.1 |
LoRA in Depth
LoRA adds small trainable rank-decomposition (low-rank) matrices alongside the frozen original weights, rather than updating all original parameters. The low-rank update is scaled by the factor LoRA alpha / LoRA r.
- OCI usage: OCI uses LoRA for Meta Llama 3.3 70B, Llama 3.1 70B, and Cohere Command R (08-2024). (OCI Fine-Tuning Methods)
LoRA hyperparameters in OCI:
| Parameter | Default | Range |
|---|---|---|
| Total training epochs | 3 | 1+ |
| Learning rate | 0.0002 | 0-1.0 |
| Training batch size | 8 | 8-16 |
| LoRA r (rank) | 8 | 1-64 |
| LoRA alpha | 8 | 1-128 |
| LoRA dropout | 0.1 | Decimal < 1 |
| Early stopping patience | 15 | 0 (disabled) or 1+ |
| Early stopping threshold | 0.0001 | 0 or positive |
Training steps formula: totalTrainingSteps = (totalTrainingEpochs x size(trainingDataset)) / trainingBatchSize
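The formula is simple arithmetic; a quick check with example numbers (the dataset size below is invented for illustration):

```python
def total_training_steps(epochs: int, dataset_size: int, batch_size: int) -> int:
    """totalTrainingSteps = (totalTrainingEpochs x size(trainingDataset)) / trainingBatchSize"""
    return (epochs * dataset_size) // batch_size

# LoRA defaults (3 epochs, batch size 8) over a hypothetical 1,000-sample dataset
print(total_training_steps(3, 1000, 8))   # 375
```

Worth being able to compute by hand: exam questions can give two of the values and ask for the third.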
OCI Model-to-Method Mapping (Memorize This)
| Base Model | Supported Fine-Tuning Method(s) |
|---|---|
| meta.llama-3.3-70b-instruct | LoRA |
| meta.llama-3.1-70b-instruct | LoRA |
| cohere.command-r-08-2024 | T-Few, LoRA |
Exam trap: Cohere Command R supports both T-Few and LoRA. Meta Llama models support only LoRA. If a question asks which method is available for Llama fine-tuning, T-Few is wrong.
Fine-Tuning Evaluation Metrics
| Metric | What It Measures | Direction |
|---|---|---|
| Accuracy | Proportion of correct predictions out of total predictions on validation data | Higher is better |
| Loss | Magnitude of prediction error | Lower is better (should decrease during training) |
Exam trap: The exam tests the difference. Accuracy is "how many right." Loss is "how wrong." A model can have improving loss but fluctuating accuracy during early training. Both are reported on the OCI model details page under "Model Performance."
6. Emerging LLM Topics
Code Models
Code models are LLMs trained on code, comments, and documentation. They assist with code generation, completion, debugging, and explanation. Key examples: Codex (OpenAI), Code Llama (Meta), StarCoder (BigCode). In OCI, xAI Grok Code Fast 1 is available as a coding-focused model supporting TypeScript, Python, Java, Rust, C++, and Go. (OCI Pretrained Models)
Multi-Modal Models
Multi-modal LLMs process and generate across multiple data types (text, images, audio, video). In OCI, several multi-modal models are available:
| Model | Modalities | Notes |
|---|---|---|
| Cohere Command A Vision | Text + Images | Visual data understanding (images, charts, documents) |
| Llama 3.2 90B Vision | Text + Images | Largest vision-capable Llama model |
| Llama 3.2 11B Vision | Text + Images | Compact multimodal option |
| Llama 4 Maverick / Scout | Text + Images | Mixture of Experts (MoE) architecture |
| Gemini 2.5 Pro/Flash | Text + Images | Google multimodal models |
| Cohere Embed 4 | Text + Images (embeddings) | Multimodal embeddings for search |
Diffusion models (DALL-E, Stable Diffusion) produce complex outputs simultaneously rather than token-by-token. They work well for images because image data is continuous. They are difficult to apply to text generation because text is categorical/discrete. (DBExam Sample Questions)
Exam trap: The exam asks why diffusion models are hard to apply to text. The answer is that text representation is categorical (discrete tokens), unlike continuous image pixel data.
Language Agents
Language agents use LLMs as reasoning engines that can take actions in the real world. Key concepts:
| Concept | Description |
|---|---|
| ReAct | Combines Reasoning and Acting -- the LLM alternates between thinking about what to do and taking actions (tool calls) based on that reasoning |
| Toolformer | LLM that learns to call external APIs/tools autonomously |
| Function Calling | Structured mechanism where the LLM outputs a function name and arguments, which external code executes |
| Bootstrapped Reasoning | Agent improves its reasoning through iterative self-generated examples |
In OCI, agentic capabilities are supported by models like Cohere Command A Reasoning, Grok 4.1 Fast, and Llama 4 Maverick/Scout. (OCI Pretrained Models)
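The function-calling mechanism in the table above can be sketched as a minimal loop: the LLM emits a structured call, and external code executes it. Everything here is hypothetical -- `get_weather`, the JSON schema, and the tool registry are invented for illustration, not an OCI or Cohere API:

```python
import json

# Hypothetical tool: in a real agent this would hit an external API
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def execute_function_call(llm_output: str) -> str:
    """Parse a structured function call emitted by an LLM and run the
    matching registered tool. Expects {"name": ..., "arguments": {...}}."""
    call = json.loads(llm_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulated model output: the LLM chose a function name and arguments
llm_output = '{"name": "get_weather", "arguments": {"city": "Austin"}}'
print(execute_function_call(llm_output))   # Sunny in Austin
```

A ReAct agent wraps this in a loop: the tool result is fed back to the model, which reasons about it and decides whether to call another tool or answer.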
Hallucination
Hallucination is the phenomenon where an LLM generates factually incorrect information or unrelated content as if it were true. There is no known methodology to reliably eliminate hallucination entirely. (Jvikraman GenAI Notes)
Causes:
- Training data gaps, biases, or errors
- High temperature during decoding (flattened probability distribution)
- Lack of grounding in source documents
- Ambiguous or poorly constructed prompts
Mitigation strategies:
| Strategy | How It Helps |
|---|---|
| Lower temperature | Sharpens probability distribution toward most likely (factual) tokens |
| RAG (Retrieval-Augmented Generation) | Grounds responses in retrieved source documents |
| Attribution/Grounding | Trains models to output citations; generated text is "grounded" if a document supports it |
| OCI Guardrails | Content moderation, prompt injection defense, and PII handling -- configurable safety controls on OCI endpoints |
| Fine-tuning on domain data | Teaches the model domain-specific facts |
| Human evaluation | Review outputs for factual correctness |
Exam trap: RAG does not eliminate hallucination. It reduces it by providing context. The model can still hallucinate information not present in the retrieved documents. The RAG Triad evaluation framework (Context Relevance, Groundedness, Answer Relevance) tests whether the full pipeline is working correctly.
Model Evaluation Metrics
| Metric | What It Measures | Used For |
|---|---|---|
| Perplexity | How well the model predicts a sample; lower is better | General language model quality |
| BLEU | N-gram overlap between generated and reference text | Machine translation |
| ROUGE | Recall-oriented overlap between generated and reference text | Summarization |
| Human Evaluation | Human judges rate quality, factual accuracy, coherence | Gold standard but expensive and slow |
| Accuracy | Correct predictions / total predictions | Classification tasks, fine-tuning evaluation |
| Loss | Prediction error magnitude (cross-entropy loss) | Training monitoring |
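Perplexity's definition -- the exponential of the mean negative log-likelihood -- is easy to compute directly. The probability lists below are invented to show the "lower is better" direction:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) of the probabilities
    the model assigned to the true tokens. Lower means better prediction."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.9, 0.8, 0.95, 0.85]   # model assigns high probability to true tokens
uncertain = [0.2, 0.1, 0.3, 0.25]

print(perplexity(confident))   # close to 1 (good)
print(perplexity(uncertain))   # several times higher (worse)
```

A model that assigns probability 1.0 to every true token has perplexity exactly 1, the theoretical floor; this also shows the direct link between cross-entropy loss and perplexity.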
7. OCI Generative AI Service -- Domain 1 Context
While OCI service details are primarily tested in Domain 2, Domain 1 questions may reference these service concepts:
| Feature | Description |
|---|---|
| On-Demand Inference | Pay-per-call, no upfront commitment, suitable for experimentation |
| Dedicated AI Clusters | Reserved GPU resources for hosting and fine-tuning; required for custom models |
| Model Endpoints | Designated points on dedicated clusters that accept inference requests |
| Guardrails | Content moderation, prompt injection defense, PII handling -- must be explicitly enabled on endpoints |
| Playground | Interactive testing interface for pretrained and custom model endpoints |
Quick-Reference: Decision Tree for Customization
Use this to answer "which technique should you use" questions:
Is the model performing well enough with standard prompting?
- YES --> Use zero-shot or few-shot prompting (no cost, no training)
- NO --> Do you have labeled training data?
  - NO --> Use few-shot prompting with more examples, or RAG for grounding
  - YES --> Is the dataset small (<100K samples)?
    - YES --> Use T-Few (Cohere) or LoRA (Llama) for parameter-efficient fine-tuning
    - NO --> Consider vanilla fine-tuning (if compute budget allows)

Does the data change frequently?
- YES --> Use RAG (retrieves current information, no retraining needed)
- NO --> Fine-tuning is acceptable
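One possible encoding of the decision tree as a function, useful for self-testing against "which technique" questions. The parameter names and the ordering of checks are my simplification of the tree, not an official OCI decision procedure:

```python
def choose_customization(prompting_sufficient: bool, has_labeled_data: bool,
                         small_dataset: bool, data_changes_often: bool,
                         base_model: str) -> str:
    """Simplified encoding of the customization decision tree.
    base_model: 'cohere' or 'llama' (determines the PEFT method)."""
    if prompting_sufficient:
        return "zero-/few-shot prompting"
    if data_changes_often:
        return "RAG"                       # current info without retraining
    if not has_labeled_data:
        return "few-shot prompting or RAG"
    if small_dataset:
        # Cohere Command R supports T-Few; Meta Llama supports only LoRA
        return "T-Few" if base_model == "cohere" else "LoRA"
    return "vanilla fine-tuning"

print(choose_customization(False, True, True, False, "llama"))   # LoRA
```

Note the Llama branch can never return T-Few -- the same exam trap called out in the fine-tuning section.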
References
- OCI Generative AI Service Overview
- OCI Generative AI Concepts
- Choosing a Fine-Tuning Method
- Fine-Tuning Hyperparameters
- OCI Pretrained Models
- K21 Academy OCI GenAI Certification Guide
- T-Few Paper: "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"
- Jvikraman: Generative AI Concepts Study Notes
- DBExam: OCI 1Z0-1127-25 Sample Questions