Reference

Domain 4: Ensuring Successful Operation of a Cloud Solution (~20%)

Domain 4 accounts for approximately 20% of the Associate Cloud Engineer exam, translating to roughly 10-12 questions. This is one of the most operationally focused domains: you must know how to manage running resources, troubleshoot issues, configure monitoring and logging, and handle day-to-day operational tasks across Compute Engine, GKE, Cloud Run, storage, databases, and networking. The exam tests six sub-domains with heavy emphasis on gcloud commands and Console workflows.


4.1 Managing Compute Engine Resources

Starting, Stopping, and Deleting Instances

Compute Engine VMs have distinct lifecycle states that determine billing and behavior.

Action Command Billing Persistent Disk
Start gcloud compute instances start INSTANCE Resumes full billing Preserved
Stop gcloud compute instances stop INSTANCE No CPU/memory charge; disk charges continue Preserved
Suspend gcloud compute instances suspend INSTANCE No CPU/memory charge; charges for suspended state memory + disk Preserved + memory saved
Delete gcloud compute instances delete INSTANCE All billing stops Boot disk deleted by default; non-boot disks preserved unless --delete-disks=all
Reset gcloud compute instances reset INSTANCE Continues Preserved (hard reset, no graceful shutdown)

# Stop an instance
gcloud compute instances stop my-vm --zone=us-central1-a

# Start an instance
gcloud compute instances start my-vm --zone=us-central1-a

# Delete an instance but keep all disks
gcloud compute instances delete my-vm --zone=us-central1-a --keep-disks=all

# Delete an instance and all attached disks
gcloud compute instances delete my-vm --zone=us-central1-a --delete-disks=all

Exam trap: When you delete a VM, the boot disk is deleted by default. Non-boot (additional) disks are NOT deleted by default. If a question asks about preserving data after VM deletion, you need --keep-disks=all or detach the disk first.

Editing VM Configuration

Some properties can be changed while the VM is running; others require a stop first.

Property Requires Stop? Command
Machine type Yes gcloud compute instances set-machine-type INSTANCE --machine-type=e2-standard-4
Labels No gcloud compute instances update INSTANCE --update-labels=env=prod
Metadata No gcloud compute instances add-metadata INSTANCE --metadata=key=value
Tags (network) No gcloud compute instances add-tags INSTANCE --tags=http-server
Service account Yes Must stop, then use gcloud compute instances set-service-account
Attached disks No (attach); Yes (detach boot) gcloud compute instances attach-disk INSTANCE --disk=DISK_NAME

Exam trap: Changing the machine type requires stopping the instance first. You cannot resize a running VM's CPU/memory. This is a frequent exam question.

SSH and RDP Connections

# SSH into a Linux VM (uses IAP tunnel by default if no external IP)
gcloud compute ssh my-vm --zone=us-central1-a

# SSH through IAP explicitly
gcloud compute ssh my-vm --zone=us-central1-a --tunnel-through-iap

# SSH with a specific user
gcloud compute ssh user@my-vm --zone=us-central1-a

# Create an RDP tunnel for Windows VMs
gcloud compute start-iap-tunnel my-windows-vm 3389 \
  --local-host-port=localhost:3389 --zone=us-central1-a

OS Login is Google's recommended method for managing SSH access. When enabled, SSH keys are managed centrally through IAM rather than per-instance metadata.

Feature Metadata SSH Keys OS Login
Key management Per-instance or project-level metadata Centralized via Cloud Identity/Workspace
IAM integration None Full: roles/compute.osLogin (standard), roles/compute.osAdminLogin (sudo)
2FA support No Yes (with roles/compute.osLogin + 2FA configured)
Audit Limited Full audit via Cloud Audit Logs

# Enable OS Login at project level
gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE

# Enable OS Login on a specific instance
gcloud compute instances add-metadata my-vm --metadata enable-oslogin=TRUE

Exam trap: OS Login overrides metadata-based SSH keys when enabled. If a user cannot SSH after OS Login is enabled, they need roles/compute.osLogin (or roles/compute.osAdminLogin for sudo access) on the project or instance.

Snapshots

Snapshots are incremental backups of persistent disks. After the first full snapshot, subsequent snapshots only store changed blocks.

# Create a snapshot
gcloud compute disks snapshot my-disk --zone=us-central1-a \
  --snapshot-names=my-snapshot

# Create a snapshot with a storage location
gcloud compute disks snapshot my-disk --zone=us-central1-a \
  --snapshot-names=my-snapshot --storage-location=us

# List snapshots
gcloud compute snapshots list

# Create a disk from a snapshot
gcloud compute disks create new-disk --source-snapshot=my-snapshot \
  --zone=us-east1-b

# Create a VM from a snapshot (create disk first, then VM)
gcloud compute instances create my-new-vm --zone=us-east1-b \
  --disk=name=new-disk,boot=yes
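
As a mental model of the incremental behavior described above, the first snapshot stores the full disk and each later snapshot stores only the changed blocks. The sketch below is an illustration with assumed numbers; real snapshot sizes also depend on compression and block granularity:

```python
def snapshot_chain_storage_gb(disk_gb, changed_gb_per_snapshot, snapshots):
    """Approximate total storage for a chain of incremental snapshots.

    The first snapshot is a full copy; each subsequent snapshot stores
    only the blocks that changed since the previous one.
    """
    if snapshots == 0:
        return 0
    return disk_gb + changed_gb_per_snapshot * (snapshots - 1)

# A 100 GB disk with ~2 GB of daily churn, snapshotted daily for a week:
print(snapshot_chain_storage_gb(100, 2, 7))  # -> 112
```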

Snapshot schedules automate recurring backups:

# Create a snapshot schedule (daily, retain 7 days)
gcloud compute resource-policies create snapshot-schedule my-schedule \
  --region=us-central1 \
  --max-retention-days=7 \
  --daily-schedule \
  --start-time=02:00

# Attach schedule to a disk
gcloud compute disks add-resource-policies my-disk \
  --resource-policies=my-schedule --zone=us-central1-a

Snapshot Type Description Use Case
Standard Point-in-time backup of a persistent disk Disaster recovery, migration
Archive Lower-cost storage for long-term retention Compliance, long-term backups
Instant Rapid restore for zonal persistent disks Fast recovery (same zone only)

Exam trap: Snapshots are global resources but can be restricted to a specific storage location (multi-region or region). When restoring a snapshot to a different zone or region, you create a new disk from the snapshot in the target location -- the snapshot itself is not moved.

Images

Image Type Description Example
Public images Google-provided OS images debian-cloud/debian-12, ubuntu-os-cloud/ubuntu-2404-lts-amd64
Custom images Images you create from disks, snapshots, or other images my-project/my-custom-image
Image families Group of related images; always points to the latest non-deprecated image debian-12, ubuntu-2404-lts

# Create a custom image from a disk (stop VM first for consistency)
gcloud compute images create my-image --source-disk=my-disk \
  --source-disk-zone=us-central1-a --family=my-app-images

# Create a custom image from a snapshot
gcloud compute images create my-image --source-snapshot=my-snapshot

# Create a VM from the latest image in a family
gcloud compute instances create my-vm \
  --image-family=my-app-images --image-project=my-project

# Deprecate an image
gcloud compute images deprecate old-image --state=DEPRECATED \
  --replacement=new-image

Exam trap: When you specify --image-family, Compute Engine uses the most recent non-deprecated image in that family. This is the recommended approach for automation -- new deployments automatically pick up updated images without changing scripts.
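
The family-resolution rule in that trap can be sketched as follows (hypothetical image records, not an actual API call):

```python
def latest_from_family(images):
    """Pick the newest non-deprecated image, as --image-family does."""
    active = [img for img in images if img.get("state") != "DEPRECATED"]
    return max(active, key=lambda img: img["created"])["name"]

family = [
    {"name": "my-app-v1", "created": "2026-01-01", "state": "ACTIVE"},
    {"name": "my-app-v2", "created": "2026-02-01", "state": "ACTIVE"},
    {"name": "my-app-v3", "created": "2026-03-01", "state": "DEPRECATED"},
]
print(latest_from_family(family))  # -> my-app-v2 (v3 is deprecated)
```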


4.2 Managing GKE Resources

Cluster and Node Pool Management

# View cluster details
gcloud container clusters describe my-cluster --zone=us-central1-a

# List clusters
gcloud container clusters list

# Get credentials for kubectl
gcloud container clusters get-credentials my-cluster --zone=us-central1-a

# Resize a node pool (manual scaling)
gcloud container clusters resize my-cluster --node-pool=default-pool \
  --num-nodes=5 --zone=us-central1-a

# Add a node pool
gcloud container node-pools create new-pool --cluster=my-cluster \
  --zone=us-central1-a --machine-type=e2-standard-4 --num-nodes=3

# Delete a node pool
gcloud container node-pools delete old-pool --cluster=my-cluster \
  --zone=us-central1-a

Cluster modes determine how nodes are managed:

Mode Node Management Billing Use Case
Standard You manage node pools, upgrades, scaling Per node (VM pricing) Full control, custom configurations
Autopilot Google manages nodes; you define pods Per pod resource request Hands-off operations, optimized costs

Exam trap: Autopilot clusters do not expose node pools for direct management. You cannot SSH into Autopilot nodes. If a question involves node-level configuration (custom kernel settings, specific machine types per node), Standard mode is required.

Artifact Registry

Artifact Registry is the recommended container image and package repository for GKE.

# Configure Docker authentication to Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# Tag and push an image
docker tag my-app us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-app:v1
docker push us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-app:v1

# List images in a repository
gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/my-repo

GKE nodes pull images using the node service account, which needs the roles/artifactregistry.reader role on the Artifact Registry repository (or project). Note that image pulls are authenticated by the node service account, not by Workload Identity -- Workload Identity governs how pods call Google APIs at runtime.

Managing Kubernetes Workloads

Key kubectl commands for the exam:

# Pods
kubectl get pods --all-namespaces
kubectl describe pod POD_NAME
kubectl logs POD_NAME
kubectl logs POD_NAME --container=CONTAINER_NAME  # multi-container pod
kubectl delete pod POD_NAME
kubectl exec -it POD_NAME -- /bin/bash

# Deployments
kubectl get deployments
kubectl describe deployment DEPLOYMENT_NAME
kubectl scale deployment DEPLOYMENT_NAME --replicas=5
kubectl rollout status deployment DEPLOYMENT_NAME
kubectl rollout undo deployment DEPLOYMENT_NAME        # rollback
kubectl rollout history deployment DEPLOYMENT_NAME

# Services
kubectl get services
kubectl expose deployment DEPLOYMENT_NAME --type=LoadBalancer --port=80

# StatefulSets
kubectl get statefulsets
kubectl scale statefulset STATEFULSET_NAME --replicas=3

Workload Type Use Case Key Characteristic
Deployment Stateless applications Rolling updates, easy scaling, interchangeable pods
StatefulSet Stateful applications (databases) Stable network IDs, persistent storage, ordered deployment
DaemonSet One pod per node (monitoring agents) Automatically runs on every node
Job One-time tasks Runs to completion, then stops
CronJob Scheduled tasks Runs jobs on a cron schedule

Exam trap: StatefulSets provide stable, persistent storage per pod and ordered scaling (pods are created/deleted sequentially: pod-0, pod-1, pod-2). Deployments do NOT guarantee pod ordering or stable storage. If a question involves a database workload needing persistent identity, StatefulSet is the answer.
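
The stable, ordered identities described in that trap can be illustrated with the StatefulSet naming convention (scale-down removes the highest ordinal first):

```python
def statefulset_pod_names(name, replicas):
    """StatefulSet pods get stable ordinal names: name-0, name-1, ..."""
    return [f"{name}-{i}" for i in range(replicas)]

print(statefulset_pod_names("db", 3))  # -> ['db-0', 'db-1', 'db-2']
# Scaling down to 2 removes the highest ordinal, db-2; db-0 and db-1 keep
# their identities and any bound PersistentVolumeClaims.
```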

Autoscaling in GKE

Three autoscaling mechanisms work at different levels:

Autoscaler Level What It Scales Based On
Horizontal Pod Autoscaler (HPA) Pod Number of pod replicas CPU utilization, memory, or custom metrics
Vertical Pod Autoscaler (VPA) Pod CPU/memory requests per pod Historical resource usage
Cluster Autoscaler (CA) Node Number of nodes in a node pool Pending pods that cannot be scheduled

# Enable HPA for a deployment (target 70% CPU)
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70

# View HPA status
kubectl get hpa

# Enable cluster autoscaler on a node pool
gcloud container clusters update my-cluster --zone=us-central1-a \
  --enable-autoscaling --min-nodes=1 --max-nodes=10 \
  --node-pool=default-pool

Exam trap: HPA and VPA should generally not be used together on the same metric (e.g., both targeting CPU). HPA changes replica count; VPA changes resource requests. Conflicting signals can cause instability. The exam may test whether you know this conflict exists.

Exam trap: Cluster Autoscaler only adds nodes when pods are pending due to insufficient resources. It does not scale based on CPU utilization of existing nodes. If all pods fit on existing nodes, the autoscaler does not add more -- even if nodes are highly utilized.
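
A toy model of the Cluster Autoscaler decision in that trap (assumed numbers; real scheduling also weighs memory, taints, and disruption constraints):

```python
import math

def nodes_to_add(pending_pod_cpus, node_cpus, node_utilization):
    """Scale up only for pending (unschedulable) pods.

    Utilization of existing nodes is deliberately ignored: if every pod
    fits, the autoscaler adds nothing, however busy the nodes are.
    """
    if not pending_pod_cpus:
        return 0  # no pending pods -> no scale-up, even at high utilization
    return math.ceil(sum(pending_pod_cpus) / node_cpus)

print(nodes_to_add([], node_cpus=4, node_utilization=0.95))        # -> 0
print(nodes_to_add([2, 2, 1], node_cpus=4, node_utilization=0.5))  # -> 2
```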


4.3 Managing Cloud Run Resources

Deploying and Managing Revisions

Every deployment to Cloud Run creates a new immutable revision. Revisions are point-in-time snapshots of your service configuration and container image.

# Deploy a new revision
gcloud run deploy my-service --image=us-central1-docker.pkg.dev/PROJECT/repo/app:v2 \
  --region=us-central1

# Deploy with specific configuration
gcloud run deploy my-service \
  --image=us-central1-docker.pkg.dev/PROJECT/repo/app:v2 \
  --region=us-central1 \
  --memory=512Mi --cpu=1 \
  --min-instances=1 --max-instances=10 \
  --concurrency=80 \
  --set-env-vars=DB_HOST=10.0.0.1

# List revisions
gcloud run revisions list --service=my-service --region=us-central1

# Describe a specific revision
gcloud run revisions describe my-service-00005-abc --region=us-central1

Traffic Splitting

Traffic splitting routes percentages of traffic to different revisions. This enables canary deployments and gradual rollouts.

# Route all traffic to the latest revision
gcloud run services update-traffic my-service --to-latest --region=us-central1

# Split traffic: 90% to current, 10% to new revision
gcloud run services update-traffic my-service \
  --to-revisions=my-service-00004-xyz=90,my-service-00005-abc=10 \
  --region=us-central1

# Roll back: send all traffic to a previous revision
gcloud run services update-traffic my-service \
  --to-revisions=my-service-00004-xyz=100 --region=us-central1

Exam trap: Setting --to-latest routes 100% of traffic to the latest ready revision and keeps following it: every new deployment immediately receives all traffic. For controlled rollouts, assign traffic percentages to specific revision names instead of using --to-latest.

Scaling Parameters

Parameter Flag Description Default
Min instances --min-instances Minimum instances always running (avoids cold starts) 0
Max instances --max-instances Maximum instances to scale to 100
Concurrency --concurrency Max concurrent requests per instance 80
CPU allocation --cpu-throttling / --no-cpu-throttling CPU allocated only during requests vs. always Throttled (during requests only)

Setting --min-instances=1 or higher eliminates cold starts but incurs charges even when idle. Setting --no-cpu-throttling (CPU always allocated) enables background processing but increases cost.

Exam trap: Cloud Run scales to zero by default (--min-instances=0). This means there may be cold start latency on the first request. If a question describes latency-sensitive workloads, setting min instances above zero is the correct approach.
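
A simplified capacity model for these flags, using the defaults from the table above (illustration only; Cloud Run's actual autoscaler also reacts to CPU utilization and startup latency):

```python
import math

def cloud_run_instances(concurrent_requests, concurrency=80,
                        min_instances=0, max_instances=100):
    """Instances needed to serve a given number of concurrent requests,
    clamped to the configured min/max bounds."""
    needed = math.ceil(concurrent_requests / concurrency) if concurrent_requests else 0
    return max(min_instances, min(needed, max_instances))

print(cloud_run_instances(800))                 # -> 10
print(cloud_run_instances(0, min_instances=1))  # -> 1 (warm instance, no cold start)
print(cloud_run_instances(100_000))             # -> 100 (capped at max)
```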


4.4 Managing Storage and Database Solutions

Cloud Storage Lifecycle Management

Object lifecycle management applies automatic actions to objects based on configurable conditions. Rules are set at the bucket level. An object must match ALL conditions in a rule for the action to trigger.

Action Description
Delete Remove objects matching conditions. Deleted objects become soft-deleted (recoverable for 7 days by default).
SetStorageClass Transition objects to a different storage class. Counts as a Class A operation but avoids retrieval fees.
AbortIncompleteMultipartUpload Clean up abandoned multipart uploads.
Condition Description
age Days since object creation
createdBefore Objects created before a UTC date
numNewerVersions For versioned buckets: applies when N newer versions exist
isLive true = current version; false = noncurrent (versioned buckets)
matchesStorageClass Filter by current storage class
daysSinceNoncurrentTime Days since object became noncurrent
matchesPrefix / matchesSuffix Pattern matching on object names

# View lifecycle configuration
gcloud storage buckets describe gs://my-bucket --format="json(lifecycle)"

# Set lifecycle from a JSON file
gcloud storage buckets update gs://my-bucket --lifecycle-file=lifecycle.json

Example lifecycle JSON to transition to Coldline after 90 days and delete after 365 days:

{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90, "matchesStorageClass": ["STANDARD"]}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}

Exam trap: Lifecycle changes can take up to 24 hours to take effect. This is a processing delay, not a real-time operation. Also, lifecycle rules can only transition objects to a colder storage class (Standard -> Nearline -> Coldline -> Archive). You cannot use lifecycle rules to move objects to a warmer class.
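
The "ALL conditions must match" semantics can be sketched against the example rules above (a toy evaluator covering only the two conditions used in this section, not the full condition set):

```python
def rule_matches(obj, condition):
    """True only if the object satisfies every condition in the rule."""
    checks = {
        "age": lambda days: obj["age_days"] >= days,
        "matchesStorageClass": lambda classes: obj["storage_class"] in classes,
    }
    return all(checks[key](value) for key, value in condition.items())

obj = {"age_days": 120, "storage_class": "STANDARD"}
# 120-day-old Standard object: transition rule applies, delete rule does not.
print(rule_matches(obj, {"age": 90, "matchesStorageClass": ["STANDARD"]}))  # -> True
print(rule_matches(obj, {"age": 365}))                                      # -> False
```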

Object Versioning

When versioning is enabled, overwriting or deleting an object creates a noncurrent version rather than permanently removing it.

# Enable versioning
gcloud storage buckets update gs://my-bucket --versioning

# Disable versioning
gcloud storage buckets update gs://my-bucket --no-versioning

# List object versions
gcloud storage ls --all-versions gs://my-bucket/my-object

# Restore a noncurrent version (copy it to make it current)
gcloud storage cp gs://my-bucket/my-object#GENERATION gs://my-bucket/my-object

Exam trap: Disabling versioning does NOT delete existing noncurrent versions. They remain (and incur storage charges) until explicitly deleted or removed by a lifecycle rule. Use a lifecycle rule with numNewerVersions to manage noncurrent version cleanup.

Storage Class Transitions

Storage Class Minimum Duration Access Pattern Relative Cost (Storage) Relative Cost (Retrieval)
Standard None Frequent access Highest Lowest
Nearline 30 days Once per month Lower Higher
Coldline 90 days Once per quarter Lower still Higher still
Archive 365 days Once per year Lowest Highest

The minimum storage duration is a billing minimum. Objects deleted before the minimum duration are charged for the full minimum period. Access is always possible regardless of storage class -- classes differ only in pricing, not availability.
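
The billing minimum works out as a simple max (a sketch; actual bills prorate storage by the second within the charged period):

```python
def billable_days(stored_days, minimum_days):
    """Days charged for an object: early deletion still bills the minimum."""
    return max(stored_days, minimum_days)

# A Nearline object (30-day minimum) deleted after 10 days is billed for 30:
print(billable_days(10, 30))  # -> 30
# Kept past the minimum, you pay only for actual storage time:
print(billable_days(45, 30))  # -> 45
```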

Running Queries Across Database Services

Service Query Method Key Command / Interface
Cloud SQL SQL via client tools, Cloud SQL proxy, Console gcloud sql connect INSTANCE --user=root
BigQuery SQL via Console, bq, client libraries bq query --use_legacy_sql=false 'SELECT ...'
Cloud Spanner GoogleSQL or PostgreSQL dialect Console query editor, gcloud spanner databases execute-sql
Firestore Document queries via SDKs, Console Console Data viewer, client libraries
AlloyDB PostgreSQL-compatible SQL AlloyDB Auth Proxy + psql

BigQuery Dry Run (Cost Estimation)

A dry run validates a query and returns the estimated bytes processed without actually running it. This is essential for cost estimation before running expensive queries.

# Dry run from CLI
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `project.dataset.table`'

# Output shows: Query successfully validated. Estimated bytes processed: X

BigQuery charges based on bytes scanned. A dry run helps avoid unexpected costs on large tables. In the Console, the query validator shows estimated bytes before you click Run.

Exam trap: BigQuery on-demand pricing charges per TB scanned ($6.25/TB as of current pricing). A SELECT * on a 10 TB table costs approximately $62.50. The dry run flag is the correct answer when a question asks about estimating query cost before execution.
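
The arithmetic behind that trap, using the on-demand rate quoted above:

```python
def on_demand_query_cost(tb_scanned, price_per_tb=6.25):
    """Estimated on-demand cost for a query at the rate cited above."""
    return tb_scanned * price_per_tb

print(on_demand_query_cost(10))  # -> 62.5 (a SELECT * over 10 TB)
```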

Backups and Restores

Service Backup Method Restore Method
Cloud SQL Automated daily backups + on-demand Restore to same or new instance: gcloud sql backups restore BACKUP_ID --restore-instance=INSTANCE
Cloud Spanner On-demand or scheduled backups, configurable retention gcloud spanner backups create / gcloud spanner databases restore
AlloyDB Continuous backup (automatic) + on-demand Point-in-time recovery or backup restore
Firestore Managed export/import gcloud firestore export gs://bucket / gcloud firestore import gs://bucket/export

# Create an on-demand Cloud SQL backup
gcloud sql backups create --instance=my-instance

# List Cloud SQL backups
gcloud sql backups list --instance=my-instance

# Restore a Cloud SQL backup
gcloud sql backups restore BACKUP_ID --restore-instance=my-instance

Exam trap: Cloud SQL automated backups have a retention window (default 7 days, configurable up to 365 days). On-demand backups persist until you delete them. If a question asks about long-term backup retention, on-demand backups or export to Cloud Storage is the answer.

Monitoring Job Status

# Check BigQuery job status
bq show --job=true JOB_ID

# List recent BigQuery jobs
bq ls --jobs=true --max_results=10

# Check Dataflow job status
gcloud dataflow jobs list --region=us-central1
gcloud dataflow jobs describe JOB_ID --region=us-central1

# Cancel a Dataflow job
gcloud dataflow jobs cancel JOB_ID --region=us-central1

4.5 Managing Networking Resources

Subnets: Adding and Expanding

VPC networks contain subnets, and subnets can be expanded (but never shrunk).

# Add a subnet to an existing VPC
gcloud compute networks subnets create my-subnet \
  --network=my-vpc --region=us-central1 --range=10.0.1.0/24

# Expand a subnet's IP range (can only increase, never decrease)
gcloud compute networks subnets expand-ip-range my-subnet \
  --region=us-central1 --prefix-length=20

Exam trap: You can expand a subnet's CIDR range but you cannot shrink it. The new range must contain the original range. For example, expanding 10.0.1.0/24 to 10.0.0.0/20 is valid. This operation does not disrupt existing resources.
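
The containment rule can be checked locally with Python's ipaddress module before attempting the expansion:

```python
import ipaddress

def valid_expansion(current_cidr, proposed_cidr):
    """A subnet expansion is valid only if the new range is larger
    and fully contains the current range."""
    current = ipaddress.ip_network(current_cidr)
    proposed = ipaddress.ip_network(proposed_cidr)
    return proposed.prefixlen < current.prefixlen and current.subnet_of(proposed)

print(valid_expansion("10.0.1.0/24", "10.0.0.0/20"))  # -> True
print(valid_expansion("10.0.1.0/24", "10.1.0.0/20"))  # -> False (doesn't contain)
```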

Static IP Addresses

Type Scope Use Case Billing
External static Regional or global Public-facing services, load balancers Charged when reserved but NOT attached to a running resource
Internal static Regional Fixed internal addressing Free

# Reserve a regional external static IP
gcloud compute addresses create my-external-ip --region=us-central1

# Reserve a global external static IP (for global load balancers)
gcloud compute addresses create my-global-ip --global

# Reserve an internal static IP
gcloud compute addresses create my-internal-ip --region=us-central1 \
  --subnet=my-subnet --addresses=10.0.1.50

# List reserved addresses
gcloud compute addresses list

# Assign a reserved external IP to a VM
gcloud compute instances create my-vm --zone=us-central1-a \
  --address=my-external-ip

# Promote an ephemeral IP to static
gcloud compute addresses create my-ip --addresses=EPHEMERAL_IP \
  --region=us-central1

Exam trap: Reserved static external IPs that are NOT attached to a running resource incur charges. This is a common cost surprise. If a VM is stopped or deleted but its static IP is still reserved, you continue paying for the IP. The exam tests whether you know to release unused static IPs.

Cloud DNS

Cloud DNS is a managed authoritative DNS service.

Zone Type Visibility Use Case
Public Internet Route external traffic to your services
Private Specific VPC networks Internal DNS resolution within VPCs

# Create a public managed zone
gcloud dns managed-zones create my-zone \
  --dns-name="example.com." --description="My public zone"

# Create a private managed zone
gcloud dns managed-zones create my-private-zone \
  --dns-name="internal.example.com." --description="Internal zone" \
  --visibility=private --networks=my-vpc

# Add an A record
gcloud dns record-sets create www.example.com. --zone=my-zone \
  --type=A --ttl=300 --rrdatas="34.120.1.1"

# Add a CNAME record
gcloud dns record-sets create app.example.com. --zone=my-zone \
  --type=CNAME --ttl=300 --rrdatas="my-service.example.com."

# List records in a zone
gcloud dns record-sets list --zone=my-zone

Exam trap: DNS names in Cloud DNS must end with a trailing dot (e.g., example.com.). Forgetting the trailing dot is a common configuration error.

Cloud NAT

Cloud NAT provides outbound internet access for resources without external IP addresses. It operates at the VPC network level through a Cloud Router.

# Create a Cloud Router (required for Cloud NAT)
gcloud compute routers create my-router --network=my-vpc --region=us-central1

# Create a Cloud NAT gateway
gcloud compute routers nats create my-nat --router=my-router \
  --region=us-central1 --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges

Configuration Description
--auto-allocate-nat-external-ips Google automatically assigns external IPs
--nat-external-ip-pool=IP1,IP2 Use specific reserved static IPs for NAT
--nat-all-subnet-ip-ranges Apply NAT to all subnets in the region
--nat-custom-subnet-ip-ranges=SUBNET Apply NAT to specific subnets only

Exam trap: Cloud NAT is outbound only. It does not allow unsolicited inbound connections from the internet. If a question asks about allowing inbound traffic to VMs without external IPs, the answer is an internal load balancer or IAP, not Cloud NAT.


4.6 Monitoring and Logging

Cloud Monitoring Alerting Policies

Cloud Monitoring alerting policies define conditions that trigger notifications. An alerting policy has three components: the condition (what to monitor), the notification channel (how to alert), and the documentation (context for responders).

Alert condition types:

Condition Type Triggers When Duration
Metric threshold A metric value exceeds or falls below a threshold for a specified duration Configurable alignment period
Metric absence A monitored metric stops reporting data Up to 23.5 hours
Forecasted value A metric is predicted to breach a threshold 1-7 day prediction window
Log-based A specific log entry pattern is detected Immediate (rate-limited)
SQL-based A Log Analytics query returns matching results Evaluated on a schedule (Preview)

# Create a notification channel (email)
gcloud beta monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels=email_address=ops@example.com

# List alerting policies
gcloud alpha monitoring policies list

# Describe an alerting policy
gcloud alpha monitoring policies describe POLICY_ID

Key concepts for the exam:

  • Notification channels: Email, SMS, Slack, PagerDuty, Pub/Sub, webhooks, and mobile app
  • Uptime checks: HTTP(S) or TCP probes from global locations. Can trigger alerts on failure.
  • Snooze: Temporarily suppresses notifications without modifying the alert policy
  • Incidents: Created automatically when conditions are met; auto-close when conditions resolve

Exam trap: Log-based alerts and metric-based alerts are configured differently. Log-based alerts operate on log entries (Logs Explorer query syntax), while metric-based alerts operate on time-series data. You cannot use metric-threshold conditions to alert on specific log messages.

Custom Metrics

Custom metrics extend Cloud Monitoring beyond built-in GCP metrics. Create them via the Monitoring API or OpenTelemetry.

Custom metric types use the custom.googleapis.com/ prefix, for example custom.googleapis.com/my_app/request_latency. Metric descriptors are created through the Monitoring API (projects.metricDescriptors.create) or a client library, or are auto-created the first time a data point is written; there is no dedicated gcloud command for writing custom metric data.

Use custom metrics for application-specific KPIs (queue depth, business transactions, cache hit rate) that built-in metrics do not cover.

Log Sinks and Export

Log sinks route log entries to destinations for long-term storage, analysis, or integration with external tools. Every log entry is evaluated by all sinks in the resource; a sink routes entries that match its inclusion filter and do not match any exclusion filters.
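
That routing rule reduces to a small predicate. The sketch below uses plain Python functions as stand-ins for Logging query-language filters (the filter names and severity ladder are illustrative, not an API):

```python
def sink_routes_entry(entry, inclusion_filter, exclusion_filters=()):
    """A sink routes an entry that matches its inclusion filter and
    matches none of its exclusion filters."""
    return inclusion_filter(entry) and not any(f(entry) for f in exclusion_filters)

SEVERITIES = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
is_error = lambda e: SEVERITIES.index(e["severity"]) >= SEVERITIES.index("ERROR")
is_healthcheck = lambda e: e.get("path") == "/healthz"

print(sink_routes_entry({"severity": "ERROR"}, is_error, [is_healthcheck]))  # -> True
print(sink_routes_entry({"severity": "ERROR", "path": "/healthz"},
                        is_error, [is_healthcheck]))                          # -> False
```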

Supported destinations:

Destination Use Case Format
Log buckets Retention in Cloud Logging, Logs Explorer, Log Analytics Structured
BigQuery SQL analysis, joining with business data Streaming inserts
Cloud Storage Long-term archival, compliance JSON files (batched hourly)
Pub/Sub Streaming to external tools (Splunk, Datadog) JSON messages
Google Cloud project Cross-project log routing Re-routed through the destination project's sinks

# Create a sink that exports to Cloud Storage
gcloud logging sinks create my-storage-sink \
  storage.googleapis.com/my-log-bucket \
  --log-filter='resource.type="gce_instance" AND severity>=ERROR'

# Create a sink that exports to BigQuery
gcloud logging sinks create my-bq-sink \
  bigquery.googleapis.com/projects/PROJECT/datasets/my_logs \
  --log-filter='resource.type="gce_instance"'

# Create a sink that exports to Pub/Sub
gcloud logging sinks create my-pubsub-sink \
  pubsub.googleapis.com/projects/PROJECT/topics/my-topic \
  --log-filter='logName="projects/PROJECT/logs/my-app"'

# List sinks
gcloud logging sinks list

# Update a sink filter
gcloud logging sinks update my-storage-sink \
  --log-filter='resource.type="gce_instance" AND severity>=WARNING'

System-created sinks (cannot be deleted):

Sink Destination Modifiable?
_Required _Required log bucket No. Routes Admin Activity, System Event, and Access Transparency logs.
_Default _Default log bucket Yes. Routes everything else. Can add exclusion filters to reduce costs.

Exam trap: After creating a sink, you must grant the sink's service account writer permissions on the destination. The sink creation output displays the service account. For BigQuery, grant roles/bigquery.dataEditor; for Cloud Storage, grant roles/storage.objectCreator; for Pub/Sub, grant roles/pubsub.publisher.

Organization and folder sinks:

  • Non-intercepting (default): Routes matching logs but lets child resource sinks also process them
  • Intercepting: Blocks log entries from flowing to child resource sinks (except _Required). Useful for centralized logging where you do not want projects to also retain logs.

Log Buckets, Analytics, and Routers

Concept Description
Log bucket Storage container for log entries in Cloud Logging. Each project has _Required (400-day retention, immutable) and _Default (30-day default retention, modifiable).
Custom log bucket User-created bucket with configurable retention (1-3650 days). Can enable Log Analytics for SQL querying.
Log Analytics Enables BigQuery-compatible SQL queries directly on log buckets without exporting. Requires upgrading the bucket.
Log router The pipeline that evaluates all sinks against incoming log entries. Processes every entry, applies inclusion/exclusion filters, and routes to matched destinations.

# Create a custom log bucket with 90-day retention
gcloud logging buckets create my-custom-bucket \
  --location=us-central1 --retention-days=90

# Enable Log Analytics on a bucket
gcloud logging buckets update my-custom-bucket \
  --location=us-central1 --enable-analytics

# Update the _Default bucket retention to 90 days
gcloud logging buckets update _Default --location=global --retention-days=90

Exam trap: The _Required bucket has a fixed 400-day retention that cannot be changed. The _Default bucket has a 30-day default retention that CAN be customized. If a question asks about retaining Admin Activity audit logs for more than 400 days, you need to create a sink to export them to Cloud Storage or BigQuery.

Logs Explorer Filtering

Logs Explorer uses the Cloud Logging query language to filter log entries.

Common filter patterns for the exam:

# Filter by resource type
resource.type="gce_instance"

# Filter by severity
severity>=ERROR

# Filter by log name
logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"

# Filter by text payload
textPayload:"error"

# Filter by JSON payload field
jsonPayload.status=500

# Combine conditions (AND is implicit between lines)
resource.type="gce_instance"
severity>=WARNING
resource.labels.instance_id="1234567890"

# Time range
timestamp>="2026-02-25T00:00:00Z"
timestamp<"2026-02-26T00:00:00Z"

Exam trap: In Logs Explorer, separate lines are implicitly joined with AND. To use OR, you must write it explicitly: severity=ERROR OR severity=CRITICAL. Newline-separated conditions are always AND.
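
The implicit-AND behavior amounts to collapsing the lines into one expression (an illustration of the semantics, not the actual parser):

```python
def combine_filter_lines(lines):
    """Newline-separated Logs Explorer conditions are implicitly ANDed."""
    return " AND ".join(line.strip() for line in lines if line.strip())

query = combine_filter_lines([
    'resource.type="gce_instance"',
    'severity>=WARNING',
])
print(query)  # -> resource.type="gce_instance" AND severity>=WARNING
```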

Ops Agent and Managed Service for Prometheus

Ops Agent is the unified agent for collecting both metrics and logs from Compute Engine VMs. It replaces the legacy Monitoring and Logging agents.

# Install Ops Agent on a VM (run from within the VM)
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Verify the agent is running
sudo systemctl status google-cloud-ops-agent

Agent Collects Status
Ops Agent Metrics + logs (unified) Current, recommended
Legacy Monitoring Agent Metrics only Deprecated
Legacy Logging Agent Logs only Deprecated

Managed Service for Prometheus provides a fully managed, multi-cloud Prometheus-compatible monitoring solution. It stores metrics in Cloud Monitoring and supports PromQL queries.

  • Collects Prometheus metrics from GKE workloads and Compute Engine
  • Data stored in Monarch (Google's global monitoring backend)
  • Query using PromQL in Cloud Monitoring or Grafana
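
As an illustration, a standard PromQL expression you might run in the Cloud Monitoring PromQL tab or Grafana (the metric name is a common Kubernetes/cAdvisor metric, used here only as an example):

```
# Per-container CPU usage rate over the last 5 minutes (illustrative metric)
rate(container_cpu_usage_seconds_total[5m])
```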

Exam trap: The Ops Agent must be installed on every VM you want to monitor. It is NOT installed by default. GKE nodes use a different mechanism (built-in integration with Cloud Monitoring). If a question asks about missing VM metrics/logs, checking whether the Ops Agent is installed is the first troubleshooting step.

Audit Logs

Cloud Audit Logs record administrative activities and data access for Google Cloud resources. There are four types:

| Log Type | What It Records | Enabled by Default | Can Be Disabled | Retention (bucket) |
|---|---|---|---|---|
| Admin Activity | Configuration changes (create/delete/update resources, set IAM policies) | Yes | No | 400 days (_Required) |
| Data Access | API calls that read resource configuration/metadata or read/write user data | No (except BigQuery) | Yes | 30 days (_Default, configurable) |
| System Event | Google-initiated system actions (auto-healing, live migration) | Yes | No | 400 days (_Required) |
| Policy Denied | Access denied due to VPC Service Controls or Organization Policy violations | Yes | No (but can be excluded from sinks) | 30 days (_Default, configurable) |

```shell
# View Admin Activity logs
gcloud logging read 'logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"' \
  --limit=10

# View Data Access logs
gcloud logging read 'logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Fdata_access"' \
  --limit=10

# Enable Data Access logs for all services (project level)
gcloud projects get-iam-policy PROJECT_ID --format=json > policy.json
# Edit policy.json to add auditConfigs, then:
gcloud projects set-iam-policy PROJECT_ID policy.json
```

Data Access log configuration uses auditConfigs in the IAM policy:

```json
{
  "auditConfigs": [
    {
      "service": "allServices",
      "auditLogConfigs": [
        { "logType": "ADMIN_READ" },
        { "logType": "DATA_READ" },
        { "logType": "DATA_WRITE" }
      ]
    }
  ]
}

IAM roles for viewing audit logs:

| Log Type | Required Role |
|---|---|
| Admin Activity, System Event | roles/logging.viewer (Logs Viewer) |
| Data Access, Policy Denied | roles/logging.privateLogViewer (Private Logs Viewer) |

Exam trap: Data Access logs are disabled by default for most services (BigQuery is the notable exception where they are always on). They can generate very high volume and cost. If a question asks about tracking who read specific data, you need to enable Data Access audit logs for that service first.

Exam trap: Admin Activity and System Event logs are stored in the _Required bucket with 400-day retention that cannot be modified. Data Access and Policy Denied logs go to the _Default bucket with a configurable retention (default 30 days). To retain any logs beyond their bucket retention, export via a sink.


Quick-Reference: Decision Tree

| Scenario | Correct Approach |
|---|---|
| Need to change a VM's machine type | Stop the VM, then set-machine-type |
| VM has no external IP but needs outbound internet | Configure Cloud NAT on the VPC |
| Need to back up a Compute Engine disk | Create a snapshot (incremental by default) |
| Deploy a new version of a Cloud Run service gradually | Traffic splitting between revisions |
| GKE pods failing to schedule (insufficient resources) | Cluster Autoscaler adds nodes automatically |
| Cloud Run cold start latency is too high | Set --min-instances=1 or higher |
| Move Cloud Storage objects to cheaper storage after 90 days | Lifecycle rule with age condition and SetStorageClass action |
| Estimate BigQuery query cost before running | Use --dry_run flag |
| Need VM metrics/logs in Cloud Monitoring | Install the Ops Agent |
| Export logs to Splunk | Create a Pub/Sub sink, connect Splunk to the Pub/Sub subscription |
| Retain Admin Activity logs beyond 400 days | Create a sink to Cloud Storage or BigQuery |
| Track who accessed sensitive data in Cloud Storage | Enable Data Access audit logs for Cloud Storage |
| Alert when a specific error appears in logs | Create a log-based alerting policy |
| Alert when CPU exceeds 80% for 5 minutes | Create a metric-threshold alerting policy |
| Reduce Cloud Logging costs | Add exclusion filters to the _Default sink |
| VMs without external IPs need to pull packages | Cloud NAT for outbound internet access |
| Need DNS resolution only within VPCs | Private Cloud DNS zone |
| Reserve an IP for a load balancer | gcloud compute addresses create --global |

Common Exam Traps Summary

  1. Deleting a VM deletes the boot disk by default -- non-boot disks are preserved unless --delete-disks=all is specified.
  2. Machine type changes require stopping the VM -- you cannot resize CPU/memory on a running instance.
  3. Snapshots are incremental -- only changed blocks are stored after the first snapshot. Deleting an earlier snapshot does not lose data; the incremental chain is automatically reconciled.
  4. Image families always point to the latest non-deprecated image -- use families in automation for automatic updates.
  5. HPA and VPA conflict on the same metric -- do not use both to scale on CPU simultaneously.
  6. Cluster Autoscaler scales on pending pods, not node CPU -- high node utilization alone does not trigger scaling.
  7. Cloud Run --to-latest sends all traffic to new deployments immediately -- for canary deployments, route traffic to specific revision names.
  8. Lifecycle rule changes take up to 24 hours -- not real-time.
  9. Lifecycle transitions are one-direction only -- Standard to Nearline to Coldline to Archive. You cannot use lifecycle rules to warm up objects.
  10. Disabling versioning does not delete noncurrent versions -- they persist until explicitly removed.
  11. Reserved static IPs incur charges when unattached -- release IPs you are not using.
  12. Subnet IP ranges can be expanded but never shrunk -- plan CIDR ranges carefully.
  13. Cloud NAT is outbound only -- it does not provide inbound connectivity.
  14. Ops Agent is not installed by default on VMs -- missing agent is the top reason for absent VM metrics.
  15. Data Access audit logs are disabled by default (except BigQuery) -- enable them explicitly to track data reads/writes.
  16. _Required bucket: 400-day retention, immutable. _Default bucket: 30-day default, configurable -- export to Cloud Storage or BigQuery for longer retention.
  17. Sink service accounts need destination permissions -- the most common cause of sink failures is missing IAM grants on the destination resource.
  18. Logs Explorer lines are implicitly AND -- use explicit OR for disjunctive filters.

References