Reference

Domain 4: Ensuring Successful Operation of a Cloud Solution (~20%)

Domain 4 accounts for approximately 20% of the Associate Cloud Engineer exam, translating to roughly 10-12 questions. This is one of the most operationally focused domains: you must know how to manage running resources, troubleshoot issues, configure monitoring and logging, and handle day-to-day operational tasks across Compute Engine, GKE, Cloud Run, storage, databases, and networking. The exam tests six sub-domains with heavy emphasis on gcloud commands and Console workflows.


4.1 Managing Compute Engine Resources

Starting, Stopping, and Deleting Instances

Compute Engine VMs have distinct lifecycle states that determine billing and behavior.

Action Command Billing Persistent Disk
Start gcloud compute instances start INSTANCE Resumes full billing Preserved
Stop gcloud compute instances stop INSTANCE No CPU/memory charge; disk charges continue Preserved
Suspend gcloud compute instances suspend INSTANCE No CPU/memory charge; charges for suspended state memory + disk Preserved + memory saved
Delete gcloud compute instances delete INSTANCE All billing stops Boot disk deleted by default; non-boot disks preserved unless --delete-disks=all
Reset gcloud compute instances reset INSTANCE Continues Preserved (hard reset, no graceful shutdown)

# Stop an instance
gcloud compute instances stop my-vm --zone=us-central1-a

# Start an instance
gcloud compute instances start my-vm --zone=us-central1-a

# Delete an instance but keep all disks
gcloud compute instances delete my-vm --zone=us-central1-a --keep-disks=all

# Delete an instance and all attached disks
gcloud compute instances delete my-vm --zone=us-central1-a --delete-disks=all

Exam trap: When you delete a VM, the boot disk is deleted by default. Non-boot (additional) disks are NOT deleted by default. If a question asks about preserving data after VM deletion, you need --keep-disks=all or detach the disk first.

Editing VM Configuration

Some properties can be changed while the VM is running; others require a stop first.

Property Requires Stop? Command
Machine type Yes gcloud compute instances set-machine-type INSTANCE --machine-type=e2-standard-4
Labels No gcloud compute instances update INSTANCE --update-labels=env=prod
Metadata No gcloud compute instances add-metadata INSTANCE --metadata=key=value
Tags (network) No gcloud compute instances add-tags INSTANCE --tags=http-server
Service account Yes Must stop, then use gcloud compute instances set-service-account
Attached disks No (attach); Yes (detach boot) gcloud compute instances attach-disk INSTANCE --disk=DISK_NAME

Exam trap: Changing the machine type requires stopping the instance first. You cannot resize a running VM's CPU/memory. This is a frequent exam question.

SSH and RDP Connections

# SSH into a Linux VM (uses IAP tunnel by default if no external IP)
gcloud compute ssh my-vm --zone=us-central1-a

# SSH through IAP explicitly
gcloud compute ssh my-vm --zone=us-central1-a --tunnel-through-iap

# SSH with a specific user
gcloud compute ssh user@my-vm --zone=us-central1-a

# Create an RDP tunnel for Windows VMs
gcloud compute start-iap-tunnel my-windows-vm 3389 \
  --local-host-port=localhost:3389 --zone=us-central1-a

OS Login is Google's recommended method for managing SSH access. When enabled, SSH keys are managed centrally through IAM rather than per-instance metadata.

Feature Metadata SSH Keys OS Login
Key management Per-instance or project-level metadata Centralized via Cloud Identity/Workspace
IAM integration None Full: roles/compute.osLogin (standard), roles/compute.osAdminLogin (sudo)
2FA support No Yes (with roles/compute.osLogin + 2FA configured)
Audit Limited Full audit via Cloud Audit Logs

# Enable OS Login at project level
gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE

# Enable OS Login on a specific instance
gcloud compute instances add-metadata my-vm --metadata enable-oslogin=TRUE

Exam trap: OS Login overrides metadata-based SSH keys when enabled. If a user cannot SSH after OS Login is enabled, they need roles/compute.osLogin (or roles/compute.osAdminLogin for sudo access) on the project or instance.

Snapshots

Snapshots are incremental backups of persistent disks. After the first full snapshot, subsequent snapshots only store changed blocks.

# Create a snapshot
gcloud compute disks snapshot my-disk --zone=us-central1-a \
  --snapshot-names=my-snapshot

# Create a snapshot with a storage location
gcloud compute disks snapshot my-disk --zone=us-central1-a \
  --snapshot-names=my-snapshot --storage-location=us

# List snapshots
gcloud compute snapshots list

# Create a disk from a snapshot
gcloud compute disks create new-disk --source-snapshot=my-snapshot \
  --zone=us-east1-b

# Create a VM from a snapshot (create disk first, then VM)
gcloud compute instances create my-new-vm --zone=us-east1-b \
  --disk=name=new-disk,boot=yes
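
As a mental model of the incremental behavior described above, the first snapshot stores the full disk and each later snapshot stores only the changed blocks. The sketch below is an illustration with assumed numbers; real snapshot sizes also depend on compression and block granularity:

```python
def snapshot_chain_storage_gb(disk_gb, changed_gb_per_snapshot, snapshots):
    """Approximate total storage for a chain of incremental snapshots.

    The first snapshot is a full copy; each subsequent snapshot stores
    only the blocks that changed since the previous one.
    """
    if snapshots == 0:
        return 0
    return disk_gb + changed_gb_per_snapshot * (snapshots - 1)

# A 100 GB disk with ~2 GB of daily churn, snapshotted daily for a week:
print(snapshot_chain_storage_gb(100, 2, 7))  # -> 112
```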

Snapshot schedules automate recurring backups:

# Create a snapshot schedule (daily, retain 7 days)
gcloud compute resource-policies create snapshot-schedule my-schedule \
  --region=us-central1 \
  --max-retention-days=7 \
  --daily-schedule \
  --start-time=02:00

# Attach schedule to a disk
gcloud compute disks add-resource-policies my-disk \
  --resource-policies=my-schedule --zone=us-central1-a

Snapshot Type Description Use Case
Standard Point-in-time backup of a persistent disk Disaster recovery, migration
Archive Lower-cost storage for long-term retention Compliance, long-term backups
Instant Rapid restore for zonal persistent disks Fast recovery (same zone only)

Exam trap: Snapshots are global resources but can be restricted to a specific storage location (multi-region or region). When restoring a snapshot to a different zone or region, you create a new disk from the snapshot in the target location -- the snapshot itself is not moved.

Images

Image Type Description Example
Public images Google-provided OS images debian-cloud/debian-12, ubuntu-os-cloud/ubuntu-2404-lts-amd64
Custom images Images you create from disks, snapshots, or other images my-project/my-custom-image
Image families Group of related images; always points to the latest non-deprecated image debian-12, ubuntu-2404-lts

# Create a custom image from a disk (stop VM first for consistency)
gcloud compute images create my-image --source-disk=my-disk \
  --source-disk-zone=us-central1-a --family=my-app-images

# Create a custom image from a snapshot
gcloud compute images create my-image --source-snapshot=my-snapshot

# Create a VM from the latest image in a family
gcloud compute instances create my-vm \
  --image-family=my-app-images --image-project=my-project

# Deprecate an image
gcloud compute images deprecate old-image --state=DEPRECATED \
  --replacement=new-image

Exam trap: When you specify --image-family, Compute Engine uses the most recent non-deprecated image in that family. This is the recommended approach for automation -- new deployments automatically pick up updated images without changing scripts.
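
The family-resolution rule in that trap can be sketched as follows (hypothetical image records, not an actual API call):

```python
def latest_from_family(images):
    """Pick the newest non-deprecated image, as --image-family does."""
    active = [img for img in images if img.get("state") != "DEPRECATED"]
    return max(active, key=lambda img: img["created"])["name"]

family = [
    {"name": "my-app-v1", "created": "2026-01-01", "state": "ACTIVE"},
    {"name": "my-app-v2", "created": "2026-02-01", "state": "ACTIVE"},
    {"name": "my-app-v3", "created": "2026-03-01", "state": "DEPRECATED"},
]
print(latest_from_family(family))  # -> my-app-v2 (v3 is deprecated)
```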


4.2 Managing GKE Resources

Cluster and Node Pool Management

# View cluster details
gcloud container clusters describe my-cluster --zone=us-central1-a

# List clusters
gcloud container clusters list

# Get credentials for kubectl
gcloud container clusters get-credentials my-cluster --zone=us-central1-a

# Resize a node pool (manual scaling)
gcloud container clusters resize my-cluster --node-pool=default-pool \
  --num-nodes=5 --zone=us-central1-a

# Add a node pool
gcloud container node-pools create new-pool --cluster=my-cluster \
  --zone=us-central1-a --machine-type=e2-standard-4 --num-nodes=3

# Delete a node pool
gcloud container node-pools delete old-pool --cluster=my-cluster \
  --zone=us-central1-a

Cluster modes determine how nodes are managed:

Mode Node Management Billing Use Case
Standard You manage node pools, upgrades, scaling Per node (VM pricing) Full control, custom configurations
Autopilot Google manages nodes; you define pods Per pod resource request Hands-off operations, optimized costs

Exam trap: Autopilot clusters do not expose node pools for direct management. You cannot SSH into Autopilot nodes. If a question involves node-level configuration (custom kernel settings, specific machine types per node), Standard mode is required.

Artifact Registry

Artifact Registry is the recommended container image and package repository for GKE.

# Configure Docker authentication to Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# Tag and push an image
docker tag my-app us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-app:v1
docker push us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-app:v1

# List images in a repository
gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/my-repo

GKE nodes pull images using the node service account, which needs the roles/artifactregistry.reader role on the Artifact Registry repository (or project). Note that image pulls are authenticated by the node service account, not by Workload Identity -- Workload Identity governs how pods call Google APIs at runtime.

Managing Kubernetes Workloads

Key kubectl commands for the exam:

# Pods
kubectl get pods --all-namespaces
kubectl describe pod POD_NAME
kubectl logs POD_NAME
kubectl logs POD_NAME --container=CONTAINER_NAME  # multi-container pod
kubectl delete pod POD_NAME
kubectl exec -it POD_NAME -- /bin/bash

# Deployments
kubectl get deployments
kubectl describe deployment DEPLOYMENT_NAME
kubectl scale deployment DEPLOYMENT_NAME --replicas=5
kubectl rollout status deployment DEPLOYMENT_NAME
kubectl rollout undo deployment DEPLOYMENT_NAME        # rollback
kubectl rollout history deployment DEPLOYMENT_NAME

# Services
kubectl get services
kubectl expose deployment DEPLOYMENT_NAME --type=LoadBalancer --port=80

# StatefulSets
kubectl get statefulsets
kubectl scale statefulset STATEFULSET_NAME --replicas=3

Workload Type Use Case Key Characteristic
Deployment Stateless applications Rolling updates, easy scaling, interchangeable pods
StatefulSet Stateful applications (databases) Stable network IDs, persistent storage, ordered deployment
DaemonSet One pod per node (monitoring agents) Automatically runs on every node
Job One-time tasks Runs to completion, then stops
CronJob Scheduled tasks Runs jobs on a cron schedule

Exam trap: StatefulSets provide stable, persistent storage per pod and ordered scaling (pods are created/deleted sequentially: pod-0, pod-1, pod-2). Deployments do NOT guarantee pod ordering or stable storage. If a question involves a database workload needing persistent identity, StatefulSet is the answer.
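
The stable, ordered identities described in that trap can be illustrated with the StatefulSet naming convention (scale-down removes the highest ordinal first):

```python
def statefulset_pod_names(name, replicas):
    """StatefulSet pods get stable ordinal names: name-0, name-1, ..."""
    return [f"{name}-{i}" for i in range(replicas)]

print(statefulset_pod_names("db", 3))  # -> ['db-0', 'db-1', 'db-2']
# Scaling down to 2 removes the highest ordinal, db-2; db-0 and db-1 keep
# their identities and any bound PersistentVolumeClaims.
```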

Autoscaling in GKE

Three autoscaling mechanisms work at different levels:

Autoscaler Level What It Scales Based On
Horizontal Pod Autoscaler (HPA) Pod Number of pod replicas CPU utilization, memory, or custom metrics
Vertical Pod Autoscaler (VPA) Pod CPU/memory requests per pod Historical resource usage
Cluster Autoscaler (CA) Node Number of nodes in a node pool Pending pods that cannot be scheduled

# Enable HPA for a deployment (target 70% CPU)
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70

# View HPA status
kubectl get hpa

# Enable cluster autoscaler on a node pool
gcloud container clusters update my-cluster --zone=us-central1-a \
  --enable-autoscaling --min-nodes=1 --max-nodes=10 \
  --node-pool=default-pool

Exam trap: HPA and VPA should generally not be used together on the same metric (e.g., both targeting CPU). HPA changes replica count; VPA changes resource requests. Conflicting signals can cause instability. The exam may test whether you know this conflict exists.

Exam trap: Cluster Autoscaler only adds nodes when pods are pending due to insufficient resources. It does not scale based on CPU utilization of existing nodes. If all pods fit on existing nodes, the autoscaler does not add more -- even if nodes are highly utilized.
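
A toy model of the Cluster Autoscaler decision in that trap (assumed numbers; real scheduling also weighs memory, taints, and disruption constraints):

```python
import math

def nodes_to_add(pending_pod_cpus, node_cpus, node_utilization):
    """Scale up only for pending (unschedulable) pods.

    Utilization of existing nodes is deliberately ignored: if every pod
    fits, the autoscaler adds nothing, however busy the nodes are.
    """
    if not pending_pod_cpus:
        return 0  # no pending pods -> no scale-up, even at high utilization
    return math.ceil(sum(pending_pod_cpus) / node_cpus)

print(nodes_to_add([], node_cpus=4, node_utilization=0.95))        # -> 0
print(nodes_to_add([2, 2, 1], node_cpus=4, node_utilization=0.5))  # -> 2
```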


4.3 Managing Cloud Run Resources

Deploying and Managing Revisions

Every deployment to Cloud Run creates a new immutable revision. Revisions are point-in-time snapshots of your service configuration and container image.

# Deploy a new revision
gcloud run deploy my-service --image=us-central1-docker.pkg.dev/PROJECT/repo/app:v2 \
  --region=us-central1

# Deploy with specific configuration
gcloud run deploy my-service \
  --image=us-central1-docker.pkg.dev/PROJECT/repo/app:v2 \
  --region=us-central1 \
  --memory=512Mi --cpu=1 \
  --min-instances=1 --max-instances=10 \
  --concurrency=80 \
  --set-env-vars=DB_HOST=10.0.0.1

# List revisions
gcloud run revisions list --service=my-service --region=us-central1

# Describe a specific revision
gcloud run revisions describe my-service-00005-abc --region=us-central1

Traffic Splitting

Traffic splitting routes percentages of traffic to different revisions. This enables canary deployments and gradual rollouts.

# Route all traffic to the latest revision
gcloud run services update-traffic my-service --to-latest --region=us-central1

# Split traffic: 90% to current, 10% to new revision
gcloud run services update-traffic my-service \
  --to-revisions=my-service-00004-xyz=90,my-service-00005-abc=10 \
  --region=us-central1

# Roll back: send all traffic to a previous revision
gcloud run services update-traffic my-service \
  --to-revisions=my-service-00004-xyz=100 --region=us-central1

Exam trap: Setting --to-latest routes 100% of traffic to the latest ready revision and keeps following it: every new deployment immediately receives all traffic. For controlled rollouts, assign traffic percentages to specific revision names instead of using --to-latest.

Scaling Parameters

Parameter Flag Description Default
Min instances --min-instances Minimum instances always running (avoids cold starts) 0
Max instances --max-instances Maximum instances to scale to 100
Concurrency --concurrency Max concurrent requests per instance 80
CPU allocation --cpu-throttling / --no-cpu-throttling CPU allocated only during requests vs. always Throttled (during requests only)

Setting --min-instances=1 or higher eliminates cold starts but incurs charges even when idle. Setting --no-cpu-throttling (CPU always allocated) enables background processing but increases cost.

Exam trap: Cloud Run scales to zero by default (--min-instances=0). This means there may be cold start latency on the first request. If a question describes latency-sensitive workloads, setting min instances above zero is the correct approach.
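
A simplified capacity model for these flags, using the defaults from the table above (illustration only; Cloud Run's actual autoscaler also reacts to CPU utilization and startup latency):

```python
import math

def cloud_run_instances(concurrent_requests, concurrency=80,
                        min_instances=0, max_instances=100):
    """Instances needed to serve a given number of concurrent requests,
    clamped to the configured min/max bounds."""
    needed = math.ceil(concurrent_requests / concurrency) if concurrent_requests else 0
    return max(min_instances, min(needed, max_instances))

print(cloud_run_instances(800))                 # -> 10
print(cloud_run_instances(0, min_instances=1))  # -> 1 (warm instance, no cold start)
print(cloud_run_instances(100_000))             # -> 100 (capped at max)
```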


4.4 Managing Storage and Database Solutions

Cloud Storage Lifecycle Management

Object lifecycle management applies automatic actions to objects based on configurable conditions. Rules are set at the bucket level. An object must match ALL conditions in a rule for the action to trigger.

Action Description
Delete Remove objects matching conditions. Deleted objects become soft-deleted (recoverable for 7 days by default).
SetStorageClass Transition objects to a different storage class. Counts as a Class A operation but avoids retrieval fees.
AbortIncompleteMultipartUpload Clean up abandoned multipart uploads.
Condition Description
age Days since object creation
createdBefore Objects created before a UTC date
numNewerVersions For versioned buckets: applies when N newer versions exist
isLive true = current version; false = noncurrent (versioned buckets)
matchesStorageClass Filter by current storage class
daysSinceNoncurrentTime Days since object became noncurrent
matchesPrefix / matchesSuffix Pattern matching on object names

# View lifecycle configuration
gcloud storage buckets describe gs://my-bucket --format="json(lifecycle)"

# Set lifecycle from a JSON file
gcloud storage buckets update gs://my-bucket --lifecycle-file=lifecycle.json

Example lifecycle JSON to transition to Coldline after 90 days and delete after 365 days:

{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 90, "matchesStorageClass": ["STANDARD"]}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}

Exam trap: Lifecycle changes can take up to 24 hours to take effect. This is a processing delay, not a real-time operation. Also, lifecycle rules can only transition objects to a colder storage class (Standard -> Nearline -> Coldline -> Archive). You cannot use lifecycle rules to move objects to a warmer class.
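
The "ALL conditions must match" semantics can be sketched against the example rules above (a toy evaluator covering only the two conditions used in this section, not the full condition set):

```python
def rule_matches(obj, condition):
    """True only if the object satisfies every condition in the rule."""
    checks = {
        "age": lambda days: obj["age_days"] >= days,
        "matchesStorageClass": lambda classes: obj["storage_class"] in classes,
    }
    return all(checks[key](value) for key, value in condition.items())

obj = {"age_days": 120, "storage_class": "STANDARD"}
# 120-day-old Standard object: transition rule applies, delete rule does not.
print(rule_matches(obj, {"age": 90, "matchesStorageClass": ["STANDARD"]}))  # -> True
print(rule_matches(obj, {"age": 365}))                                      # -> False
```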

Object Versioning

When versioning is enabled, overwriting or deleting an object creates a noncurrent version rather than permanently removing it.

# Enable versioning
gcloud storage buckets update gs://my-bucket --versioning

# Disable versioning
gcloud storage buckets update gs://my-bucket --no-versioning

# List object versions
gcloud storage ls --all-versions gs://my-bucket/my-object

# Restore a noncurrent version (copy it to make it current)
gcloud storage cp gs://my-bucket/my-object#GENERATION gs://my-bucket/my-object

Exam trap: Disabling versioning does NOT delete existing noncurrent versions. They remain (and incur storage charges) until explicitly deleted or removed by a lifecycle rule. Use a lifecycle rule with numNewerVersions to manage noncurrent version cleanup.

Storage Class Transitions

Storage Class Minimum Duration Access Pattern Relative Cost (Storage) Relative Cost (Retrieval)
Standard None Frequent access Highest Lowest
Nearline 30 days Once per month Lower Higher
Coldline 90 days Once per quarter Lower still Higher still
Archive 365 days Once per year Lowest Highest

The minimum storage duration is a billing minimum. Objects deleted before the minimum duration are charged for the full minimum period. Access is always possible regardless of storage class -- classes differ only in pricing, not availability.
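
The billing minimum works out as a simple max (a sketch; actual bills prorate storage by the second within the charged period):

```python
def billable_days(stored_days, minimum_days):
    """Days charged for an object: early deletion still bills the minimum."""
    return max(stored_days, minimum_days)

# A Nearline object (30-day minimum) deleted after 10 days is billed for 30:
print(billable_days(10, 30))  # -> 30
# Kept past the minimum, you pay only for actual storage time:
print(billable_days(45, 30))  # -> 45
```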

Running Queries Across Database Services

Service Query Method Key Command / Interface
Cloud SQL SQL via client tools, Cloud SQL proxy, Console gcloud sql connect INSTANCE --user=root
BigQuery SQL via Console, bq, client libraries bq query --use_legacy_sql=false 'SELECT ...'
Cloud Spanner GoogleSQL or PostgreSQL dialect Console query editor, gcloud spanner databases execute-sql
Firestore Document queries via SDKs, Console Console Data viewer, client libraries
AlloyDB PostgreSQL-compatible SQL AlloyDB Auth Proxy + psql

BigQuery Dry Run (Cost Estimation)

A dry run validates a query and returns the estimated bytes processed without actually running it. This is essential for cost estimation before running expensive queries.

# Dry run from CLI
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `project.dataset.table`'

# Output shows: Query successfully validated. Estimated bytes processed: X

BigQuery charges based on bytes scanned. A dry run helps avoid unexpected costs on large tables. In the Console, the query validator shows estimated bytes before you click Run.

Exam trap: BigQuery on-demand pricing charges per TB scanned ($6.25/TB as of current pricing). A SELECT * on a 10 TB table costs approximately $62.50. The dry run flag is the correct answer when a question asks about estimating query cost before execution.
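
The arithmetic behind that trap, using the on-demand rate quoted above:

```python
def on_demand_query_cost(tb_scanned, price_per_tb=6.25):
    """Estimated on-demand cost for a query at the rate cited above."""
    return tb_scanned * price_per_tb

print(on_demand_query_cost(10))  # -> 62.5 (a SELECT * over 10 TB)
```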

Backups and Restores

Service Backup Method Restore Method
Cloud SQL Automated daily backups + on-demand Restore to same or new instance: gcloud sql backups restore BACKUP_ID --restore-instance=INSTANCE
Cloud Spanner On-demand or scheduled backups, configurable retention gcloud spanner backups create / gcloud spanner databases restore
AlloyDB Continuous backup (automatic) + on-demand Point-in-time recovery or backup restore
Firestore Managed export/import gcloud firestore export gs://bucket / gcloud firestore import gs://bucket/export

# Create an on-demand Cloud SQL backup
gcloud sql backups create --instance=my-instance

# List Cloud SQL backups
gcloud sql backups list --instance=my-instance

# Restore a Cloud SQL backup
gcloud sql backups restore BACKUP_ID --restore-instance=my-instance

Exam trap: Cloud SQL automated backups have a retention window (default 7 days, configurable up to 365 days). On-demand backups persist until you delete them. If a question asks about long-term backup retention, on-demand backups or export to Cloud Storage is the answer.

Monitoring Job Status

# Check BigQuery job status
bq show --job=true JOB_ID

# List recent BigQuery jobs
bq ls --jobs=true --max_results=10

# Check Dataflow job status
gcloud dataflow jobs list --region=us-central1
gcloud dataflow jobs describe JOB_ID --region=us-central1

# Cancel a Dataflow job
gcloud dataflow jobs cancel JOB_ID --region=us-central1

4.5 Managing Networking Resources

Subnets: Adding and Expanding

VPC networks contain subnets, and subnets can be expanded (but never shrunk).

# Add a subnet to an existing VPC
gcloud compute networks subnets create my-subnet \
  --network=my-vpc --region=us-central1 --range=10.0.1.0/24

# Expand a subnet's IP range (can only increase, never decrease)
gcloud compute networks subnets expand-ip-range my-subnet \
  --region=us-central1 --prefix-length=20

Exam trap: You can expand a subnet's CIDR range but you cannot shrink it. The new range must contain the original range. For example, expanding 10.0.1.0/24 to 10.0.0.0/20 is valid. This operation does not disrupt existing resources.
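
The containment rule can be checked locally with Python's ipaddress module before attempting the expansion:

```python
import ipaddress

def valid_expansion(current_cidr, proposed_cidr):
    """A subnet expansion is valid only if the new range is larger
    and fully contains the current range."""
    current = ipaddress.ip_network(current_cidr)
    proposed = ipaddress.ip_network(proposed_cidr)
    return proposed.prefixlen < current.prefixlen and current.subnet_of(proposed)

print(valid_expansion("10.0.1.0/24", "10.0.0.0/20"))  # -> True
print(valid_expansion("10.0.1.0/24", "10.1.0.0/20"))  # -> False (doesn't contain)
```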

Static IP Addresses

Type Scope Use Case Billing
External static Regional or global Public-facing services, load balancers Charged when reserved but NOT attached to a running resource
Internal static Regional Fixed internal addressing Free

# Reserve a regional external static IP
gcloud compute addresses create my-external-ip --region=us-central1

# Reserve a global external static IP (for global load balancers)
gcloud compute addresses create my-global-ip --global

# Reserve an internal static IP
gcloud compute addresses create my-internal-ip --region=us-central1 \
  --subnet=my-subnet --addresses=10.0.1.50

# List reserved addresses
gcloud compute addresses list

# Assign a reserved external IP to a VM
gcloud compute instances create my-vm --zone=us-central1-a \
  --address=my-external-ip

# Promote an ephemeral IP to static
gcloud compute addresses create my-ip --addresses=EPHEMERAL_IP \
  --region=us-central1

Exam trap: Reserved static external IPs that are NOT attached to a running resource incur charges. This is a common cost surprise. If a VM is stopped or deleted but its static IP is still reserved, you continue paying for the IP. The exam tests whether you know to release unused static IPs.

Cloud DNS

Cloud DNS is a managed authoritative DNS service.

Zone Type Visibility Use Case
Public Internet Route external traffic to your services
Private Specific VPC networks Internal DNS resolution within VPCs

# Create a public managed zone
gcloud dns managed-zones create my-zone \
  --dns-name="example.com." --description="My public zone"

# Create a private managed zone
gcloud dns managed-zones create my-private-zone \
  --dns-name="internal.example.com." --description="Internal zone" \
  --visibility=private --networks=my-vpc

# Add an A record
gcloud dns record-sets create www.example.com. --zone=my-zone \
  --type=A --ttl=300 --rrdatas="34.120.1.1"

# Add a CNAME record
gcloud dns record-sets create app.example.com. --zone=my-zone \
  --type=CNAME --ttl=300 --rrdatas="my-service.example.com."

# List records in a zone
gcloud dns record-sets list --zone=my-zone

Exam trap: DNS names in Cloud DNS must end with a trailing dot (e.g., example.com.). Forgetting the trailing dot is a common configuration error.

Cloud NAT

Cloud NAT provides outbound internet access for resources without external IP addresses. It operates at the VPC network level through a Cloud Router.

# Create a Cloud Router (required for Cloud NAT)
gcloud compute routers create my-router --network=my-vpc --region=us-central1

# Create a Cloud NAT gateway
gcloud compute routers nats create my-nat --router=my-router \
  --region=us-central1 --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges

Configuration Description
--auto-allocate-nat-external-ips Google automatically assigns external IPs
--nat-external-ip-pool=IP1,IP2 Use specific reserved static IPs for NAT
--nat-all-subnet-ip-ranges Apply NAT to all subnets in the region
--nat-custom-subnet-ip-ranges=SUBNET Apply NAT to specific subnets only

Exam trap: Cloud NAT is outbound only. It does not allow unsolicited inbound connections from the internet. If a question asks about allowing inbound traffic to VMs without external IPs, the answer is an internal load balancer or IAP, not Cloud NAT.


4.6 Monitoring and Logging

Cloud Monitoring Alerting Policies

Cloud Monitoring alerting policies define conditions that trigger notifications. An alerting policy has three components: the condition (what to monitor), the notification channel (how to alert), and the documentation (context for responders).

Alert condition types:

Condition Type Triggers When Duration
Metric threshold A metric value exceeds or falls below a threshold for a specified duration Configurable alignment period
Metric absence A monitored metric stops reporting data Up to 23.5 hours
Forecasted value A metric is predicted to breach a threshold 1-7 day prediction window
Log-based A specific log entry pattern is detected Immediate (rate-limited)
SQL-based A Log Analytics query returns matching results Evaluated on a schedule (Preview)

# Create a notification channel (email)
gcloud beta monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels=email_address=ops@example.com

# List alerting policies
gcloud alpha monitoring policies list

# Describe an alerting policy
gcloud alpha monitoring policies describe POLICY_ID

Key concepts for the exam:

  • Notification channels: Email, SMS, Slack, PagerDuty, Pub/Sub, webhooks, and mobile app
  • Uptime checks: HTTP(S) or TCP probes from global locations. Can trigger alerts on failure.
  • Snooze: Temporarily suppresses notifications without modifying the alert policy
  • Incidents: Created automatically when conditions are met; auto-close when conditions resolve

Exam trap: Log-based alerts and metric-based alerts are configured differently. Log-based alerts operate on log entries (Logs Explorer query syntax), while metric-based alerts operate on time-series data. You cannot use metric-threshold conditions to alert on specific log messages.

Custom Metrics

Custom metrics extend Cloud Monitoring beyond built-in GCP metrics. Create them via the Monitoring API or OpenTelemetry.

Custom metric types use the custom.googleapis.com/ prefix, for example custom.googleapis.com/my_app/request_latency. Metric descriptors are created through the Monitoring API (projects.metricDescriptors.create) or a client library, or are auto-created the first time a data point is written; there is no dedicated gcloud command for writing custom metric data.

Use custom metrics for application-specific KPIs (queue depth, business transactions, cache hit rate) that built-in metrics do not cover.

Log Sinks and Export

Log sinks route log entries to destinations for long-term storage, analysis, or integration with external tools. Every log entry is evaluated by all sinks in the resource; a sink routes entries that match its inclusion filter and do not match any exclusion filters.
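
That routing rule reduces to a small predicate. The sketch below uses plain Python functions as stand-ins for Logging query-language filters (the filter names and severity ladder are illustrative, not an API):

```python
def sink_routes_entry(entry, inclusion_filter, exclusion_filters=()):
    """A sink routes an entry that matches its inclusion filter and
    matches none of its exclusion filters."""
    return inclusion_filter(entry) and not any(f(entry) for f in exclusion_filters)

SEVERITIES = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
is_error = lambda e: SEVERITIES.index(e["severity"]) >= SEVERITIES.index("ERROR")
is_healthcheck = lambda e: e.get("path") == "/healthz"

print(sink_routes_entry({"severity": "ERROR"}, is_error, [is_healthcheck]))  # -> True
print(sink_routes_entry({"severity": "ERROR", "path": "/healthz"},
                        is_error, [is_healthcheck]))                          # -> False
```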

Supported destinations:

Destination Use Case Format
Log buckets Retention in Cloud Logging, Logs Explorer, Log Analytics Structured
BigQuery SQL analysis, joining with business data Streaming inserts
Cloud Storage Long-term archival, compliance JSON files (batched hourly)
Pub/Sub Streaming to external tools (Splunk, Datadog) JSON messages
Google Cloud project Cross-project log routing Re-routed through the destination project's sinks

# Create a sink that exports to Cloud Storage
gcloud logging sinks create my-storage-sink \
  storage.googleapis.com/my-log-bucket \
  --log-filter='resource.type="gce_instance" AND severity>=ERROR'

# Create a sink that exports to BigQuery
gcloud logging sinks create my-bq-sink \
  bigquery.googleapis.com/projects/PROJECT/datasets/my_logs \
  --log-filter='resource.type="gce_instance"'

# Create a sink that exports to Pub/Sub
gcloud logging sinks create my-pubsub-sink \
  pubsub.googleapis.com/projects/PROJECT/topics/my-topic \
  --log-filter='logName="projects/PROJECT/logs/my-app"'

# List sinks
gcloud logging sinks list

# Update a sink filter
gcloud logging sinks update my-storage-sink \
  --log-filter='resource.type="gce_instance" AND severity>=WARNING'

System-created sinks (cannot be deleted):

Sink Destination Modifiable?
_Required _Required log bucket No. Routes Admin Activity, System Event, and Access Transparency logs.
_Default _Default log bucket Yes. Routes everything else. Can add exclusion filters to reduce costs.

Exam trap: After creating a sink, you must grant the sink's service account writer permissions on the destination. The sink creation output displays the service account. For BigQuery, grant roles/bigquery.dataEditor; for Cloud Storage, grant roles/storage.objectCreator; for Pub/Sub, grant roles/pubsub.publisher.

Organization and folder sinks:

  • Non-intercepting (default): Routes matching logs but lets child resource sinks also process them
  • Intercepting: Blocks log entries from flowing to child resource sinks (except _Required). Useful for centralized logging where you do not want projects to also retain logs.

Log Buckets, Analytics, and Routers

Concept Description
Log bucket Storage container for log entries in Cloud Logging. Each project has _Required (400-day retention, immutable) and _Default (30-day default retention, modifiable).
Custom log bucket User-created bucket with configurable retention (1-3650 days). Can enable Log Analytics for SQL querying.
Log Analytics Enables BigQuery-compatible SQL queries directly on log buckets without exporting. Requires upgrading the bucket.
Log router The pipeline that evaluates all sinks against incoming log entries. Processes every entry, applies inclusion/exclusion filters, and routes to matched destinations.

# Create a custom log bucket with 90-day retention
gcloud logging buckets create my-custom-bucket \
  --location=us-central1 --retention-days=90

# Enable Log Analytics on a bucket
gcloud logging buckets update my-custom-bucket \
  --location=us-central1 --enable-analytics

# Update the _Default bucket retention to 90 days
gcloud logging buckets update _Default --location=global --retention-days=90

Exam trap: The _Required bucket has a fixed 400-day retention that cannot be changed. The _Default bucket has a 30-day default retention that CAN be customized. If a question asks about retaining Admin Activity audit logs for more than 400 days, you need to create a sink to export them to Cloud Storage or BigQuery.

Logs Explorer Filtering

Logs Explorer uses the Cloud Logging query language to filter log entries.

Common filter patterns for the exam:

# Filter by resource type
resource.type="gce_instance"

# Filter by severity
severity>=ERROR

# Filter by log name
logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"

# Filter by text payload
textPayload:"error"

# Filter by JSON payload field
jsonPayload.status=500

# Combine conditions (AND is implicit between lines)
resource.type="gce_instance"
severity>=WARNING
resource.labels.instance_id="1234567890"

# Time range
timestamp>="2026-02-25T00:00:00Z"
timestamp<"2026-02-26T00:00:00Z"

Exam trap: In Logs Explorer, separate lines are implicitly joined with AND. To use OR, you must write it explicitly: severity=ERROR OR severity=CRITICAL. Newline-separated conditions are always AND.
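
The implicit-AND behavior amounts to collapsing the lines into one expression (an illustration of the semantics, not the actual parser):

```python
def combine_filter_lines(lines):
    """Newline-separated Logs Explorer conditions are implicitly ANDed."""
    return " AND ".join(line.strip() for line in lines if line.strip())

query = combine_filter_lines([
    'resource.type="gce_instance"',
    'severity>=WARNING',
])
print(query)  # -> resource.type="gce_instance" AND severity>=WARNING
```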

Ops Agent and Managed Service for Prometheus

Ops Agent is the unified agent for collecting both metrics and logs from Compute Engine VMs. It replaces the legacy Monitoring and Logging agents.

# Install Ops Agent on a VM (run from within the VM)
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Verify the agent is running
sudo systemctl status google-cloud-ops-agent

Agent Collects Status
Ops Agent Metrics + logs (unified) Current, recommended
Legacy Monitoring Agent Metrics only Deprecated
Legacy Logging Agent Logs only Deprecated

Managed Service for Prometheus provides a fully managed, multi-cloud Prometheus-compatible monitoring solution. It stores metrics in Cloud Monitoring and supports PromQL queries.

  • Collects Prometheus metrics from GKE workloads and Compute Engine
  • Data stored in Monarch (Google's global monitoring backend)
  • Query using PromQL in Cloud Monitoring or Grafana
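
As an illustration, a standard PromQL expression you might run in the Cloud Monitoring PromQL tab or Grafana (the metric name is a common Kubernetes/cAdvisor metric, used here only as an example):

```
# Per-container CPU usage rate over the last 5 minutes (illustrative metric)
rate(container_cpu_usage_seconds_total[5m])
```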

Exam trap: The Ops Agent must be installed on every VM you want to monitor. It is NOT installed by default. GKE nodes use a different mechanism (built-in integration with Cloud Monitoring). If a question asks about missing VM metrics/logs, checking whether the Ops Agent is installed is the first troubleshooting step.

Audit Logs

Cloud Audit Logs record administrative activities and data access for Google Cloud resources. There are four types:

| Log Type | What It Records | Enabled by Default | Can Be Disabled | Retention (bucket) |
|---|---|---|---|---|
| Admin Activity | Configuration changes (create/delete/update resources, set IAM policies) | Yes | No | 400 days (_Required) |
| Data Access | API calls that read resource configuration/metadata or read/write user data | No (except BigQuery) | Yes | 30 days (_Default, configurable) |
| System Event | Google-initiated system actions (auto-healing, live migration) | Yes | No | 400 days (_Required) |
| Policy Denied | Access denied due to VPC Service Controls or Organization Policy violations | Yes | No (but can be excluded from sinks) | 30 days (_Default, configurable) |

```shell
# View Admin Activity logs
gcloud logging read 'logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"' \
  --limit=10

# View Data Access logs
gcloud logging read 'logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Fdata_access"' \
  --limit=10

# Enable Data Access logs for all services (project level)
gcloud projects get-iam-policy PROJECT_ID --format=json > policy.json
# Edit policy.json to add auditConfigs, then:
gcloud projects set-iam-policy PROJECT_ID policy.json
```

Data Access log configuration uses auditConfigs in the IAM policy:

```json
{
  "auditConfigs": [
    {
      "service": "allServices",
      "auditLogConfigs": [
        { "logType": "ADMIN_READ" },
        { "logType": "DATA_READ" },
        { "logType": "DATA_WRITE" }
      ]
    }
  ]
}

IAM roles for viewing audit logs:

| Log Type | Required Role |
|---|---|
| Admin Activity, System Event | roles/logging.viewer (Logs Viewer) |
| Data Access, Policy Denied | roles/logging.privateLogViewer (Private Logs Viewer) |

Exam trap: Data Access logs are disabled by default for most services (BigQuery is the notable exception where they are always on). They can generate very high volume and cost. If a question asks about tracking who read specific data, you need to enable Data Access audit logs for that service first.

Exam trap: Admin Activity and System Event logs are stored in the _Required bucket with 400-day retention that cannot be modified. Data Access and Policy Denied logs go to the _Default bucket with a configurable retention (default 30 days). To retain any logs beyond their bucket retention, export via a sink.


Quick-Reference: Decision Tree

| Scenario | Correct Approach |
|---|---|
| Need to change a VM's machine type | Stop the VM, then set-machine-type |
| VM has no external IP but needs outbound internet | Configure Cloud NAT on the VPC |
| Need to back up a Compute Engine disk | Create a snapshot (incremental by default) |
| Deploy a new version of a Cloud Run service gradually | Traffic splitting between revisions |
| GKE pods failing to schedule (insufficient resources) | Cluster Autoscaler adds nodes automatically |
| Cloud Run cold start latency is too high | Set --min-instances=1 or higher |
| Move Cloud Storage objects to cheaper storage after 90 days | Lifecycle rule with age condition and SetStorageClass action |
| Estimate BigQuery query cost before running | Use --dry_run flag |
| Need VM metrics/logs in Cloud Monitoring | Install the Ops Agent |
| Export logs to Splunk | Create a Pub/Sub sink, connect Splunk to the Pub/Sub subscription |
| Retain Admin Activity logs beyond 400 days | Create a sink to Cloud Storage or BigQuery |
| Track who accessed sensitive data in Cloud Storage | Enable Data Access audit logs for Cloud Storage |
| Alert when a specific error appears in logs | Create a log-based alerting policy |
| Alert when CPU exceeds 80% for 5 minutes | Create a metric-threshold alerting policy |
| Reduce Cloud Logging costs | Add exclusion filters to the _Default sink |
| VMs without external IPs need to pull packages | Cloud NAT for outbound internet access |
| Need DNS resolution only within VPCs | Private Cloud DNS zone |
| Reserve an IP for a load balancer | gcloud compute addresses create --global |

Common Exam Traps Summary

  1. Deleting a VM deletes the boot disk by default -- non-boot disks are preserved unless --delete-disks=all is specified.
  2. Machine type changes require stopping the VM -- you cannot resize CPU/memory on a running instance.
  3. Snapshots are incremental -- only changed blocks are stored after the first snapshot. Deleting an earlier snapshot does not lose data; the incremental chain is automatically reconciled.
  4. Image families always point to the latest non-deprecated image -- use families in automation for automatic updates.
  5. HPA and VPA conflict on the same metric -- do not use both to scale on CPU simultaneously.
  6. Cluster Autoscaler scales on pending pods, not node CPU -- high node utilization alone does not trigger scaling.
  7. Cloud Run --to-latest sends all traffic to new deployments immediately -- for canary deployments, route traffic to specific revision names.
  8. Lifecycle rule changes take up to 24 hours -- not real-time.
  9. Lifecycle transitions are one-direction only -- Standard to Nearline to Coldline to Archive. You cannot use lifecycle rules to warm up objects.
  10. Disabling versioning does not delete noncurrent versions -- they persist until explicitly removed.
  11. Reserved static IPs incur charges when unattached -- release IPs you are not using.
  12. Subnet IP ranges can be expanded but never shrunk -- plan CIDR ranges carefully.
  13. Cloud NAT is outbound only -- it does not provide inbound connectivity.
  14. Ops Agent is not installed by default on VMs -- missing agent is the top reason for absent VM metrics.
  15. Data Access audit logs are disabled by default (except BigQuery) -- enable them explicitly to track data reads/writes.
  16. _Required bucket: 400-day retention, immutable. _Default bucket: 30-day default, configurable -- export to Cloud Storage or BigQuery for longer retention.
  17. Sink service accounts need destination permissions -- the most common cause of sink failures is missing IAM grants on the destination resource.
  18. Logs Explorer lines are implicitly AND -- use explicit OR for disjunctive filters.

References