Domain 4: Ensuring Successful Operation of a Cloud Solution (~20%)
Domain 4 accounts for approximately 20% of the Associate Cloud Engineer exam, translating to roughly 10-12 questions. This is one of the most operationally focused domains: you must know how to manage running resources, troubleshoot issues, configure monitoring and logging, and handle day-to-day operational tasks across Compute Engine, GKE, Cloud Run, storage, databases, and networking. The exam tests six sub-domains with heavy emphasis on gcloud commands and Console workflows.
4.1 Managing Compute Engine Resources
Starting, Stopping, and Deleting Instances
Compute Engine VMs have distinct lifecycle states that determine billing and behavior.
| Action | Command | Billing | Persistent Disk |
|---|---|---|---|
| Start | `gcloud compute instances start INSTANCE` | Resumes full billing | Preserved |
| Stop | `gcloud compute instances stop INSTANCE` | No CPU/memory charge; disk charges continue | Preserved |
| Suspend | `gcloud compute instances suspend INSTANCE` | No CPU/memory charge; charges for suspended-state memory + disk | Preserved, plus memory saved |
| Delete | `gcloud compute instances delete INSTANCE` | All billing stops | Boot disk deleted by default; non-boot disks preserved unless `--delete-disks=all` |
| Reset | `gcloud compute instances reset INSTANCE` | Continues | Preserved (hard reset, no graceful shutdown) |
# Stop an instance
gcloud compute instances stop my-vm --zone=us-central1-a
# Start an instance
gcloud compute instances start my-vm --zone=us-central1-a
# Delete an instance but keep all disks
gcloud compute instances delete my-vm --zone=us-central1-a --keep-disks=all
# Delete an instance and all attached disks
gcloud compute instances delete my-vm --zone=us-central1-a --delete-disks=all
Exam trap: When you delete a VM, the boot disk is deleted by default. Non-boot (additional) disks are NOT deleted by default. If a question asks about preserving data after VM deletion, use `--keep-disks=all` or detach the disk first.
Editing VM Configuration
Some properties can be changed while the VM is running; others require a stop first.
| Property | Requires Stop? | Command |
|---|---|---|
| Machine type | Yes | gcloud compute instances set-machine-type INSTANCE --machine-type=e2-standard-4 |
| Labels | No | gcloud compute instances update INSTANCE --update-labels=env=prod |
| Metadata | No | gcloud compute instances add-metadata INSTANCE --metadata=key=value |
| Tags (network) | No | gcloud compute instances add-tags INSTANCE --tags=http-server |
| Service account | Yes | Must stop, then use gcloud compute instances set-service-account |
| Attached disks | No (attach); Yes (detach boot) | gcloud compute instances attach-disk INSTANCE --disk=DISK_NAME |
Exam trap: Changing the machine type requires stopping the instance first. You cannot resize a running VM's CPU/memory. This is a frequent exam question.
SSH and RDP Connections
# SSH into a Linux VM (uses IAP tunnel by default if no external IP)
gcloud compute ssh my-vm --zone=us-central1-a
# SSH through IAP explicitly
gcloud compute ssh my-vm --zone=us-central1-a --tunnel-through-iap
# SSH with a specific user
gcloud compute ssh user@my-vm --zone=us-central1-a
# Create an RDP tunnel for Windows VMs
gcloud compute start-iap-tunnel my-windows-vm 3389 --local-host-port=localhost:3389
OS Login is Google's recommended method for managing SSH access. When enabled, SSH keys are managed centrally through IAM rather than per-instance metadata.
| Feature | Metadata SSH Keys | OS Login |
|---|---|---|
| Key management | Per-instance or project-level metadata | Centralized via Cloud Identity/Workspace |
| IAM integration | None | Full: roles/compute.osLogin (standard), roles/compute.osAdminLogin (sudo) |
| 2FA support | No | Yes (with roles/compute.osLogin + 2FA configured) |
| Audit | Limited | Full audit via Cloud Audit Logs |
# Enable OS Login at project level
gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE
# Enable OS Login on a specific instance
gcloud compute instances add-metadata my-vm --metadata enable-oslogin=TRUE
Exam trap: OS Login overrides metadata-based SSH keys when enabled. If a user cannot SSH after OS Login is enabled, they need `roles/compute.osLogin` (or `roles/compute.osAdminLogin` for sudo access) on the project or instance.
Snapshots
Snapshots are incremental backups of persistent disks. After the first full snapshot, subsequent snapshots only store changed blocks.
# Create a snapshot
gcloud compute disks snapshot my-disk --zone=us-central1-a \
--snapshot-names=my-snapshot
# Create a snapshot with a storage location
gcloud compute disks snapshot my-disk --zone=us-central1-a \
--snapshot-names=my-snapshot --storage-location=us
# List snapshots
gcloud compute snapshots list
# Create a disk from a snapshot
gcloud compute disks create new-disk --source-snapshot=my-snapshot \
--zone=us-east1-b
# Create a VM from a snapshot (create disk first, then VM)
gcloud compute instances create my-new-vm --zone=us-east1-b \
--disk=name=new-disk,boot=yes
Snapshot schedules automate recurring backups:
# Create a snapshot schedule (daily, retain 7 days)
gcloud compute resource-policies create snapshot-schedule my-schedule \
--region=us-central1 \
--max-retention-days=7 \
--daily-schedule \
--start-time=02:00
# Attach schedule to a disk
gcloud compute disks add-resource-policies my-disk \
--resource-policies=my-schedule --zone=us-central1-a
| Snapshot Type | Description | Use Case |
|---|---|---|
| Standard | Point-in-time backup of a persistent disk | Disaster recovery, migration |
| Archive | Lower-cost storage for long-term retention | Compliance, long-term backups |
| Instant | Rapid restore for zonal persistent disks | Fast recovery (same zone only) |
Exam trap: Snapshots are global resources but can be restricted to a specific storage location (multi-region or region). When restoring a snapshot to a different zone or region, you create a new disk from the snapshot in the target location -- the snapshot itself is not moved.
Images
| Image Type | Description | Example |
|---|---|---|
| Public images | Google-provided OS images | debian-cloud/debian-12, ubuntu-os-cloud/ubuntu-2404-lts-amd64 |
| Custom images | Images you create from disks, snapshots, or other images | my-project/my-custom-image |
| Image families | Group of related images; always points to the latest non-deprecated image | debian-12, ubuntu-2404-lts |
# Create a custom image from a disk (stop VM first for consistency)
gcloud compute images create my-image --source-disk=my-disk \
--source-disk-zone=us-central1-a --family=my-app-images
# Create a custom image from a snapshot
gcloud compute images create my-image --source-snapshot=my-snapshot
# Create a VM from the latest image in a family
gcloud compute instances create my-vm \
--image-family=my-app-images --image-project=my-project
# Deprecate an image
gcloud compute images deprecate old-image --state=DEPRECATED \
--replacement=new-image
Exam trap: When you specify `--image-family`, Compute Engine uses the most recent non-deprecated image in that family. This is the recommended approach for automation -- new deployments automatically pick up updated images without changing scripts.
4.2 Managing GKE Resources
Cluster and Node Pool Management
# View cluster details
gcloud container clusters describe my-cluster --zone=us-central1-a
# List clusters
gcloud container clusters list
# Get credentials for kubectl
gcloud container clusters get-credentials my-cluster --zone=us-central1-a
# Resize a node pool (manual scaling)
gcloud container clusters resize my-cluster --node-pool=default-pool \
--num-nodes=5 --zone=us-central1-a
# Add a node pool
gcloud container node-pools create new-pool --cluster=my-cluster \
--zone=us-central1-a --machine-type=e2-standard-4 --num-nodes=3
# Delete a node pool
gcloud container node-pools delete old-pool --cluster=my-cluster \
--zone=us-central1-a
Cluster modes determine how nodes are managed:
| Mode | Node Management | Billing | Use Case |
|---|---|---|---|
| Standard | You manage node pools, upgrades, scaling | Per node (VM pricing) | Full control, custom configurations |
| Autopilot | Google manages nodes; you define pods | Per pod resource request | Hands-off operations, optimized costs |
Exam trap: Autopilot clusters do not expose node pools for direct management. You cannot SSH into Autopilot nodes. If a question involves node-level configuration (custom kernel settings, specific machine types per node), Standard mode is required.
Artifact Registry
Artifact Registry is the recommended container image and package repository for GKE.
# Configure Docker authentication to Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev
# Tag and push an image
docker tag my-app us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-app:v1
docker push us-central1-docker.pkg.dev/PROJECT_ID/my-repo/my-app:v1
# List images in a repository
gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/my-repo
GKE nodes need the roles/artifactregistry.reader role on the Artifact Registry repository to pull images. When using Workload Identity, ensure the GKE service account has this role.
Managing Kubernetes Workloads
Key kubectl commands for the exam:
# Pods
kubectl get pods --all-namespaces
kubectl describe pod POD_NAME
kubectl logs POD_NAME
kubectl logs POD_NAME --container=CONTAINER_NAME # multi-container pod
kubectl delete pod POD_NAME
kubectl exec -it POD_NAME -- /bin/bash
# Deployments
kubectl get deployments
kubectl describe deployment DEPLOYMENT_NAME
kubectl scale deployment DEPLOYMENT_NAME --replicas=5
kubectl rollout status deployment DEPLOYMENT_NAME
kubectl rollout undo deployment DEPLOYMENT_NAME # rollback
kubectl rollout history deployment DEPLOYMENT_NAME
# Services
kubectl get services
kubectl expose deployment DEPLOYMENT_NAME --type=LoadBalancer --port=80
# StatefulSets
kubectl get statefulsets
kubectl scale statefulset STATEFULSET_NAME --replicas=3
| Workload Type | Use Case | Key Characteristic |
|---|---|---|
| Deployment | Stateless applications | Rolling updates, easy scaling, interchangeable pods |
| StatefulSet | Stateful applications (databases) | Stable network IDs, persistent storage, ordered deployment |
| DaemonSet | One pod per node (monitoring agents) | Automatically runs on every node |
| Job | One-time tasks | Runs to completion, then stops |
| CronJob | Scheduled tasks | Runs jobs on a cron schedule |
Exam trap: StatefulSets provide stable, persistent storage per pod and ordered scaling (pods are created/deleted sequentially: pod-0, pod-1, pod-2). Deployments do NOT guarantee pod ordering or stable storage. If a question involves a database workload needing persistent identity, StatefulSet is the answer.
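The distinction can be made concrete with a minimal StatefulSet manifest. This is a sketch only -- the names, image, and storage size are illustrative, not from the exam guide:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db                 # illustrative name
spec:
  serviceName: my-db          # headless Service giving each pod a stable DNS name
  replicas: 3                 # pods created in order: my-db-0, my-db-1, my-db-2
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
      - name: db
        image: postgres:16    # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:       # each pod gets its OWN PersistentVolumeClaim,
  - metadata:                 # which survives pod rescheduling
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

The `volumeClaimTemplates` section is what a Deployment lacks: every replica receives a dedicated PersistentVolumeClaim bound to its ordinal identity.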
Autoscaling in GKE
Three autoscaling mechanisms work at different levels:
| Autoscaler | Level | What It Scales | Based On |
|---|---|---|---|
| Horizontal Pod Autoscaler (HPA) | Pod | Number of pod replicas | CPU utilization, memory, or custom metrics |
| Vertical Pod Autoscaler (VPA) | Pod | CPU/memory requests per pod | Historical resource usage |
| Cluster Autoscaler (CA) | Node | Number of nodes in a node pool | Pending pods that cannot be scheduled |
# Enable HPA for a deployment (target 70% CPU)
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70
# View HPA status
kubectl get hpa
# Enable cluster autoscaler on a node pool
gcloud container clusters update my-cluster --zone=us-central1-a \
--enable-autoscaling --min-nodes=1 --max-nodes=10 \
--node-pool=default-pool
Exam trap: HPA and VPA should generally not be used together on the same metric (e.g., both targeting CPU). HPA changes replica count; VPA changes resource requests. Conflicting signals can cause instability. The exam may test whether you know this conflict exists.
Exam trap: Cluster Autoscaler only adds nodes when pods are pending due to insufficient resources. It does not scale based on CPU utilization of existing nodes. If all pods fit on existing nodes, the autoscaler does not add more -- even if nodes are highly utilized.
4.3 Managing Cloud Run Resources
Deploying and Managing Revisions
Every deployment to Cloud Run creates a new immutable revision. Revisions are point-in-time snapshots of your service configuration and container image.
# Deploy a new revision
gcloud run deploy my-service --image=us-central1-docker.pkg.dev/PROJECT/repo/app:v2 \
--region=us-central1
# Deploy with specific configuration
gcloud run deploy my-service \
--image=us-central1-docker.pkg.dev/PROJECT/repo/app:v2 \
--region=us-central1 \
--memory=512Mi --cpu=1 \
--min-instances=1 --max-instances=10 \
--concurrency=80 \
--set-env-vars=DB_HOST=10.0.0.1
# List revisions
gcloud run revisions list --service=my-service --region=us-central1
# Describe a specific revision
gcloud run revisions describe my-service-00005-abc --region=us-central1
Traffic Splitting
Traffic splitting routes percentages of traffic to different revisions. This enables canary deployments and gradual rollouts.
# Route all traffic to the latest revision
gcloud run services update-traffic my-service --to-latest --region=us-central1
# Split traffic: 90% to current, 10% to new revision
gcloud run services update-traffic my-service \
--to-revisions=my-service-00004-xyz=90,my-service-00005-abc=10 \
--region=us-central1
# Roll back: send all traffic to a previous revision
gcloud run services update-traffic my-service \
--to-revisions=my-service-00004-xyz=100 --region=us-central1
Exam trap: Setting `--to-latest` means the most recently deployed ready revision always receives all traffic: as soon as you deploy a new revision, it immediately takes 100% of traffic. For controlled rollouts, assign traffic percentages to specific revision names instead of using `--to-latest`.
Scaling Parameters
| Parameter | Flag | Description | Default |
|---|---|---|---|
| Min instances | `--min-instances` | Minimum instances always running (avoids cold starts) | 0 |
| Max instances | `--max-instances` | Maximum instances to scale to | 100 |
| Concurrency | `--concurrency` | Max concurrent requests per instance | 80 |
| CPU allocation | `--cpu-throttling` / `--no-cpu-throttling` | CPU allocated only during requests vs. always | Throttled (during requests only) |
Setting `--min-instances=1` or higher eliminates cold starts but incurs charges even when idle. Setting `--no-cpu-throttling` (CPU always allocated) enables background processing but increases cost.
Exam trap: Cloud Run scales to zero by default (`--min-instances=0`), so the first request after an idle period may experience cold-start latency. If a question describes a latency-sensitive workload, setting min instances above zero is the correct approach.
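The scaling flags also bound total capacity: a service can never handle more concurrent requests than max instances times per-instance concurrency. A back-of-envelope sketch using the example values from the deploy command above (10 max instances, concurrency 80):

```shell
# Capacity ceiling = max instances x concurrency per instance.
# Values below match the example deploy flags; adjust for your own service.
max_instances=10
concurrency=80
echo $(( max_instances * concurrency ))   # → 800 concurrent requests at most
```

If sustained load exceeds this ceiling, requests queue and may eventually return 429s, so `--max-instances` is as much a cost guardrail as a scaling limit.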
4.4 Managing Storage and Database Solutions
Cloud Storage Lifecycle Management
Object lifecycle management applies automatic actions to objects based on configurable conditions. Rules are set at the bucket level. An object must match ALL conditions in a rule for the action to trigger.
| Action | Description |
|---|---|
| Delete | Remove objects matching conditions. Deleted objects become soft-deleted (recoverable for 7 days by default). |
| SetStorageClass | Transition objects to a different storage class. Counts as a Class A operation but avoids retrieval fees. |
| AbortIncompleteMultipartUpload | Clean up abandoned multipart uploads. |
| Condition | Description |
|---|---|
| `age` | Days since object creation |
| `createdBefore` | Objects created before a UTC date |
| `numNewerVersions` | For versioned buckets: applies when N newer versions exist |
| `isLive` | `true` = current version; `false` = noncurrent (versioned buckets) |
| `matchesStorageClass` | Filter by current storage class |
| `daysSinceNoncurrentTime` | Days since object became noncurrent |
| `matchesPrefix` / `matchesSuffix` | Pattern matching on object names |
# View lifecycle configuration
gcloud storage buckets describe gs://my-bucket --format="json(lifecycle)"
# Set lifecycle from a JSON file
gcloud storage buckets update gs://my-bucket --lifecycle-file=lifecycle.json
Example lifecycle JSON to transition to Coldline after 90 days and delete after 365 days:
{
"rule": [
{
"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
"condition": {"age": 90, "matchesStorageClass": ["STANDARD"]}
},
{
"action": {"type": "Delete"},
"condition": {"age": 365}
}
]
}
Exam trap: Lifecycle changes can take up to 24 hours to take effect. This is a processing delay, not a real-time operation. Also, lifecycle rules can only transition objects to a colder storage class (Standard -> Nearline -> Coldline -> Archive). You cannot use lifecycle rules to move objects to a warmer class.
Object Versioning
When versioning is enabled, overwriting or deleting an object creates a noncurrent version rather than permanently removing it.
# Enable versioning
gcloud storage buckets update gs://my-bucket --versioning
# Disable versioning
gcloud storage buckets update gs://my-bucket --no-versioning
# List object versions
gcloud storage ls --all-versions gs://my-bucket/my-object
# Restore a noncurrent version (copy it to make it current)
gcloud storage cp gs://my-bucket/my-object#GENERATION gs://my-bucket/my-object
Exam trap: Disabling versioning does NOT delete existing noncurrent versions. They remain (and incur storage charges) until explicitly deleted or removed by a lifecycle rule. Use a lifecycle rule with `numNewerVersions` to manage noncurrent version cleanup.
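Such a cleanup rule follows the same JSON shape as the lifecycle example earlier. This sketch deletes noncurrent versions once three newer versions exist -- the threshold of 3 is illustrative:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 3, "isLive": false}
    }
  ]
}
```

Applied with `gcloud storage buckets update gs://my-bucket --lifecycle-file=lifecycle.json`, this keeps at most three noncurrent versions per object.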
Storage Class Transitions
| Storage Class | Minimum Duration | Access Pattern | Relative Cost (Storage) | Relative Cost (Retrieval) |
|---|---|---|---|---|
| Standard | None | Frequent access | Highest | Lowest |
| Nearline | 30 days | Once per month | Lower | Higher |
| Coldline | 90 days | Once per quarter | Lower still | Higher still |
| Archive | 365 days | Once per year | Lowest | Highest |
The minimum storage duration is a billing minimum. Objects deleted before the minimum duration are charged for the full minimum period. Access is always possible regardless of storage class -- classes differ only in pricing, not availability.
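The billing minimum reduces to simple arithmetic: you pay for the greater of the actual holding period and the class minimum. An assumed example of a Nearline object (30-day minimum) deleted on day 10:

```shell
# Billable days = max(days actually held, class minimum duration).
# Example values: Nearline object deleted after 10 days.
held_days=10
min_days=30
billed=$(( held_days < min_days ? min_days : held_days ))
echo "$billed"   # → 30 (20 of those days are the early-deletion charge)
```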
Running Queries Across Database Services
| Service | Query Method | Key Command / Interface |
|---|---|---|
| Cloud SQL | SQL via client tools, Cloud SQL Auth Proxy, Console | `gcloud sql connect INSTANCE --user=root` |
| BigQuery | SQL via Console, `bq` CLI, client libraries | `bq query --use_legacy_sql=false 'SELECT ...'` |
| Cloud Spanner | GoogleSQL or PostgreSQL-dialect SQL | Console query editor, `gcloud spanner databases execute-sql` |
| Firestore | Document queries via SDKs, Console | Console data viewer, client libraries |
| AlloyDB | PostgreSQL-compatible SQL | `psql`, typically through the AlloyDB Auth Proxy |
BigQuery Dry Run (Cost Estimation)
A dry run validates a query and returns the estimated bytes processed without actually running it. This is essential for cost estimation before running expensive queries.
# Dry run from CLI
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `project.dataset.table`'
# Output shows: Query successfully validated. Estimated bytes processed: X
BigQuery charges based on bytes scanned. A dry run helps avoid unexpected costs on large tables. In the Console, the query validator shows estimated bytes before you click Run.
Exam trap: BigQuery on-demand pricing charges per TiB scanned ($6.25/TiB at the time of writing). A `SELECT *` over a 10 TiB table therefore costs approximately $62.50. The `--dry_run` flag is the correct answer when a question asks about estimating query cost before execution.
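Turning a dry-run byte count into a dollar figure is a one-line calculation. A rough sketch, assuming the $6.25-per-TiB on-demand rate quoted above (verify current pricing before relying on it):

```shell
# Estimate on-demand cost from a dry run's "estimated bytes processed".
# 10995116277760 bytes = 10 TiB, an assumed example figure.
bytes=10995116277760
awk -v b="$bytes" 'BEGIN { printf "%.2f\n", b / (1024^4) * 6.25 }'   # → 62.50
```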
Backups and Restores
| Service | Backup Method | Restore Method |
|---|---|---|
| Cloud SQL | Automated daily backups + on-demand | Restore to same or new instance: gcloud sql backups restore BACKUP_ID --restore-instance=INSTANCE |
| Cloud Spanner | Automatic, configurable retention | gcloud spanner backups create / gcloud spanner databases restore |
| AlloyDB | Continuous backup (automatic) + on-demand | Point-in-time recovery or backup restore |
| Firestore | Managed export/import | gcloud firestore export gs://bucket / gcloud firestore import gs://bucket/export |
# Create an on-demand Cloud SQL backup
gcloud sql backups create --instance=my-instance
# List Cloud SQL backups
gcloud sql backups list --instance=my-instance
# Restore a Cloud SQL backup
gcloud sql backups restore BACKUP_ID --restore-instance=my-instance
Exam trap: Cloud SQL automated backups have a retention window (default 7 days, configurable up to 365 days). On-demand backups persist until you delete them. If a question asks about long-term backup retention, on-demand backups or export to Cloud Storage is the answer.
Monitoring Job Status
# Check BigQuery job status
bq show --job=true JOB_ID
# List recent BigQuery jobs
bq ls --jobs=true --max_results=10
# Check Dataflow job status
gcloud dataflow jobs list --region=us-central1
gcloud dataflow jobs describe JOB_ID --region=us-central1
# Cancel a Dataflow job
gcloud dataflow jobs cancel JOB_ID --region=us-central1
4.5 Managing Networking Resources
Subnets: Adding and Expanding
VPC networks contain subnets, and subnets can be expanded (but never shrunk).
# Add a subnet to an existing VPC
gcloud compute networks subnets create my-subnet \
--network=my-vpc --region=us-central1 --range=10.0.1.0/24
# Expand a subnet's IP range (can only increase, never decrease)
gcloud compute networks subnets expand-ip-range my-subnet \
--region=us-central1 --prefix-length=20
Exam trap: You can expand a subnet's CIDR range but you cannot shrink it, and the new range must contain the original range. For example, expanding `10.0.1.0/24` to `10.0.0.0/20` is valid. The operation does not disrupt existing resources.
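To see why that expansion is valid, you can check containment with bit arithmetic: a /20 keeps only the top four bits of the third octet, so if masking the original subnet's third octet yields the new network's octet, the /24 sits inside the /20. A quick sketch:

```shell
# /20 mask on the third octet is 0xF0 (binary 11110000).
# 10.0.1.0's third octet is 1; masking it gives the /20 network's octet.
third_octet=1
echo $(( third_octet & 0xF0 ))   # → 0, so 10.0.1.0/24 is inside 10.0.0.0/20
```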
Static IP Addresses
| Type | Scope | Use Case | Billing |
|---|---|---|---|
| External static | Regional or global | Public-facing services, load balancers | Charged when reserved but NOT attached to a running resource |
| Internal static | Regional | Fixed internal addressing | Free |
# Reserve a regional external static IP
gcloud compute addresses create my-external-ip --region=us-central1
# Reserve a global external static IP (for global load balancers)
gcloud compute addresses create my-global-ip --global
# Reserve an internal static IP
gcloud compute addresses create my-internal-ip --region=us-central1 \
--subnet=my-subnet --addresses=10.0.1.50
# List reserved addresses
gcloud compute addresses list
# Assign a reserved external IP to a VM
gcloud compute instances create my-vm --zone=us-central1-a \
--address=my-external-ip
# Promote an ephemeral IP to static
gcloud compute addresses create my-ip --addresses=EPHEMERAL_IP \
--region=us-central1
Exam trap: Reserved static external IPs that are NOT attached to a running resource incur charges. This is a common cost surprise. If a VM is stopped or deleted but its static IP is still reserved, you continue paying for the IP. The exam tests whether you know to release unused static IPs.
Cloud DNS
Cloud DNS is a managed authoritative DNS service.
| Zone Type | Visibility | Use Case |
|---|---|---|
| Public | Internet | Route external traffic to your services |
| Private | Specific VPC networks | Internal DNS resolution within VPCs |
# Create a public managed zone
gcloud dns managed-zones create my-zone \
--dns-name="example.com." --description="My public zone"
# Create a private managed zone
gcloud dns managed-zones create my-private-zone \
--dns-name="internal.example.com." --description="Internal zone" \
--visibility=private --networks=my-vpc
# Add an A record
gcloud dns record-sets create www.example.com. --zone=my-zone \
--type=A --ttl=300 --rrdatas="34.120.1.1"
# Add a CNAME record
gcloud dns record-sets create app.example.com. --zone=my-zone \
--type=CNAME --ttl=300 --rrdatas="my-service.example.com."
# List records in a zone
gcloud dns record-sets list --zone=my-zone
Exam trap: DNS names in Cloud DNS must end with a trailing dot (e.g., `example.com.`). Forgetting the trailing dot is a common configuration error.
Cloud NAT
Cloud NAT provides outbound internet access for resources without external IP addresses. It operates at the VPC network level through a Cloud Router.
# Create a Cloud Router (required for Cloud NAT)
gcloud compute routers create my-router --network=my-vpc --region=us-central1
# Create a Cloud NAT gateway
gcloud compute routers nats create my-nat --router=my-router \
--region=us-central1 --auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges
| Configuration | Description |
|---|---|
| `--auto-allocate-nat-external-ips` | Google automatically assigns external IPs |
| `--nat-external-ip-pool=IP1,IP2` | Use specific reserved static IPs for NAT |
| `--nat-all-subnet-ip-ranges` | Apply NAT to all subnets in the region |
| `--nat-custom-subnet-ip-ranges=SUBNET` | Apply NAT to specific subnets only |
Exam trap: Cloud NAT is outbound only. It does not allow unsolicited inbound connections from the internet. If a question asks about allowing inbound traffic to VMs without external IPs, the answer is an internal load balancer or IAP, not Cloud NAT.
4.6 Monitoring and Logging
Cloud Monitoring Alerting Policies
Cloud Monitoring alerting policies define conditions that trigger notifications. An alerting policy has three components: the condition (what to monitor), the notification channel (how to alert), and the documentation (context for responders).
Alert condition types:
| Condition Type | Triggers When | Duration |
|---|---|---|
| Metric threshold | A metric value exceeds or falls below a threshold for a specified duration | Configurable alignment period |
| Metric absence | A monitored metric stops reporting data | Up to 23.5 hours |
| Forecasted value | A metric is predicted to breach a threshold | 1-7 day prediction window |
| Log-based | A specific log entry pattern is detected | Immediate (rate-limited) |
| SQL-based | A Log Analytics query returns matching results | Public preview |
# Create a notification channel (email)
gcloud beta monitoring channels create \
--display-name="Ops Team Email" \
--type=email \
--channel-labels=email_address=ops@example.com
# List alerting policies
gcloud alpha monitoring policies list
# Describe an alerting policy
gcloud alpha monitoring policies describe POLICY_ID
Key concepts for the exam:
- Notification channels: Email, SMS, Slack, PagerDuty, Pub/Sub, webhooks, and mobile app
- Uptime checks: HTTP(S) or TCP probes from global locations. Can trigger alerts on failure.
- Snooze: Temporarily suppresses notifications without modifying the alert policy
- Incidents: Created automatically when conditions are met; auto-close when conditions resolve
Exam trap: Log-based alerts and metric-based alerts are configured differently. Log-based alerts operate on log entries (Logs Explorer query syntax), while metric-based alerts operate on time-series data. You cannot use metric-threshold conditions to alert on specific log messages.
Custom Metrics
Custom metrics extend Cloud Monitoring beyond built-in GCP metrics. Create them via the Monitoring API or OpenTelemetry.
Custom metric types use the `custom.googleapis.com/` prefix, for example `custom.googleapis.com/my_app/request_latency`. Metric descriptors are created through the Monitoring API (`projects.metricDescriptors.create`) or auto-created the first time a matching time series is written; data points are typically written via client libraries or OpenTelemetry rather than a gcloud command.
Use custom metrics for application-specific KPIs (queue depth, business transactions, cache hit rate) that built-in metrics do not cover.
Log Sinks and Export
Log sinks route log entries to destinations for long-term storage, analysis, or integration with external tools. Every log entry is evaluated by all sinks in the resource; a sink routes entries that match its inclusion filter and do not match any exclusion filters.
Supported destinations:
| Destination | Use Case | Format |
|---|---|---|
| Log buckets | Retention in Cloud Logging, Logs Explorer, Log Analytics | Structured |
| BigQuery | SQL analysis, joining with business data | Streaming inserts |
| Cloud Storage | Long-term archival, compliance | JSON files (batched hourly) |
| Pub/Sub | Streaming to external tools (Splunk, Datadog) | JSON messages |
| Google Cloud project | Cross-project log routing | Re-routed through destination sinks |
# Create a sink that exports to Cloud Storage
gcloud logging sinks create my-storage-sink \
storage.googleapis.com/my-log-bucket \
--log-filter='resource.type="gce_instance" AND severity>=ERROR'
# Create a sink that exports to BigQuery
gcloud logging sinks create my-bq-sink \
bigquery.googleapis.com/projects/PROJECT/datasets/my_logs \
--log-filter='resource.type="gce_instance"'
# Create a sink that exports to Pub/Sub
gcloud logging sinks create my-pubsub-sink \
pubsub.googleapis.com/projects/PROJECT/topics/my-topic \
--log-filter='logName="projects/PROJECT/logs/my-app"'
# List sinks
gcloud logging sinks list
# Update a sink filter
gcloud logging sinks update my-storage-sink \
--log-filter='resource.type="gce_instance" AND severity>=WARNING'
System-created sinks (cannot be deleted):
| Sink | Destination | Modifiable? |
|---|---|---|
| `_Required` | `_Required` log bucket | No. Routes Admin Activity, System Event, and Access Transparency logs. |
| `_Default` | `_Default` log bucket | Yes. Routes everything else; you can add exclusion filters to reduce costs. |
Exam trap: After creating a sink, you must grant the sink's service account writer permissions on the destination; the sink creation output displays the service account. For BigQuery, grant `roles/bigquery.dataEditor`; for Cloud Storage, `roles/storage.objectCreator`; for Pub/Sub, `roles/pubsub.publisher`.
Organization and folder sinks:
- Non-intercepting (default): Routes matching logs but lets child resource sinks also process them
- Intercepting: Blocks matching log entries from flowing to child resource sinks (except `_Required`). Useful for centralized logging where you do not want projects to also retain logs.
Log Buckets, Analytics, and Routers
| Concept | Description |
|---|---|
| Log bucket | Storage container for log entries in Cloud Logging. Each project has _Required (400-day retention, immutable) and _Default (30-day default retention, modifiable). |
| Custom log bucket | User-created bucket with configurable retention (1-3650 days). Can enable Log Analytics for SQL querying. |
| Log Analytics | Enables BigQuery-compatible SQL queries directly on log buckets without exporting. Requires upgrading the bucket. |
| Log router | The pipeline that evaluates all sinks against incoming log entries. Processes every entry, applies inclusion/exclusion filters, and routes to matched destinations. |
# Create a custom log bucket with 90-day retention
gcloud logging buckets create my-custom-bucket \
--location=us-central1 --retention-days=90
# Enable Log Analytics on a bucket
gcloud logging buckets update my-custom-bucket \
--location=us-central1 --enable-analytics
# Update the _Default bucket retention to 90 days
gcloud logging buckets update _Default --location=global --retention-days=90
Exam trap: The `_Required` bucket has a fixed 400-day retention that cannot be changed. The `_Default` bucket has a 30-day default retention that CAN be customized. If a question asks about retaining Admin Activity audit logs for more than 400 days, create a sink that exports them to Cloud Storage or BigQuery.
Logs Explorer Filtering
Logs Explorer uses the Cloud Logging query language to filter log entries.
Common filter patterns for the exam:
# Filter by resource type
resource.type="gce_instance"
# Filter by severity
severity>=ERROR
# Filter by log name
logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
# Filter by text payload
textPayload:"error"
# Filter by JSON payload field
jsonPayload.status=500
# Combine conditions (AND is implicit between lines)
resource.type="gce_instance"
severity>=WARNING
resource.labels.instance_id="1234567890"
# Time range
timestamp>="2026-02-25T00:00:00Z"
timestamp<"2026-02-26T00:00:00Z"
Exam trap: In Logs Explorer, separate lines are implicitly joined with AND. To use OR, you must write it explicitly: `severity=ERROR OR severity=CRITICAL`. Newline-separated conditions are always AND.
Ops Agent and Managed Service for Prometheus
Ops Agent is the unified agent for collecting both metrics and logs from Compute Engine VMs. It replaces the legacy Monitoring and Logging agents.
# Install Ops Agent on a VM (run from within the VM)
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
# Verify the agent is running
sudo systemctl status google-cloud-ops-agent
| Agent | Collects | Status |
|---|---|---|
| Ops Agent | Metrics + logs (unified) | Current, recommended |
| Legacy Monitoring Agent | Metrics only | Deprecated |
| Legacy Logging Agent | Logs only | Deprecated |
Managed Service for Prometheus provides a fully managed, multi-cloud Prometheus-compatible monitoring solution. It stores metrics in Cloud Monitoring and supports PromQL queries.
- Collects Prometheus metrics from GKE workloads and Compute Engine
- Data stored in Monarch (Google's global monitoring backend)
- Query using PromQL in Cloud Monitoring or Grafana
Exam trap: The Ops Agent must be installed on every VM you want to monitor. It is NOT installed by default. GKE nodes use a different mechanism (built-in integration with Cloud Monitoring). If a question asks about missing VM metrics/logs, checking whether the Ops Agent is installed is the first troubleshooting step.
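A minimal troubleshooting sketch for that first step, run on the VM itself (assumes a systemd-based image):

```shell
# Confirm the Ops Agent service is active (prints "active" when healthy)
sudo systemctl is-active google-cloud-ops-agent
# Review the agent's own logs for configuration errors
sudo journalctl -u google-cloud-ops-agent --no-pager -n 20
```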
Audit Logs
Cloud Audit Logs record administrative activities and data access for Google Cloud resources. There are four types:
| Log Type | What It Records | Enabled by Default | Can Be Disabled | Default Retention (Bucket) |
|---|---|---|---|---|
| Admin Activity | Configuration changes (create/delete/update resources, set IAM policies) | Yes | No | 400 days |
| Data Access | API calls that read resource configuration/metadata or read/write user data | No (except BigQuery) | Yes (can be turned off) | 30 days (_Default bucket) |
| System Event | Google-initiated system actions (auto-healing, live migration) | Yes | No | 400 days |
| Policy Denied | Access denied due to VPC Service Controls or Organization Policy violations | Yes | No (but can be excluded from sinks) | 30 days (_Default bucket) |
# View Admin Activity logs
gcloud logging read 'logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"' \
--limit=10
# View Data Access logs
gcloud logging read 'logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Fdata_access"' \
--limit=10
# Enable Data Access logs for all services (project level)
gcloud projects get-iam-policy PROJECT_ID --format=json > policy.json
# Edit policy.json to add auditConfigs, then:
gcloud projects set-iam-policy PROJECT_ID policy.json
Data Access log configuration uses `auditConfigs` in the IAM policy:
{
"auditConfigs": [
{
"service": "allServices",
"auditLogConfigs": [
{ "logType": "ADMIN_READ" },
{ "logType": "DATA_READ" },
{ "logType": "DATA_WRITE" }
]
}
]
}
IAM roles for viewing audit logs:
| Log Type | Required Role |
|---|---|
| Admin Activity, System Event | roles/logging.viewer (Logs Viewer) |
| Data Access, Policy Denied | roles/logging.privateLogViewer (Private Logs Viewer) |
Exam trap: Data Access logs are disabled by default for most services (BigQuery is the notable exception where they are always on). They can generate very high volume and cost. If a question asks about tracking who read specific data, you need to enable Data Access audit logs for that service first.
Exam trap: Admin Activity and System Event logs are stored in the `_Required` bucket with 400-day retention that cannot be modified. Data Access and Policy Denied logs go to the `_Default` bucket with a configurable retention (default 30 days). To retain any logs beyond their bucket retention, export via a sink.
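Exporting for long-term retention follows the standard sink pattern; a sketch with hypothetical names (`audit-archive` sink, `my-audit-archive` Cloud Storage bucket):

```shell
# Create a sink that routes Admin Activity audit logs to a Cloud Storage bucket
gcloud logging sinks create audit-archive \
  storage.googleapis.com/my-audit-archive \
  --log-filter='logName:"cloudaudit.googleapis.com%2Factivity"'
# The sink writes as a dedicated service account; look up its identity
gcloud logging sinks describe audit-archive --format='value(writerIdentity)'
```

Grant that writer identity `roles/storage.objectCreator` on the destination bucket, or the sink will fail to export.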
Quick-Reference: Decision Tree
| Scenario | Correct Approach |
|---|---|
| Need to change a VM's machine type | Stop the VM, then set-machine-type |
| VM has no external IP but needs outbound internet | Configure Cloud NAT on the VPC |
| Need to back up a Compute Engine disk | Create a snapshot (incremental by default) |
| Deploy a new version of a Cloud Run service gradually | Traffic splitting between revisions |
| GKE pods failing to schedule (insufficient resources) | Cluster Autoscaler adds nodes automatically |
| Cloud Run cold start latency is too high | Set --min-instances=1 or higher |
| Move Cloud Storage objects to cheaper storage after 90 days | Lifecycle rule with age condition and SetStorageClass action |
| Estimate BigQuery query cost before running | Use --dry_run flag |
| Need VM metrics/logs in Cloud Monitoring | Install the Ops Agent |
| Export logs to Splunk | Create a Pub/Sub sink, connect Splunk to the Pub/Sub subscription |
| Retain Admin Activity logs beyond 400 days | Create a sink to Cloud Storage or BigQuery |
| Track who accessed sensitive data in Cloud Storage | Enable Data Access audit logs for Cloud Storage |
| Alert when a specific error appears in logs | Create a log-based alerting policy |
| Alert when CPU exceeds 80% for 5 minutes | Create a metric-threshold alerting policy |
| Reduce Cloud Logging costs | Add exclusion filters to the _Default sink |
| VMs without external IPs need to pull packages | Cloud NAT for outbound internet access |
| Need DNS resolution only within VPCs | Private Cloud DNS zone |
| Reserve an IP for a load balancer | gcloud compute addresses create --global |
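For the logging-cost row above, exclusion filters can be added to the `_Default` sink from the CLI; a sketch with an assumed filter (excluded entries are dropped before ingestion and are not billed):

```shell
# Exclude low-severity entries from the _Default sink to cut ingestion costs
gcloud logging sinks update _Default \
  --add-exclusion=name=drop-debug,filter='severity<=DEBUG'
```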
Common Exam Traps Summary
- Deleting a VM deletes the boot disk by default -- non-boot disks are preserved unless `--delete-disks=all` is specified.
- Machine type changes require stopping the VM -- you cannot resize CPU/memory on a running instance.
- Snapshots are incremental -- only changed blocks are stored after the first snapshot. Deleting an earlier snapshot does not lose data; the incremental chain is automatically reconciled.
- Image families always point to the latest non-deprecated image -- use families in automation for automatic updates.
- HPA and VPA conflict on the same metric -- do not use both to scale on CPU simultaneously.
- Cluster Autoscaler scales on pending pods, not node CPU -- high node utilization alone does not trigger scaling.
- Cloud Run `--to-latest` sends all traffic to new deployments immediately -- for canary deployments, route traffic to specific revision names.
- Lifecycle rule changes take up to 24 hours to take effect -- they are not applied in real time.
- Lifecycle transitions are one-direction only -- Standard to Nearline to Coldline to Archive. You cannot use lifecycle rules to warm up objects.
- Disabling versioning does not delete noncurrent versions -- they persist until explicitly removed.
- Reserved static IPs incur charges when unattached -- release IPs you are not using.
- Subnet IP ranges can be expanded but never shrunk -- plan CIDR ranges carefully.
- Cloud NAT is outbound only -- it does not provide inbound connectivity.
- Ops Agent is not installed by default on VMs -- missing agent is the top reason for absent VM metrics.
- Data Access audit logs are disabled by default (except BigQuery) -- enable them explicitly to track data reads/writes.
- `_Required` bucket: 400-day retention, immutable. `_Default` bucket: 30-day default, configurable -- export to Cloud Storage or BigQuery for longer retention.
- Sink service accounts need destination permissions -- the most common cause of sink failures is missing IAM grants on the destination resource.
- Logs Explorer lines are implicitly AND -- use explicit `OR` for disjunctive filters.
References
- Cloud Monitoring Alerting Overview
- Cloud Logging Export Overview
- Cloud Audit Logs Overview
- Cloud Storage Lifecycle Management
- Compute Engine Snapshots
- GKE Cluster Autoscaler
- Cloud Run Traffic Management
- Cloud DNS Overview
- Cloud NAT Overview
- Managed Service for Prometheus
- GCP Associate Cloud Engineer Certification