Reference

Domain 6: Ensuring Solution and Operations Excellence (~12.5%)

Domain 6 accounts for approximately 12.5% of the Professional Cloud Architect exam (roughly 6-7 questions). The v6.1 exam guide (October 2025) rewrote this domain to align with the Well-Architected Framework operational excellence pillar. Expect questions that test your ability to design monitoring strategies, choose deployment patterns, support production workloads, and evaluate quality control -- all through the lens of SRE principles and operational excellence.


6.1 Monitoring, Logging, Profiling, and Alerting

This is the heaviest sub-domain. The exam tests your ability to design an end-to-end observability strategy using Google Cloud's operations suite (formerly Stackdriver).

Cloud Monitoring

Cloud Monitoring collects metric data and provides dashboards, alerting, and uptime checking across Google Cloud, hybrid, and multi-cloud environments.

Metrics Model

Every metric in Cloud Monitoring has three components: a monitored resource type (what is being measured), a metric type (what measurement), and time series data (the values over time).

Metric Kind Description Example
GAUGE Point-in-time measurement Current CPU utilization
CUMULATIVE Accumulated value over time Total request count since start
DELTA Change over a time interval Bytes sent in last 60 seconds
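The relationship between the three kinds can be shown with a small sketch (the sample values are made up; real time series come from the Cloud Monitoring API):

```python
# Illustrative only: how the three metric kinds relate for one counter.
# (Sample values are hypothetical.)

def deltas(cumulative_samples):
    """Convert CUMULATIVE samples (monotonic totals) into DELTA values per interval."""
    return [b - a for a, b in zip(cumulative_samples, cumulative_samples[1:])]

# Total request count sampled each minute since process start (CUMULATIVE).
cumulative = [0, 120, 250, 250, 410]

# Requests served in each 60-second interval (DELTA).
print(deltas(cumulative))  # [120, 130, 0, 160]

# A GAUGE is just the latest point-in-time reading, e.g. current CPU utilization.
cpu_gauge = 0.42
```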

Metric Sources

Source Collection Method
Google Cloud system metrics Automatic for all GCP services
VM agent metrics Ops Agent on Compute Engine
Custom/user-defined metrics Cloud Monitoring API or OpenTelemetry
Prometheus metrics Managed Service for Prometheus or Ops Agent
Third-party metrics BindPlane for on-prem/hybrid systems
External metrics Third-party providers via API

Metrics Scope: A metrics scope defines which projects' data is visible in a single monitoring view. The scoping project stores alerts, dashboards, and synthetic monitors. You can configure a metrics scope to include time-series data from other Google Cloud projects and from AWS accounts. This is critical for multi-project monitoring architectures.

Metrics Retention: Google Cloud system metrics are retained for approximately 6 weeks at full resolution, then downsampled and kept for up to 24 months. Custom metrics follow the same retention pattern. This is important -- if you need metrics beyond these retention windows, export them to BigQuery or Cloud Storage.

Dashboards: Custom dashboards display charts, tables, logs panels, error groups, alerting policy information, and event annotations. Use Metrics Explorer for ad-hoc investigation.

Query Languages: Cloud Monitoring supports PromQL (Prometheus Query Language) and MQL (Monitoring Query Language). PromQL is the recommended language and the natural choice for teams already using Prometheus; MQL is deprecated in PromQL's favor but still appears in existing alerting policies.
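As an illustration, a request-availability SLI expressed in PromQL might look like the following (the metric name `http_requests_total` is a hypothetical Prometheus counter):

```promql
# Fraction of non-5xx responses over the last 5 minutes.
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```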

Alert Policies

An alerting policy defines conditions (metric thresholds, absence of data, or complex queries), notification channels, and documentation. When a condition is met, Cloud Monitoring creates an incident and sends notifications.

Notification Channels: Email, SMS, Cloud Mobile App, PagerDuty, Slack, webhooks, and Pub/Sub. The exam frequently tests which channel to use for which scenario -- Pub/Sub for programmatic response, PagerDuty for on-call rotation, webhooks for custom integrations.

Uptime Checks: Synthetic monitors probe HTTP, HTTPS, and TCP endpoints from multiple global locations. If an endpoint fails checks from multiple regions, an incident fires. Types include standard uptime checks (HTTP/HTTPS/TCP), SSL certificate checks, and broken-link crawlers.

SLO Monitoring and Error Budgets

Service Monitoring lets you define SLIs (Service Level Indicators) and SLOs (Service Level Objectives) directly in Cloud Monitoring.

Concept Definition Exam Relevance
SLI Quantitative measure of service (e.g., latency, availability) Know the difference between request-based vs. window-based SLIs
SLO Target percentage for the SLI over a compliance period Typical: 99.9% availability over 30 days
Error Budget 100% minus SLO target (e.g., 0.1% allowed downtime) Determines when to freeze deployments vs. push features
Burn Rate Rate at which error budget is being consumed Fast burn triggers immediate alerts; slow burn triggers investigation

Exam trap: The exam tests whether you understand that error budgets are a release velocity tool. When the error budget is exhausted, you stop deploying new features and focus on reliability. When the budget is healthy, you deploy faster. This is core SRE philosophy.

Alert on burn rate, not raw error rate. A 1% error rate might be acceptable if it burns the budget slowly over 30 days. A 0.5% error rate might be alarming if it appeared in a sudden spike. Burn-rate alerts catch both scenarios appropriately.
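The arithmetic behind this can be sketched in a few lines (the SLO target is illustrative, and the 14.4x-over-1-hour fast-burn threshold follows the multiwindow pattern popularized by the Google SRE Workbook):

```python
# Sketch of burn-rate math for a 99.9% availability SLO over 30 days.

SLO = 0.999                      # 99.9% availability target
ERROR_BUDGET = 1 - SLO           # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return observed_error_ratio / ERROR_BUDGET

def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the 30-day error budget consumed at this burn rate."""
    return rate * window_hours / period_hours

# A steady 1% error rate burns the budget 10x too fast: gone in 3 days, not 30.
print(round(burn_rate(0.01), 3))            # 10.0
# A 14.4x burn sustained for 1 hour consumes 2% of the monthly budget --
# the classic fast-burn paging threshold.
print(round(budget_consumed(14.4, 1), 3))   # 0.02
```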

Cloud Logging

Cloud Logging is a fully managed service for storing, searching, analyzing, and alerting on log data.

Log Router Architecture

Every log entry passes through the Log Router, which evaluates sinks to determine where each entry is routed. The router temporarily buffers entries to handle disruptions.

Component Purpose
Inclusion filter Entry must match to be routed (empty = match all)
Exclusion filter Entry is dropped if it matches any exclusion filter
Sink Routes matching entries to a destination

Routing rule: an entry is routed if it matches the inclusion filter AND does not match any exclusion filter.
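The routing rule can be expressed as a minimal sketch, with plain predicates standing in for Logging query-language filters:

```python
# Sketch of the Log Router's sink-evaluation rule: an entry is routed by a
# sink iff it matches the inclusion filter AND matches no exclusion filter.

def is_routed(entry, inclusion=None, exclusions=()):
    # An empty (None) inclusion filter matches every entry.
    if inclusion is not None and not inclusion(entry):
        return False
    # Any matching exclusion filter drops the entry.
    return not any(excl(entry) for excl in exclusions)

entry = {"severity": "DEBUG", "resource": "gce_instance"}
drop_debug = lambda e: e["severity"] == "DEBUG"

print(is_routed(entry))                           # True: no filters at all
print(is_routed(entry, exclusions=[drop_debug]))  # False: excluded
```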

System-Created Sinks

Sink Destination Bucket Retention Modifiable?
_Required _Required bucket 400 days (fixed) Cannot be modified or deleted
_Default _Default bucket 30 days (configurable) Can be modified or disabled

The _Required sink captures Admin Activity audit logs, System Event audit logs, and Access Transparency logs. These cannot be excluded or redirected -- they always go to the _Required bucket.

Custom Log Buckets: You can create up to 100 buckets per project. Retention is configurable between 1 and 3,650 days. Bucket region cannot be changed after creation. CMEK encryption is available. Buckets can be linked to a BigQuery dataset for SQL-based analysis via Log Analytics.

Exam trap: Retention costs apply to logs retained longer than the default retention period (effective April 2023). The exam may present cost-optimization scenarios where reducing retention or excluding verbose logs saves money.

Sink Destinations

Destination Use Case Key Detail
Log bucket Query via Logs Explorer, Log Analytics Default and primary destination
BigQuery dataset Ad-hoc SQL analytics on logs Dataset must exist first; grant the sink's writer identity access
Cloud Storage Long-term archival, compliance JSON format, written in hourly batches; cheapest for cold storage
Pub/Sub topic Real-time streaming, third-party SIEM export Event-driven processing (e.g., Splunk, Datadog)
Another GCP project Centralized logging across projects One-hop limit; cannot chain project-to-project

Aggregated Sinks: At the organization or folder level, aggregated sinks route logs from all child resources. Two types:

  • Non-intercepting: Routes a copy to the destination; entries still flow to project-level sinks.
  • Intercepting: Captures entries and blocks them from reaching child resource sinks (except _Required).

Exam trap: Log sinks do NOT retroactively route entries that existed before the sink was created. If the exam asks about exporting historical logs, the answer involves BigQuery linked datasets or Cloud Storage exports from existing buckets -- not creating a new sink.

Log-Based Metrics: Create custom metrics from log entries using filters. These metrics appear in Cloud Monitoring and can trigger alert policies. Two types: counter metrics (count matching entries) and distribution metrics (extract numeric values from log entries).
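The two metric types can be illustrated with a sketch over hypothetical log entries (the field names are made up; real filters use the Logging query language):

```python
# Sketch of the two log-based metric types: a counter counts matching
# entries; a distribution extracts a numeric value from each match.

entries = [
    {"message": "served request", "latency_ms": 120},
    {"message": "served request", "latency_ms": 340},
    {"message": "cache miss"},
]

matches = [e for e in entries if e["message"] == "served request"]

counter_metric = len(matches)                             # 2 entries matched
distribution_metric = [e["latency_ms"] for e in matches]  # [120, 340]

print(counter_metric, distribution_metric)
```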

Cloud Trace

Cloud Trace is a distributed tracing system for latency analysis across microservices.

Concept Description
Trace Complete request flow through all services
Span Individual operation within a trace
Trace context Propagated header linking spans across services

Instrumentation: App Engine standard, Cloud Run, and Cloud Run functions provide automatic tracing for HTTP requests. All other environments require manual instrumentation via OpenTelemetry (recommended) or Cloud Trace client libraries. OpenTelemetry implements batching for better performance.
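How trace context links spans across services can be sketched with the W3C `traceparent` header format, which is what OpenTelemetry propagates (the helper functions here are illustrative, not a real client library):

```python
# Sketch of trace-context propagation: a child span carries the parent's
# trace ID in a W3C `traceparent` header: version-traceid-spanid-flags.
import secrets

def make_traceparent(sampled=True):
    trace_id = secrets.token_hex(16)   # 32 hex chars: the whole request
    span_id = secrets.token_hex(8)     # 16 hex chars: this operation
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_of(traceparent):
    """New span in the same trace: same trace ID, fresh span ID."""
    _, trace_id, _, flags = traceparent.split("-")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

parent = make_traceparent()
child = child_of(parent)
# Both spans share one trace ID, so Cloud Trace stitches them into one trace.
print(parent.split("-")[1] == child.split("-")[1])   # True
```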

Supported languages: C++, Go, Java, Node.js, Python, Ruby, C#.

Cloud Profiler

Cloud Profiler continuously analyzes CPU and memory usage in production workloads with minimal overhead (~0.5% CPU). It identifies hotspots in application code without affecting performance.

Exam relevance: When a question describes slow application performance and asks for root-cause analysis, Cloud Profiler identifies which functions consume the most CPU/memory. Cloud Trace identifies which services add the most latency. These are complementary -- Trace for inter-service latency, Profiler for intra-service code hotspots.

Well-Architected Framework: Operational Excellence

The operational excellence pillar defines five principles:

  1. Operational readiness and performance -- Define SLOs; ensure solutions meet operational requirements.
  2. Incident and problem management -- Minimize impact through observability, response procedures, and preventive measures.
  3. Resource optimization -- Right-sizing, autoscaling, cost monitoring.
  4. Automation and change management -- Automate to eliminate toil; streamline change processes.
  5. Continuous improvement -- Ongoing enhancements driven by data.

The exam may ask you to identify which principle applies to a given scenario. Automation eliminates toil. SLOs drive operational readiness. Post-mortems drive continuous improvement.


6.2 Deployment and Release Management

Cloud Deploy

Cloud Deploy is a managed continuous delivery service that automates application deployment to a series of target environments in a defined promotion sequence.

Core Components

Component Description
Delivery Pipeline YAML-defined promotion sequence across targets
Target Deployment destination: GKE cluster, Cloud Run service/job, or GKE attached cluster
Release Rendered manifests for each target; created when you initiate a deployment
Rollout Associates a release with a specific target; executes the actual deployment
Skaffold Handles rendering, deployment, and verification (required even if not used locally)

Deployment Workflow:

  1. Define the delivery pipeline YAML with its promotion sequence.
  2. Define targets, inline or in separate files.
  3. Register the pipeline and targets with Cloud Deploy.
  4. CI triggers release creation via the API.
  5. Promotions advance the release through targets sequentially.

Approval Gates: Set requireApproval: true on targets. Generates Pub/Sub messages for external approval workflows. This is how you enforce manual sign-off before production deployment.

Automation Rules: Cloud Deploy supports rules-based promotion and advancement without manual intervention. Combine with approval gates for a hybrid model: auto-promote through dev/staging, require approval for production.
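The hybrid model above might look like the following sketch (pipeline, target, project, and cluster names are hypothetical; field names follow the Cloud Deploy configuration schema):

```yaml
# Hypothetical delivery pipeline: auto-promote through dev and staging,
# manual approval gate before prod.
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: web-app-pipeline
serialPipeline:
  stages:
  - targetId: dev
  - targetId: staging
  - targetId: prod
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: prod
requireApproval: true        # blocks promotion until explicitly approved
gke:
  cluster: projects/my-project/locations/us-central1/clusters/prod-cluster
```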

Deployment Strategies

Strategy How It Works Risk Level Rollback Speed
Standard Deploy new version directly, replacing old Highest Redeploy previous release
Canary Deploy to a percentage of infrastructure first, then expand Low Route traffic back to stable version
Blue-Green (GKE/Cloud Run native) Run two identical environments; switch traffic Low Switch traffic back to blue
Rolling Update (GKE native) Replace pods incrementally Medium Rolling back pod-by-pod

Cloud Deploy supports standard and canary strategies natively. Blue-green and rolling updates are configured at the platform level (GKE Deployment strategy or Cloud Run traffic splitting) rather than in Cloud Deploy pipeline definitions.

Canary specifics: Traffic percentages must be whole numbers. On first deployment to a target, canary phases may be skipped since no existing version exists to split traffic with.

Exam trap: The exam may present a scenario requiring zero-downtime deployment. Blue-green gives the cleanest cutover (instant traffic switch). Canary gives the safest progressive validation. Rolling updates are the GKE default but do not allow easy instant rollback. Know which to recommend based on the scenario's priorities.

Rollback Strategies

Cloud Deploy rollback redeploys the last successful release using identical parameters. Key points:

  • Rollbacks create a new rollout (they do not revert state).
  • Automated rollback can be triggered by deployment verification failure.
  • Rollback is per-target, not pipeline-wide.

Feature Flags and Traffic Splitting

Cloud Run supports traffic splitting natively between revisions. This enables:

  • Gradual rollout (send 5% to new revision, monitor, increase).
  • A/B testing (split traffic between feature variants).
  • Instant rollback (shift 100% traffic back to previous revision).

GKE supports traffic splitting via Istio/Anthos Service Mesh or Gateway API.
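How a percentage split resolves per request can be sketched as a weighted choice (revision names are hypothetical; the real routing is done by the Cloud Run or service-mesh data plane):

```python
# Sketch of per-request traffic splitting: each revision gets a weight and
# the router picks proportionally.
import random

def pick_revision(weights, rnd=random.random):
    """weights: dict of revision name -> percent, summing to 100."""
    point = rnd() * 100
    cumulative = 0
    for revision, pct in weights.items():
        cumulative += pct
        if point < cumulative:
            return revision
    return revision  # guard against float edge at exactly 100

split = {"myapp-v2": 5, "myapp-v1": 95}          # gradual rollout: 5% canary
print(pick_revision(split, rnd=lambda: 0.01))    # myapp-v2 (first 5%)
print(pick_revision(split, rnd=lambda: 0.50))    # myapp-v1

# Instant rollback is just a new split: {"myapp-v1": 100}.
```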


6.3 Supporting Deployed Solutions

Uptime Checks and Health Metrics

Configure uptime checks to probe endpoints from multiple global regions. Types:

Check Type Protocol Use Case
HTTP/HTTPS HTTP GET/POST Web application availability
TCP TCP connection Database or service port availability
SSL certificate HTTPS Certificate expiration monitoring

Uptime checks run from Google-managed locations. If checks from at least two regions fail, an alert fires. This prevents false positives from single-region network issues.

Error Reporting

Error Reporting aggregates and displays errors from Cloud Logging, App Engine, Cloud Run, Cloud Run functions, and Compute Engine. It groups similar errors, tracks first/last occurrence, and provides stack traces. Key for incident investigation -- when the exam asks "how to quickly identify the most frequent errors," Error Reporting is the answer.

Google Cloud Support Tiers

Tier P1 Response Time Key Features Pricing Model
Basic N/A (no case support) Documentation, community forums Free
Standard P2: 4 business hours (no P1 coverage) Unlimited cases, business hours Percentage of spend (3%, $29/mo min)
Enhanced 1 hour (24/7) Third-party tech support, Training API Percentage of spend (min. commitment)
Premium 15 minutes (24/7) Named TAM, Event Management, training credits Percentage of spend (higher min.)

Exam trap: The exam frequently tests Premium vs. Enhanced. Premium includes a named Technical Account Manager (TAM) and 15-minute P1 response. Enhanced provides 1-hour P1 response but no dedicated TAM. If the question mentions "dedicated technical advisor" or "event management support," the answer is Premium.

Incident Response and Post-Mortems

Google's SRE model for incident management:

  1. Detect -- Monitoring alerts (burn-rate or threshold-based).
  2. Triage -- Assess severity, assign incident commander.
  3. Mitigate -- Restore service (rollback, failover, scale up).
  4. Resolve -- Fix root cause.
  5. Post-mortem -- Blameless review documenting timeline, root cause, action items.

Exam trap: Post-mortems must be blameless. The exam will present scenarios where a team member caused an outage. The correct answer focuses on process improvement, not individual blame. Action items should prevent recurrence through automation, alerts, or architecture changes.


6.4 Evaluating Quality Control Measures

Pre-Deploy Quality Assurance

QA Method Stage Tool/Service
Unit tests Code commit Cloud Build trigger
Integration tests Build pipeline Cloud Build + test containers
Load testing Pre-production Third-party tools such as Locust or k6 (often run on GKE)
Security scanning Build pipeline Artifact Analysis for container vulnerability scanning
Manual approval gates Pre-production Cloud Deploy requireApproval on target

Exam relevance: When the exam asks about "shift-left" testing, it means moving quality checks earlier in the pipeline. Unit tests at commit time, integration tests at build time, and security scanning before deployment to any environment.

Post-Deploy Quality Assurance

Method Tool What It Measures
SLO monitoring Service Monitoring (Cloud Monitoring) Whether the service meets its reliability targets
Error budget burn rate Cloud Monitoring alerts How fast reliability margin is being consumed
Deployment verification Cloud Deploy verify step Whether the new version passes automated health checks
Canary analysis Cloud Deploy canary metrics Whether canary traffic shows degradation vs. baseline
Synthetic monitoring Uptime checks Whether endpoints remain accessible and performant

Rollback vs. Hold-Release Decisions

The exam tests your judgment on when to roll back vs. when to hold and fix forward:

Scenario Action Reasoning
Error budget nearly exhausted, new deployment increases errors Roll back immediately Protect remaining error budget
Error budget healthy, minor degradation in canary Hold release, investigate Budget allows investigation time
Critical security patch with minor performance regression Push forward Security risk outweighs performance cost
Canary shows data corruption Roll back immediately Data integrity is non-negotiable
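The decision table above can be encoded as a sketch, with the inputs simplified to booleans (this is a study aid, not a real policy engine):

```python
# The rollback-vs-hold decision table, encoded as a function.

def release_action(budget_healthy, errors_increasing,
                   data_corruption=False, security_critical=False):
    if data_corruption:
        return "rollback"              # data integrity is non-negotiable
    if security_critical:
        return "push forward"          # security risk outweighs minor regressions
    if errors_increasing and not budget_healthy:
        return "rollback"              # protect the remaining error budget
    if errors_increasing and budget_healthy:
        return "hold and investigate"  # budget buys investigation time
    return "promote"

print(release_action(budget_healthy=False, errors_increasing=True))  # rollback
print(release_action(budget_healthy=True, errors_increasing=True))   # hold and investigate
```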

Continuous Verification

Cloud Deploy supports deployment verification -- automated tests that run after deployment to confirm the release is healthy. If verification fails, the rollout can be automatically rolled back.

Pattern: CI/CD triggers release, Cloud Deploy deploys to canary percentage, verification tests run against canary, if tests pass then promote to full deployment, if tests fail then rollback automatically. This is the gold standard for automated quality gates.
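The pattern above can be sketched with the deploy, verify, promote, and rollback steps stubbed as plain callables (in Cloud Deploy these would be canary rollout phases and a verify step):

```python
# Sketch of the canary-verify-promote pattern with automated rollback.

def progressive_rollout(deploy, verify, promote, rollback, canary_phases=(10, 50)):
    """Advance through canary phases; roll back on the first failed verification."""
    for pct in canary_phases:
        deploy(pct)
        if not verify():
            rollback()
            return "rolled back"
    promote()                     # all canary phases verified: full deployment
    return "promoted"

log = []
result = progressive_rollout(
    deploy=lambda pct: log.append(f"deploy {pct}%"),
    verify=lambda: True,
    promote=lambda: log.append("promote"),
    rollback=lambda: log.append("rollback"),
)
print(result, log)   # promoted ['deploy 10%', 'deploy 50%', 'promote']
```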


Key Exam Strategies for Domain 6

  1. Monitoring vs. Logging vs. Tracing vs. Profiling: Know which tool answers which question. Monitoring = "is my service healthy?" Logging = "what happened?" Trace = "where is the latency?" Profiler = "which code is slow?"

  2. SLO/SLI/Error Budget: This is core SRE and heavily tested. Error budgets drive deployment velocity decisions. Burn-rate alerts are preferred over static threshold alerts.

  3. Log Router flow: Understand that all logs pass through the Log Router, _Required sink cannot be modified, and sinks do not apply retroactively.

  4. Cloud Deploy promotion model: Releases promote through targets sequentially. Approval gates block promotion until explicitly approved. Automation rules can auto-promote non-production targets.

  5. Support tiers: Premium = 15-min P1 + TAM. Enhanced = 1-hour P1. Standard = P2 only (4 hours, business hours). Basic = no case support.

  6. Blameless post-mortems: Always the correct answer when the exam asks about incident review culture.

