Reference

Domain 6: Ensuring Solution and Operations Excellence (~12.5%)

Domain 6 accounts for approximately 12.5% of the Professional Cloud Architect exam (roughly 6-7 questions). The v6.1 exam guide (October 2025) rewrote this domain to align with the Well-Architected Framework operational excellence pillar. Expect questions that test your ability to design monitoring strategies, choose deployment patterns, support production workloads, and evaluate quality control -- all through the lens of SRE principles and operational excellence.


6.1 Monitoring, Logging, Profiling, and Alerting

This is the heaviest sub-domain. The exam tests your ability to design an end-to-end observability strategy using Google Cloud's operations suite (formerly Stackdriver).

Cloud Monitoring

Cloud Monitoring collects metric data and provides dashboards, alerting, and uptime checking across Google Cloud, hybrid, and multi-cloud environments.

Metrics Model

Every metric in Cloud Monitoring has three components: a monitored resource type (what is being measured), a metric type (what measurement), and time series data (the values over time).

Metric Kind Description Example
GAUGE Point-in-time measurement Current CPU utilization
CUMULATIVE Accumulated value over time Total request count since start
DELTA Change over a time interval Bytes sent in last 60 seconds
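The relationship between the three kinds can be shown with a small sketch (the sample values are made up; real time series come from the Cloud Monitoring API):

```python
# Illustrative only: how the three metric kinds relate for one counter.
# (Sample values are hypothetical.)

def deltas(cumulative_samples):
    """Convert CUMULATIVE samples (monotonic totals) into DELTA values per interval."""
    return [b - a for a, b in zip(cumulative_samples, cumulative_samples[1:])]

# Total request count sampled each minute since process start (CUMULATIVE).
cumulative = [0, 120, 250, 250, 410]

# Requests served in each 60-second interval (DELTA).
print(deltas(cumulative))  # [120, 130, 0, 160]

# A GAUGE is just the latest point-in-time reading, e.g. current CPU utilization.
cpu_gauge = 0.42
```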

Metric Sources

Source Collection Method
Google Cloud system metrics Automatic for all GCP services
VM agent metrics Ops Agent on Compute Engine
Custom/user-defined metrics Cloud Monitoring API or OpenTelemetry
Prometheus metrics Managed Service for Prometheus or Ops Agent
Third-party metrics BindPlane for on-prem/hybrid systems
External metrics Third-party providers via API

Metrics Scope: A metrics scope defines which projects' data is visible in a single monitoring view. The scoping project stores alerts, dashboards, and synthetic monitors. You can configure a metrics scope to include time-series data from other Google Cloud projects and from AWS accounts. This is critical for multi-project monitoring architectures.

Metrics Retention: Google Cloud system metrics are retained for approximately 6 weeks at full resolution, then downsampled and kept for up to 24 months. Custom metrics follow the same retention pattern. This is important -- if you need metrics beyond these retention windows, export them to BigQuery or Cloud Storage.

Dashboards: Custom dashboards display charts, tables, logs panels, error groups, alerting policy information, and event annotations. Use Metrics Explorer for ad-hoc investigation.

Query Languages: Cloud Monitoring supports PromQL (Prometheus Query Language) and MQL (Monitoring Query Language). PromQL is the recommended language and the natural choice for teams already using Prometheus; MQL is deprecated in PromQL's favor but still appears in existing alerting policies.
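As an illustration, a request-availability SLI expressed in PromQL might look like the following (the metric name `http_requests_total` is a hypothetical Prometheus counter):

```promql
# Fraction of non-5xx responses over the last 5 minutes.
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```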

Alert Policies

An alerting policy defines conditions (metric thresholds, absence of data, or complex queries), notification channels, and documentation. When a condition is met, Cloud Monitoring creates an incident and sends notifications.

Notification Channels: Email, SMS, Cloud Mobile App, PagerDuty, Slack, webhooks, and Pub/Sub. The exam frequently tests which channel to use for which scenario -- Pub/Sub for programmatic response, PagerDuty for on-call rotation, webhooks for custom integrations.

Uptime Checks: Synthetic monitors probe HTTP, HTTPS, and TCP endpoints from multiple global locations. If an endpoint fails checks from multiple regions, an incident fires. Types include standard uptime checks (HTTP/HTTPS/TCP), SSL certificate checks, and broken-link crawlers.

SLO Monitoring and Error Budgets

Service Monitoring lets you define SLIs (Service Level Indicators) and SLOs (Service Level Objectives) directly in Cloud Monitoring.

Concept Definition Exam Relevance
SLI Quantitative measure of service (e.g., latency, availability) Know the difference between request-based vs. window-based SLIs
SLO Target percentage for the SLI over a compliance period Typical: 99.9% availability over 30 days
Error Budget 100% minus SLO target (e.g., 0.1% allowed downtime) Determines when to freeze deployments vs. push features
Burn Rate Rate at which error budget is being consumed Fast burn triggers immediate alerts; slow burn triggers investigation

Exam trap: The exam tests whether you understand that error budgets are a release velocity tool. When the error budget is exhausted, you stop deploying new features and focus on reliability. When the budget is healthy, you deploy faster. This is core SRE philosophy.

Alert on burn rate, not raw error rate. A 1% error rate might be acceptable if it burns the budget slowly over 30 days. A 0.5% error rate might be alarming if it appeared in a sudden spike. Burn-rate alerts catch both scenarios appropriately.
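The arithmetic behind this can be sketched in a few lines (the SLO target is illustrative, and the 14.4x-over-1-hour fast-burn threshold follows the multiwindow pattern popularized by the Google SRE Workbook):

```python
# Sketch of burn-rate math for a 99.9% availability SLO over 30 days.

SLO = 0.999                      # 99.9% availability target
ERROR_BUDGET = 1 - SLO           # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return observed_error_ratio / ERROR_BUDGET

def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the 30-day error budget consumed at this burn rate."""
    return rate * window_hours / period_hours

# A steady 1% error rate burns the budget 10x too fast: gone in 3 days, not 30.
print(round(burn_rate(0.01), 3))            # 10.0
# A 14.4x burn sustained for 1 hour consumes 2% of the monthly budget --
# the classic fast-burn paging threshold.
print(round(budget_consumed(14.4, 1), 3))   # 0.02
```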

Cloud Logging

Cloud Logging is a fully managed service for storing, searching, analyzing, and alerting on log data.

Log Router Architecture

Every log entry passes through the Log Router, which evaluates sinks to determine where each entry is routed. The router temporarily buffers entries to handle disruptions.

Component Purpose
Inclusion filter Entry must match to be routed (empty = match all)
Exclusion filter Entry is dropped if it matches any exclusion filter
Sink Routes matching entries to a destination

Routing rule: an entry is routed if it matches the inclusion filter AND does not match any exclusion filter.
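The routing rule can be expressed as a minimal sketch, with plain predicates standing in for Logging query-language filters:

```python
# Sketch of the Log Router's sink-evaluation rule: an entry is routed by a
# sink iff it matches the inclusion filter AND matches no exclusion filter.

def is_routed(entry, inclusion=None, exclusions=()):
    # An empty (None) inclusion filter matches every entry.
    if inclusion is not None and not inclusion(entry):
        return False
    # Any matching exclusion filter drops the entry.
    return not any(excl(entry) for excl in exclusions)

entry = {"severity": "DEBUG", "resource": "gce_instance"}
drop_debug = lambda e: e["severity"] == "DEBUG"

print(is_routed(entry))                           # True: no filters at all
print(is_routed(entry, exclusions=[drop_debug]))  # False: excluded
```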

System-Created Sinks

Sink Destination Bucket Retention Modifiable?
_Required _Required bucket 400 days (fixed) Cannot be modified or deleted
_Default _Default bucket 30 days (configurable) Can be modified or disabled

The _Required sink captures Admin Activity audit logs, System Event audit logs, and Access Transparency logs. These cannot be excluded or redirected -- they always go to the _Required bucket.

Custom Log Buckets: You can create up to 100 buckets per project. Retention is configurable between 1 and 3,650 days. Bucket region cannot be changed after creation. CMEK encryption is available. Buckets can be linked to a BigQuery dataset for SQL-based analysis via Log Analytics.

Exam trap: Retention costs apply to logs retained longer than the default retention period (effective April 2023). The exam may present cost-optimization scenarios where reducing retention or excluding verbose logs saves money.

Sink Destinations

Destination Use Case Key Detail
Log bucket Query via Logs Explorer, Log Analytics Default and primary destination
BigQuery dataset Ad-hoc SQL analytics on logs Dataset must exist first; grant the sink's writer identity access
Cloud Storage Long-term archival, compliance JSON format, written in hourly batches; cheapest for cold storage
Pub/Sub topic Real-time streaming, third-party SIEM export Event-driven processing (e.g., Splunk, Datadog)
Another GCP project Centralized logging across projects One-hop limit; cannot chain project-to-project

Aggregated Sinks: At the organization or folder level, aggregated sinks route logs from all child resources. Two types:

  • Non-intercepting: Routes a copy to the destination; entries still flow to project-level sinks.
  • Intercepting: Captures entries and blocks them from reaching child resource sinks (except _Required).

Exam trap: Log sinks do NOT retroactively route entries that existed before the sink was created. If the exam asks about exporting historical logs, the answer involves BigQuery linked datasets or Cloud Storage exports from existing buckets -- not creating a new sink.

Log-Based Metrics: Create custom metrics from log entries using filters. These metrics appear in Cloud Monitoring and can trigger alert policies. Two types: counter metrics (count matching entries) and distribution metrics (extract numeric values from log entries).
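The two metric types can be illustrated with a sketch over hypothetical log entries (the field names are made up; real filters use the Logging query language):

```python
# Sketch of the two log-based metric types: a counter counts matching
# entries; a distribution extracts a numeric value from each match.

entries = [
    {"message": "served request", "latency_ms": 120},
    {"message": "served request", "latency_ms": 340},
    {"message": "cache miss"},
]

matches = [e for e in entries if e["message"] == "served request"]

counter_metric = len(matches)                             # 2 entries matched
distribution_metric = [e["latency_ms"] for e in matches]  # [120, 340]

print(counter_metric, distribution_metric)
```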

Cloud Trace

Cloud Trace is a distributed tracing system for latency analysis across microservices.

Concept Description
Trace Complete request flow through all services
Span Individual operation within a trace
Trace context Propagated header linking spans across services

Instrumentation: App Engine standard, Cloud Run, and Cloud Run functions provide automatic tracing for HTTP requests. All other environments require manual instrumentation via OpenTelemetry (recommended) or Cloud Trace client libraries. OpenTelemetry implements batching for better performance.
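How trace context links spans across services can be sketched with the W3C `traceparent` header format, which is what OpenTelemetry propagates (the helper functions here are illustrative, not a real client library):

```python
# Sketch of trace-context propagation: a child span carries the parent's
# trace ID in a W3C `traceparent` header: version-traceid-spanid-flags.
import secrets

def make_traceparent(sampled=True):
    trace_id = secrets.token_hex(16)   # 32 hex chars: the whole request
    span_id = secrets.token_hex(8)     # 16 hex chars: this operation
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_of(traceparent):
    """New span in the same trace: same trace ID, fresh span ID."""
    _, trace_id, _, flags = traceparent.split("-")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

parent = make_traceparent()
child = child_of(parent)
# Both spans share one trace ID, so Cloud Trace stitches them into one trace.
print(parent.split("-")[1] == child.split("-")[1])   # True
```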

Supported languages: C++, Go, Java, Node.js, Python, Ruby, C#.

Cloud Profiler

Cloud Profiler continuously analyzes CPU and memory usage in production workloads with minimal overhead (~0.5% CPU). It identifies hotspots in application code without affecting performance.

Exam relevance: When a question describes slow application performance and asks for root-cause analysis, Cloud Profiler identifies which functions consume the most CPU/memory. Cloud Trace identifies which services add the most latency. These are complementary -- Trace for inter-service latency, Profiler for intra-service code hotspots.

Well-Architected Framework: Operational Excellence

The operational excellence pillar defines five principles:

  1. Operational readiness and performance -- Define SLOs; ensure solutions meet operational requirements.
  2. Incident and problem management -- Minimize impact through observability, response procedures, and preventive measures.
  3. Resource optimization -- Right-sizing, autoscaling, cost monitoring.
  4. Automation and change management -- Automate to eliminate toil; streamline change processes.
  5. Continuous improvement -- Ongoing enhancements driven by data.

The exam may ask you to identify which principle applies to a given scenario. Automation eliminates toil. SLOs drive operational readiness. Post-mortems drive continuous improvement.


6.2 Deployment and Release Management

Cloud Deploy

Cloud Deploy is a managed continuous delivery service that automates application deployment to a series of target environments in a defined promotion sequence.

Core Components

Component Description
Delivery Pipeline YAML-defined promotion sequence across targets
Target Deployment destination: GKE cluster, Cloud Run service/job, or GKE attached cluster
Release Rendered manifests for each target; created when you initiate a deployment
Rollout Associates a release with a specific target; executes the actual deployment
Skaffold Handles rendering, deployment, and verification (required even if not used locally)

Deployment Workflow:

  1. Define the delivery pipeline YAML with its promotion sequence.
  2. Define targets, inline or in separate files.
  3. Register the pipeline and targets with Cloud Deploy.
  4. CI triggers release creation via the API.
  5. Promotions advance the release through targets sequentially.

Approval Gates: Set requireApproval: true on targets. Generates Pub/Sub messages for external approval workflows. This is how you enforce manual sign-off before production deployment.

Automation Rules: Cloud Deploy supports rules-based promotion and advancement without manual intervention. Combine with approval gates for a hybrid model: auto-promote through dev/staging, require approval for production.
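The hybrid model above might look like the following sketch (pipeline, target, project, and cluster names are hypothetical; field names follow the Cloud Deploy configuration schema):

```yaml
# Hypothetical delivery pipeline: auto-promote through dev and staging,
# manual approval gate before prod.
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: web-app-pipeline
serialPipeline:
  stages:
  - targetId: dev
  - targetId: staging
  - targetId: prod
---
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: prod
requireApproval: true        # blocks promotion until explicitly approved
gke:
  cluster: projects/my-project/locations/us-central1/clusters/prod-cluster
```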

Deployment Strategies

Strategy How It Works Risk Level Rollback Speed
Standard Deploy new version directly, replacing old Highest Redeploy previous release
Canary Deploy to a percentage of infrastructure first, then expand Low Route traffic back to stable version
Blue-Green (GKE/Cloud Run native) Run two identical environments; switch traffic Low Switch traffic back to blue
Rolling Update (GKE native) Replace pods incrementally Medium Rolling back pod-by-pod

Cloud Deploy supports standard and canary strategies natively. Blue-green and rolling updates are configured at the platform level (GKE Deployment strategy or Cloud Run traffic splitting) rather than in Cloud Deploy pipeline definitions.

Canary specifics: Traffic percentages must be whole numbers. On first deployment to a target, canary phases may be skipped since no existing version exists to split traffic with.

Exam trap: The exam may present a scenario requiring zero-downtime deployment. Blue-green gives the cleanest cutover (instant traffic switch). Canary gives the safest progressive validation. Rolling updates are the GKE default but do not allow easy instant rollback. Know which to recommend based on the scenario's priorities.

Rollback Strategies

Cloud Deploy rollback redeploys the last successful release using identical parameters. Key points:

  • Rollbacks create a new rollout (they do not revert state).
  • Automated rollback can be triggered by deployment verification failure.
  • Rollback is per-target, not pipeline-wide.

Feature Flags and Traffic Splitting

Cloud Run supports traffic splitting natively between revisions. This enables:

  • Gradual rollout (send 5% to new revision, monitor, increase).
  • A/B testing (split traffic between feature variants).
  • Instant rollback (shift 100% traffic back to previous revision).

GKE supports traffic splitting via Istio/Anthos Service Mesh or Gateway API.
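How a percentage split resolves per request can be sketched as a weighted choice (revision names are hypothetical; the real routing is done by the Cloud Run or service-mesh data plane):

```python
# Sketch of per-request traffic splitting: each revision gets a weight and
# the router picks proportionally.
import random

def pick_revision(weights, rnd=random.random):
    """weights: dict of revision name -> percent, summing to 100."""
    point = rnd() * 100
    cumulative = 0
    for revision, pct in weights.items():
        cumulative += pct
        if point < cumulative:
            return revision
    return revision  # guard against float edge at exactly 100

split = {"myapp-v2": 5, "myapp-v1": 95}          # gradual rollout: 5% canary
print(pick_revision(split, rnd=lambda: 0.01))    # myapp-v2 (first 5%)
print(pick_revision(split, rnd=lambda: 0.50))    # myapp-v1

# Instant rollback is just a new split: {"myapp-v1": 100}.
```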


6.3 Supporting Deployed Solutions

Uptime Checks and Health Metrics

Configure uptime checks to probe endpoints from multiple global regions. Types:

Check Type Protocol Use Case
HTTP/HTTPS HTTP GET/POST Web application availability
TCP TCP connection Database or service port availability
SSL certificate HTTPS Certificate expiration monitoring

Uptime checks run from Google-managed locations. If checks from at least two regions fail, an alert fires. This prevents false positives from single-region network issues.

Error Reporting

Error Reporting aggregates and displays errors from Cloud Logging, App Engine, Cloud Run, Cloud Run functions, and Compute Engine. It groups similar errors, tracks first/last occurrence, and provides stack traces. Key for incident investigation -- when the exam asks "how to quickly identify the most frequent errors," Error Reporting is the answer.

Google Cloud Support Tiers

Tier P1 Response Time Key Features Pricing Model
Basic N/A (no case support) Documentation, community forums Free
Standard P2: 4 business hours (no P1 coverage) Unlimited cases, business hours Percentage of spend (3%, $29/mo min)
Enhanced 1 hour (24/7) Third-party tech support, Training API Percentage of spend (min. commitment)
Premium 15 minutes (24/7) Named TAM, Event Management, training credits Percentage of spend (higher min.)

Exam trap: The exam frequently tests Premium vs. Enhanced. Premium includes a named Technical Account Manager (TAM) and 15-minute P1 response. Enhanced provides 1-hour P1 response but no dedicated TAM. If the question mentions "dedicated technical advisor" or "event management support," the answer is Premium.

Incident Response and Post-Mortems

Google's SRE model for incident management:

  1. Detect -- Monitoring alerts (burn-rate or threshold-based).
  2. Triage -- Assess severity, assign incident commander.
  3. Mitigate -- Restore service (rollback, failover, scale up).
  4. Resolve -- Fix root cause.
  5. Post-mortem -- Blameless review documenting timeline, root cause, action items.

Exam trap: Post-mortems must be blameless. The exam will present scenarios where a team member caused an outage. The correct answer focuses on process improvement, not individual blame. Action items should prevent recurrence through automation, alerts, or architecture changes.


6.4 Evaluating Quality Control Measures

Pre-Deploy Quality Assurance

QA Method Stage Tool/Service
Unit tests Code commit Cloud Build trigger
Integration tests Build pipeline Cloud Build + test containers
Load testing Pre-production Third-party tools such as Locust or k6 (often run on GKE)
Security scanning Build pipeline Artifact Analysis for container vulnerability scanning
Manual approval gates Pre-production Cloud Deploy requireApproval on target

Exam relevance: When the exam asks about "shift-left" testing, it means moving quality checks earlier in the pipeline. Unit tests at commit time, integration tests at build time, and security scanning before deployment to any environment.

Post-Deploy Quality Assurance

Method Tool What It Measures
SLO monitoring Service Monitoring (Cloud Monitoring) Whether the service meets its reliability targets
Error budget burn rate Cloud Monitoring alerts How fast reliability margin is being consumed
Deployment verification Cloud Deploy verify step Whether the new version passes automated health checks
Canary analysis Cloud Deploy canary metrics Whether canary traffic shows degradation vs. baseline
Synthetic monitoring Uptime checks Whether endpoints remain accessible and performant

Rollback vs. Hold-Release Decisions

The exam tests your judgment on when to roll back vs. when to hold and fix forward:

Scenario Action Reasoning
Error budget nearly exhausted, new deployment increases errors Roll back immediately Protect remaining error budget
Error budget healthy, minor degradation in canary Hold release, investigate Budget allows investigation time
Critical security patch with minor performance regression Push forward Security risk outweighs performance cost
Canary shows data corruption Roll back immediately Data integrity is non-negotiable
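The decision table above can be encoded as a sketch, with the inputs simplified to booleans (this is a study aid, not a real policy engine):

```python
# The rollback-vs-hold decision table, encoded as a function.

def release_action(budget_healthy, errors_increasing,
                   data_corruption=False, security_critical=False):
    if data_corruption:
        return "rollback"              # data integrity is non-negotiable
    if security_critical:
        return "push forward"          # security risk outweighs minor regressions
    if errors_increasing and not budget_healthy:
        return "rollback"              # protect the remaining error budget
    if errors_increasing and budget_healthy:
        return "hold and investigate"  # budget buys investigation time
    return "promote"

print(release_action(budget_healthy=False, errors_increasing=True))  # rollback
print(release_action(budget_healthy=True, errors_increasing=True))   # hold and investigate
```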

Continuous Verification

Cloud Deploy supports deployment verification -- automated tests that run after deployment to confirm the release is healthy. If verification fails, the rollout can be automatically rolled back.

Pattern: CI/CD triggers release, Cloud Deploy deploys to canary percentage, verification tests run against canary, if tests pass then promote to full deployment, if tests fail then rollback automatically. This is the gold standard for automated quality gates.
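The pattern above can be sketched with the deploy, verify, promote, and rollback steps stubbed as plain callables (in Cloud Deploy these would be canary rollout phases and a verify step):

```python
# Sketch of the canary-verify-promote pattern with automated rollback.

def progressive_rollout(deploy, verify, promote, rollback, canary_phases=(10, 50)):
    """Advance through canary phases; roll back on the first failed verification."""
    for pct in canary_phases:
        deploy(pct)
        if not verify():
            rollback()
            return "rolled back"
    promote()                     # all canary phases verified: full deployment
    return "promoted"

log = []
result = progressive_rollout(
    deploy=lambda pct: log.append(f"deploy {pct}%"),
    verify=lambda: True,
    promote=lambda: log.append("promote"),
    rollback=lambda: log.append("rollback"),
)
print(result, log)   # promoted ['deploy 10%', 'deploy 50%', 'promote']
```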


Key Exam Strategies for Domain 6

  1. Monitoring vs. Logging vs. Tracing vs. Profiling: Know which tool answers which question. Monitoring = "is my service healthy?" Logging = "what happened?" Trace = "where is the latency?" Profiler = "which code is slow?"

  2. SLO/SLI/Error Budget: This is core SRE and heavily tested. Error budgets drive deployment velocity decisions. Burn-rate alerts are preferred over static threshold alerts.

  3. Log Router flow: Understand that all logs pass through the Log Router, _Required sink cannot be modified, and sinks do not apply retroactively.

  4. Cloud Deploy promotion model: Releases promote through targets sequentially. Approval gates block promotion until explicitly approved. Automation rules can auto-promote non-production targets.

  5. Support tiers: Premium = 15-min P1 + TAM. Enhanced = 1-hour P1. Standard = P2 only (4 hours, business hours). Basic = no case support.

  6. Blameless post-mortems: Always the correct answer when the exam asks about incident review culture.

