
Domain 4: Analyzing and Optimizing Technical and Business Processes (~15%)

Domain 4 accounts for approximately 15% of the Professional Cloud Architect exam, translating to roughly 8-9 questions. This domain tests your ability to design CI/CD pipelines, analyze costs, make technology trade-off decisions, and build reliable systems using SRE principles. It bridges the gap between pure technical architecture (Domains 1-3) and real-world operational excellence. Expect scenario questions that require you to choose the right deployment strategy, justify a cost decision, or design a reliability workflow.


4.1 Analyzing and Defining Technical Processes

Software Development Lifecycle (SDLC) on GCP

The exam expects you to understand how development workflows map to GCP services across the full lifecycle. The canonical GCP CI/CD toolchain follows this flow:

Code → Cloud Source Repos / GitHub → Cloud Build (CI) → Artifact Registry → Cloud Deploy (CD) → GKE / Cloud Run

| SDLC Phase | GCP Service(s) | Purpose |
|---|---|---|
| Development | Cloud Workstations, Cloud Code, Cloud Shell | IDE environments with GCP integration |
| Source Control | Cloud Source Repositories; GitHub/GitLab/Bitbucket integration | Version control and trigger source |
| Build & Test | Cloud Build | Compile, test, containerize |
| Artifact Storage | Artifact Registry | Store container images, language packages, OS packages |
| Delivery | Cloud Deploy | Promote releases through environments |
| Runtime | GKE, Cloud Run, App Engine, Compute Engine | Production workloads |
| Monitoring | Cloud Monitoring, Cloud Logging, Cloud Trace | Observability and alerting |

Cloud Build: Continuous Integration

Cloud Build is Google Cloud's fully managed CI/CD platform. It executes builds on Google's infrastructure, imports source code from repositories or Cloud Storage, and produces artifacts like Docker images or compiled binaries. Cloud Build aligns with SLSA level 3 requirements for supply chain security.

Build configuration is defined in cloudbuild.yaml (or JSON). Each build consists of ordered build steps, where each step runs as a Docker container instance. Steps communicate via a local Docker network named cloudbuild and share data through a persistent /workspace volume.
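A minimal cloudbuild.yaml illustrating these mechanics -- the Artifact Registry repository name, region, and the _ENV substitution below are illustrative placeholders:

```yaml
# Minimal Cloud Build config: each step runs in its own container,
# and steps share the persistent /workspace volume.
steps:
  # Step 1: run unit tests using a cloud-provided builder image
  - name: 'gcr.io/cloud-builders/npm'
    args: ['test']
  # Step 2: build the container image, tagged with the commit SHA
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t',
           'us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/my-app:$COMMIT_SHA', '.']
substitutions:
  _ENV: 'dev'  # custom substitutions must start with an underscore
images:
  # Images listed here are pushed to Artifact Registry on success
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/my-app:$COMMIT_SHA'
```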

Key concepts:

| Concept | Description | Exam Relevance |
|---|---|---|
| Build steps | Sequential Docker container executions; cloud-provided, community, or custom | Know that each step is an isolated container |
| Triggers | Automate builds on code changes from GitHub, GitLab, Bitbucket, Cloud Source Repos, Pub/Sub, webhooks, or schedules | Know trigger types and when to use each |
| Substitutions | Variable replacement in build configs ($PROJECT_ID, $COMMIT_SHA, custom _VARIABLES) | Used for parameterizing builds across environments |
| Worker pools | Default (managed, public internet access) or private (VPC-connected, custom machine types) | Private pools for builds needing internal network access |
| Build provenance | Verifiable metadata: image digests, source locations, toolchain, duration | Critical for Binary Authorization and supply chain security |
| Ephemeral environments | New VM per build, destroyed after completion | Security benefit: no cross-build contamination |

Builder images: Cloud Build provides pre-built builder images (gcr.io/cloud-builders/*) for common tasks (docker, gcloud, npm, maven, gradle, kubectl). Standard Docker Hub images are also supported.

Exam trap: Cloud Build uses Docker engine to execute builds, but Cloud Build itself is not a Docker registry. Artifacts are pushed to Artifact Registry (not Container Registry, which is deprecated). If an exam question mentions storing container images, the answer is Artifact Registry.

Artifact Registry

Artifact Registry is the universal artifact management service, replacing the legacy Container Registry. It stores:

  • Container images (Docker, OCI)
  • Language packages (Maven, npm, Python, Go, NuGet)
  • OS packages (Debian, RPM)
  • Helm charts

Key features for the exam: regional and multi-regional repositories, IAM-based access control, vulnerability scanning via Artifact Analysis, and integration with Binary Authorization for deployment policy enforcement.

Exam trap: Container Registry (gcr.io) is deprecated. Always choose Artifact Registry ({region}-docker.pkg.dev) in exam answers. They are not the same service.

Cloud Deploy: Continuous Delivery

Cloud Deploy is a managed continuous delivery service that automates application delivery through a defined promotion sequence. It is not a CI tool -- it handles the CD side, taking built artifacts and deploying them through environments.

Core terminology:

| Term | Definition |
|---|---|
| Delivery Pipeline | Defines the ordered promotion sequence of targets and metadata |
| Target | A deployment destination: GKE cluster, Cloud Run service, or Anthos cluster |
| Release | A rendered manifest snapshot tied to specific container image versions; created via Skaffold |
| Rollout | The execution of deploying a release to a specific target |
| Promotion | Moving a release from one target to the next in the pipeline sequence |
| Approval | A gate (requireApproval: true) that blocks promotion until explicitly approved |

Deployment flow:

  1. CI pipeline (Cloud Build) creates a release referencing built artifacts
  2. Cloud Deploy automatically rolls out to the first target (e.g., dev)
  3. Operator promotes the release to subsequent targets (staging, production)
  4. Each promotion creates a new rollout for that target
  5. Targets with requireApproval block until an authorized user approves
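
Putting the approval gate into configuration, a production target might look like this sketch (the project, region, and cluster names are placeholders):

```yaml
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: prod
description: production cluster
requireApproval: true  # promotions to this target block until approved
gke:
  cluster: projects/my-project/locations/us-central1/clusters/prod-cluster
```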

Skaffold integration: Cloud Deploy uses Skaffold under the hood for manifest rendering and deployment execution. A skaffold.yaml is required alongside your Kubernetes manifests or Cloud Run service definitions.
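A minimal skaffold.yaml pairing with per-stage profiles might look like the following sketch (the apiVersion and manifest paths are illustrative and vary by Skaffold version):

```yaml
apiVersion: skaffold/v4beta7
kind: Config
manifests:
  rawYaml:
    - k8s/base/*.yaml      # default manifests
profiles:
  - name: dev              # selected by a pipeline stage's profiles list
    manifests:
      rawYaml:
        - k8s/dev/*.yaml
  - name: prod
    manifests:
      rawYaml:
        - k8s/prod/*.yaml
```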

Rollback: Revert by creating a new rollout against a previously successful release. Cloud Deploy does not perform in-place rollback -- it deploys the older release as a new rollout.

```yaml
# Example delivery pipeline configuration
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: my-app-pipeline
serialPipeline:
  stages:
    - targetId: dev
      profiles: [dev]
    - targetId: staging
      profiles: [staging]
    - targetId: prod
      profiles: [prod]
      strategy:
        canary:
          runtimeConfig:
            kubernetes:
              serviceNetworking:
                service: my-app-svc
                deployment: my-app-deployment  # serviceNetworking needs both service and deployment
          canaryDeployment:
            percentages: [25, 50, 75]
```

Testing Strategies

The exam tests your understanding of testing types and where they fit in the pipeline:

| Test Type | When | What It Validates | GCP Integration |
|---|---|---|---|
| Unit tests | Build step in Cloud Build | Individual functions/methods in isolation | Run in build step container |
| Integration tests | Post-build, pre-deploy | Component interactions, API contracts | Cloud Build + test databases/services |
| Load/Performance tests | Pre-production | Capacity, latency under load | Cloud Build triggers + tools like Locust, JMeter |
| Smoke tests | Post-deployment | Basic functionality of deployed service | Cloud Deploy verification hooks |
| Canary analysis | During rollout | Production behavior with partial traffic | Cloud Deploy canary strategy |

Exam trap: The exam may present a scenario where a team deploys directly from development to production. The correct recommendation always includes intermediate environments (dev → staging → production) with appropriate testing gates at each stage.

Troubleshooting and Root Cause Analysis

For the exam, know this troubleshooting methodology:

  1. Define the problem clearly (what is the expected vs. actual behavior?)
  2. Gather data -- Cloud Logging, Cloud Monitoring metrics, Cloud Trace spans, Error Reporting
  3. Form hypotheses based on data
  4. Test hypotheses systematically (one change at a time)
  5. Implement fix and verify
  6. Document in a blameless postmortem

Key GCP troubleshooting tools:

| Tool | Purpose |
|---|---|
| Cloud Logging | Centralized log aggregation and search |
| Cloud Monitoring | Metrics, dashboards, uptime checks |
| Cloud Trace | Distributed tracing for latency analysis |
| Cloud Profiler | CPU and memory profiling of production apps |
| Error Reporting | Aggregates and deduplicates application errors |
| Cloud Debugger (deprecated) | Replaced by snapshot debugging in Cloud Code |

4.2 Analyzing and Defining Business Processes

CapEx vs. OpEx in Cloud Migration

This is a fundamental concept the exam tests in migration and cost justification scenarios.

| Attribute | CapEx (Capital Expenditure) | OpEx (Operational Expenditure) |
|---|---|---|
| Definition | Upfront investment in physical assets | Ongoing pay-as-you-go spending |
| Examples | On-premises servers, data center build-out, perpetual licenses | Cloud compute/storage consumption, SaaS subscriptions |
| Accounting | Depreciated over useful life (3-5 years) | Expensed in the period incurred |
| Cash flow | Large upfront outlay | Predictable recurring payments |
| Flexibility | Low -- locked into purchased capacity | High -- scale up/down as needed |
| Cloud model | Rare -- sole-tenant nodes and CUDs can have CapEx characteristics | Most cloud resources are OpEx |

Exam trap: Committed Use Discounts (CUDs) are still OpEx in cloud accounting terms -- you are committing to a spend level, not purchasing hardware. However, the Google Cloud cost optimization framework notes that sole-tenant nodes may be classified as capital expenditure. Know the distinction.

Cost Optimization Strategies

The Google Cloud Architecture Framework cost optimization pillar defines four principles: align spending with business value, foster cost awareness culture, optimize resource usage, and optimize continuously.

Compute Cost Optimization

| Strategy | Discount | Commitment | Best For |
|---|---|---|---|
| On-demand | 0% (baseline) | None | Unpredictable, short-lived workloads |
| Sustained Use Discounts (SUDs) | Up to 30% (N1); up to 20% (N2, N2D) | None (automatic) | Workloads running 25%+ of the month on eligible machine types |
| Committed Use Discounts (CUDs) | Up to 55% (general-purpose), up to 70% (memory-optimized) | 1-year or 3-year | Predictable, steady-state workloads |
| Spot VMs | 60-91% off on-demand | None (can be preempted anytime) | Fault-tolerant batch processing, CI/CD builds |
| Right-sizing | Varies | None | Over-provisioned VMs; use Recommender |
| Autoscaling | Varies | None | Variable-demand workloads |

Key distinctions:

  • Spot VMs replace the older preemptible VMs. Spot VMs have no 24-hour maximum lifetime (preemptible VMs did). Both can be reclaimed with 30-second notice.
  • SUDs are automatic -- no commitment required. They apply to older series such as N1 (up to 30%), N2 and N2D (up to 20%), sole-tenant nodes, and GPUs. E2, Tau, and newer machine families do not receive SUDs; for those, use CUDs instead.
  • CUDs are resource-based (vCPU + memory, applied across projects in a billing account) or spend-based (dollar commitment for specific services). Resource-based CUDs are more common for compute.

Exam trap: E2 machine types do NOT get sustained use discounts -- SUDs apply to older series such as N1 (and N2/N2D at a lower rate). For E2, Tau, and newer families, the only discount mechanism is CUDs. This is a frequently tested distinction.

Storage Cost Optimization

| Storage Class | Min Duration | Access Frequency | Use Case |
|---|---|---|---|
| Standard | None | Frequent | Active data, frequently accessed |
| Nearline | 30 days | Monthly access | Backups, infrequently accessed data |
| Coldline | 90 days | Quarterly access | Disaster recovery, compliance archives |
| Archive | 365 days | Annual access | Long-term regulatory retention |

Lifecycle policies automatically transition objects between storage classes or delete them based on age, creation date, or custom conditions. This is the primary cost optimization tool for Cloud Storage.
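As an illustration, a lifecycle policy in the JSON format accepted by gcloud storage buckets update --lifecycle-file, tiering objects down by age and eventually deleting them (the ages are arbitrary examples):

```json
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"},
     "condition": {"age": 365}}
  ]
}
```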

Exam trap: All storage classes offer the same low first-byte latency (milliseconds) -- Nearline, Coldline, and Archive are not slower to read. The savings come from lower at-rest storage prices in exchange for retrieval costs and minimum storage duration charges (availability SLAs also differ slightly by class). If you delete an Archive object after 30 days, you still pay for the full 365-day minimum.

Network Cost Optimization

  • Ingress is free on GCP. Egress is where costs accumulate.
  • Cloud CDN reduces egress costs by caching content at edge locations.
  • Same-zone traffic is free. Cross-zone traffic within a region incurs charges.
  • Private Google Access avoids egress charges for accessing Google APIs from VMs without external IPs.
  • Cloud Interconnect provides reduced egress rates compared to public internet egress.

Licensing

| Approach | Description | When to Choose |
|---|---|---|
| BYOL (Bring Your Own License) | Use existing on-premises licenses on cloud VMs (sole-tenant nodes for license compliance) | Existing enterprise agreements, license mobility rights |
| Cloud-native licensing | Pay-as-you-go licenses included in VM pricing (e.g., Windows Server, SQL Server on Compute Engine) | No existing licenses, want operational simplicity |
| License Manager | Track and manage license usage across GCP | Compliance and audit requirements |

Change Management and Decision Frameworks

The exam tests organizational aspects of cloud adoption:

  • Stakeholder management: Identify decision-makers, influencers, and affected teams. Cloud migration affects infrastructure, development, security, finance, and operations teams.
  • Communication plans: Regular cadence for migration status, cost reporting, and incident updates.
  • Technology selection framework: Evaluate based on requirements (scalability, cost, compliance, team expertise, vendor lock-in risk), not technology preference.

4.3 Developing Reliability Procedures

Business Continuity vs. Disaster Recovery

| Concept | Scope | Focus |
|---|---|---|
| Business Continuity Planning (BCP) | Entire business | Maintaining all critical business functions during and after disruption |
| Disaster Recovery (DR) | IT systems | Restoring IT infrastructure and data after a disaster |

DR is a subset of BCP. The exam tests both, but DR scenarios are more common.

RTO and RPO

These are the two most critical metrics in DR planning, and the exam tests them heavily.

| Metric | Definition | Drives | Example |
|---|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable time from disruption to service restoration | Architecture choices: active-active, warm standby, cold backup | RTO of 1 hour means the system must be restored within 60 minutes |
| RPO (Recovery Point Objective) | Maximum acceptable amount of data loss measured in time | Backup frequency, replication strategy | RPO of 15 minutes means you can lose at most 15 minutes of data |

DR pattern selection based on RTO/RPO:

| Pattern | RTO | RPO | Cost | GCP Implementation |
|---|---|---|---|---|
| Cold | Hours to days | Hours | Lowest | Snapshots in Cloud Storage; recreate infrastructure from IaC |
| Warm standby | Minutes to hours | Minutes | Medium | Scaled-down replica in secondary region, promote on failover |
| Hot standby | Seconds to minutes | Near-zero | High | Active-passive with Cloud SQL HA, regional GKE, Cloud DNS failover |
| Active-active (multi-region) | Near-zero | Near-zero | Highest | Spanner (multi-region), global load balancer, Cloud DNS routing |

Exam trap: Lower RTO/RPO always costs more. The exam frequently presents scenarios where you must choose the cheapest DR strategy that meets the stated RTO/RPO requirements. Do not over-architect. If the requirement is RTO=24 hours and RPO=4 hours, a cold DR pattern with periodic snapshots is correct -- not an active-active multi-region setup.

SLI / SLO / SLA Workflow

This topic comes directly from Google's SRE Book Chapter 4. Understand the hierarchy:

| Concept | Definition | Owner | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behavior | Engineering | Request latency (p99 < 200ms), availability (successful requests / total requests) |
| SLO (Service Level Objective) | A target value or range for an SLI | Engineering + Product | 99.9% availability over a rolling 30-day window |
| SLA (Service Level Agreement) | A contract with consequences for missing SLOs | Business + Legal | 99.5% availability; service credits if breached |

Key relationships:

  • SLIs are measured. SLOs are targets. SLAs are contracts.
  • SLOs should always be stricter than SLAs (internal target > external commitment).
  • The gap between SLO and SLA is your safety margin.

Error Budgets

The error budget is the complement of the SLO: if your SLO is 99.9% availability, your error budget is 0.1% downtime per measurement period.

Error budget math (30-day month = 43,200 minutes; allowed downtime = (1 − SLO) × 43,200 minutes):

| SLO | Error Budget | Allowed Downtime (30 days) |
|---|---|---|
| 99% | 1% | 7 hours 12 minutes |
| 99.9% | 0.1% | 43 minutes 12 seconds |
| 99.95% | 0.05% | 21 minutes 36 seconds |
| 99.99% | 0.01% | 4 minutes 19 seconds |

Error budget policy: When the budget is exhausted, slow down deployments and prioritize reliability work. When budget is healthy, deploy faster and take more risks. This is the core SRE mechanism for balancing velocity and reliability.

Exam trap: The exam may present a scenario where a team has consumed its error budget. The correct action is to freeze feature deployments and focus on reliability improvements -- not to relax the SLO or ignore the budget.

Deployment Strategies for Reliability

| Strategy | How It Works | Risk Level | Rollback Speed | Best For |
|---|---|---|---|---|
| Rolling update | Incrementally replace old instances with new | Medium | Slow (roll forward or back) | Stateless services, GKE default |
| Blue-green | Run two identical environments; switch traffic at once | Low (full environment tested before switch) | Fast (switch back to blue) | Zero-downtime requirement, stateful apps |
| Canary | Route small percentage of traffic to new version, gradually increase | Lowest | Fast (route 100% back to stable) | Production validation with minimal blast radius |
| A/B testing | Route traffic based on user attributes (not just percentage) | Low | Fast | Feature experimentation, UX testing |
| Recreate | Terminate all old, start all new | Highest (downtime window) | Slow | Dev/test environments, acceptable downtime |

Cloud Deploy natively supports the standard and canary strategies for GKE and Cloud Run targets; blue-green is typically implemented manually, for example by switching a load balancer backend or Kubernetes service selector between environments. Rolling updates are the default for GKE Deployments via kubectl.

Exam trap: Blue-green deployments require double the infrastructure during the transition. If cost is a constraint in the scenario, canary is often the better choice because only a small percentage of capacity runs the new version at any time.

Chaos Engineering and Penetration Testing

Chaos engineering deliberately introduces failures to verify system resilience. On GCP:

  • Test zone failures by draining instances from one zone
  • Test region failures by simulating DNS failover
  • Use Fault Injection in Istio/Anthos Service Mesh for HTTP error injection and delay injection
  • Automate chaos tests in CI/CD pipelines as part of pre-production validation

Penetration testing on GCP:

  • Google's Acceptable Use Policy permits penetration testing of your own resources without prior notification to Google.
  • You may test resources you own within your GCP projects.
  • You may not test Google's underlying infrastructure or other customers' resources.
  • Use Web Security Scanner (part of Security Command Center) for automated web application vulnerability scanning.

Exam trap: Unlike AWS (which used to require pre-approval for pen testing), GCP does not require you to notify Google before conducting penetration tests on your own resources. If the exam asks about pen testing prerequisites, the answer involves scoping to your own resources and following responsible testing practices -- not requesting permission from Google.

Monitoring and Alerting for Reliability

Cloud Monitoring is the observability backbone:

| Component | Purpose |
|---|---|
| Metrics | Time-series data from GCP services, custom metrics, and agents |
| Uptime checks | External probes testing endpoint availability from global locations |
| Alerting policies | Conditions (metric thresholds, absence, rate of change) + notification channels |
| Dashboards | Custom visualizations of metrics and logs |
| Notification channels | Email, SMS, PagerDuty, Slack, Pub/Sub, webhooks |
| SLO monitoring | Native SLO tracking with burn-rate alerts |

Alerting best practices for the exam:

  • Alert on symptoms (user-facing impact), not causes (CPU usage).
  • Use burn-rate alerts for SLO monitoring: alert when error budget is being consumed faster than expected.
  • Configure multiple notification channels for critical alerts (e.g., PagerDuty + email).
  • Set appropriate alignment periods and evaluation windows to reduce alert noise.
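
As a sketch of the symptom-based approach, an alerting policy in the YAML form accepted by gcloud alpha monitoring policies create --policy-from-file (the filter, threshold, and channel ID below are illustrative placeholders):

```yaml
displayName: "High 5xx rate (symptom-based alert)"
combiner: OR
conditions:
  - displayName: "5xx responses exceed 5/s for 5 minutes"
    conditionThreshold:
      filter: >
        metric.type="loadbalancing.googleapis.com/https/request_count" AND
        metric.labels.response_code_class="500"
      aggregations:
        - alignmentPeriod: 60s      # alignment period tuned to reduce noise
          perSeriesAligner: ALIGN_RATE
      comparison: COMPARISON_GT
      thresholdValue: 5
      duration: 300s                # condition must hold for 5 minutes
notificationChannels:
  - projects/my-project/notificationChannels/1234567890  # placeholder
```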

Infrastructure as Code for Reliability

Terraform is the primary IaC tool tested on the PCA exam:

| Concept | Exam Relevance |
|---|---|
| State file | Single source of truth for infrastructure; store in Cloud Storage backend with state locking |
| Drift detection | terraform plan detects differences between state and actual infrastructure |
| Modules | Reusable, versioned infrastructure components; use Cloud Foundation Toolkit modules |
| Workspaces | Manage multiple environments (dev/staging/prod) from same configuration |
| CI/CD integration | Run terraform plan in Cloud Build on PR, terraform apply on merge to main |

Exam trap: Terraform state files contain sensitive data (resource IDs, connection strings). Always store state in a remote backend (Cloud Storage bucket with versioning and encryption) with state locking enabled via Cloud Storage's built-in locking mechanism. Never store state in local files or version control.
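
A remote backend block following that guidance might look like this minimal sketch (the bucket name and prefix are placeholders; Terraform's gcs backend handles state locking automatically):

```hcl
terraform {
  backend "gcs" {
    bucket = "my-org-tf-state"  # versioned, encrypted Cloud Storage bucket
    prefix = "env/prod"         # separate state path per environment
  }
}
```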


Key Exam Patterns for Domain 4

  1. Pipeline design questions: Given a development workflow, select the correct combination of Cloud Build, Artifact Registry, and Cloud Deploy services.
  2. Cost optimization scenarios: Given a workload profile, select the cheapest viable compute/storage option that meets the requirements.
  3. DR strategy selection: Given RTO/RPO requirements, select the appropriate DR pattern (cold/warm/hot/active-active).
  4. SLO/error budget decisions: Given an SLO breach or budget exhaustion scenario, select the correct organizational response.
  5. Deployment strategy selection: Given reliability requirements and constraints (cost, downtime tolerance), select the appropriate deployment strategy.

References