
Domain 4: Analyzing and Optimizing Technical and Business Processes (~15%)

Domain 4 accounts for approximately 15% of the Professional Cloud Architect exam, translating to roughly 8-9 questions. This domain tests your ability to design CI/CD pipelines, analyze costs, make technology trade-off decisions, and build reliable systems using SRE principles. It bridges the gap between pure technical architecture (Domains 1-3) and real-world operational excellence. Expect scenario questions that require you to choose the right deployment strategy, justify a cost decision, or design a reliability workflow.


4.1 Analyzing and Defining Technical Processes

Software Development Lifecycle (SDLC) on GCP

The exam expects you to understand how development workflows map to GCP services across the full lifecycle. The canonical GCP CI/CD toolchain follows this flow:

Code → Cloud Source Repos / GitHub → Cloud Build (CI) → Artifact Registry → Cloud Deploy (CD) → GKE / Cloud Run

| SDLC Phase | GCP Service(s) | Purpose |
|---|---|---|
| Development | Cloud Workstations, Cloud Code, Cloud Shell | IDE environments with GCP integration |
| Source Control | Cloud Source Repositories; GitHub/GitLab/Bitbucket integration | Version control and trigger source |
| Build & Test | Cloud Build | Compile, test, containerize |
| Artifact Storage | Artifact Registry | Store container images, language packages, OS packages |
| Delivery | Cloud Deploy | Promote releases through environments |
| Runtime | GKE, Cloud Run, App Engine, Compute Engine | Production workloads |
| Monitoring | Cloud Monitoring, Cloud Logging, Cloud Trace | Observability and alerting |

Cloud Build: Continuous Integration

Cloud Build is Google Cloud's fully managed CI/CD platform. It executes builds on Google's infrastructure, imports source code from repositories or Cloud Storage, and produces artifacts like Docker images or compiled binaries. Cloud Build aligns with SLSA level 3 requirements for supply chain security.

Build configuration is defined in cloudbuild.yaml (or JSON). Each build consists of ordered build steps, where each step runs as a Docker container instance. Steps communicate via a local Docker network named cloudbuild and share data through a persistent /workspace volume.
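A minimal cloudbuild.yaml illustrating these mechanics -- the Artifact Registry repository name, region, and the _ENV substitution below are illustrative placeholders:

```yaml
# Minimal Cloud Build config: each step runs in its own container,
# and steps share the persistent /workspace volume.
steps:
  # Step 1: run unit tests using a cloud-provided builder image
  - name: 'gcr.io/cloud-builders/npm'
    args: ['test']
  # Step 2: build the container image, tagged with the commit SHA
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t',
           'us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/my-app:$COMMIT_SHA', '.']
substitutions:
  _ENV: 'dev'  # custom substitutions must start with an underscore
images:
  # Images listed here are pushed to Artifact Registry on success
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/my-repo/my-app:$COMMIT_SHA'
```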

Key concepts:

| Concept | Description | Exam Relevance |
|---|---|---|
| Build steps | Sequential Docker container executions; cloud-provided, community, or custom | Know that each step is an isolated container |
| Triggers | Automate builds on code changes from GitHub, GitLab, Bitbucket, Cloud Source Repos, Pub/Sub, webhooks, or schedules | Know trigger types and when to use each |
| Substitutions | Variable replacement in build configs ($PROJECT_ID, $COMMIT_SHA, custom _VARIABLES) | Used for parameterizing builds across environments |
| Worker pools | Default (managed, public internet access) or private (VPC-connected, custom machine types) | Private pools for builds needing internal network access |
| Build provenance | Verifiable metadata: image digests, source locations, toolchain, duration | Critical for Binary Authorization and supply chain security |
| Ephemeral environments | New VM per build, destroyed after completion | Security benefit: no cross-build contamination |

Builder images: Cloud Build provides pre-built builder images (gcr.io/cloud-builders/*) for common tasks (docker, gcloud, npm, maven, gradle, kubectl). Standard Docker Hub images are also supported.

Exam trap: Cloud Build uses Docker engine to execute builds, but Cloud Build itself is not a Docker registry. Artifacts are pushed to Artifact Registry (not Container Registry, which is deprecated). If an exam question mentions storing container images, the answer is Artifact Registry.

Artifact Registry

Artifact Registry is the universal artifact management service, replacing the legacy Container Registry. It stores:

  • Container images (Docker, OCI)
  • Language packages (Maven, npm, Python, Go, NuGet)
  • OS packages (Debian, RPM)
  • Helm charts

Key features for the exam: regional and multi-regional repositories, IAM-based access control, vulnerability scanning via Artifact Analysis, and integration with Binary Authorization for deployment policy enforcement.

Exam trap: Container Registry (gcr.io) is deprecated. Always choose Artifact Registry ({region}-docker.pkg.dev) in exam answers. They are not the same service.

Cloud Deploy: Continuous Delivery

Cloud Deploy is a managed continuous delivery service that automates application delivery through a defined promotion sequence. It is not a CI tool -- it handles the CD side, taking built artifacts and deploying them through environments.

Core terminology:

| Term | Definition |
|---|---|
| Delivery Pipeline | Defines the ordered promotion sequence of targets and metadata |
| Target | A deployment destination: GKE cluster, Cloud Run service, or Anthos cluster |
| Release | A rendered manifest snapshot tied to specific container image versions; created via Skaffold |
| Rollout | The execution of deploying a release to a specific target |
| Promotion | Moving a release from one target to the next in the pipeline sequence |
| Approval | A gate (requireApproval: true) that blocks promotion until explicitly approved |

Deployment flow:

  1. CI pipeline (Cloud Build) creates a release referencing built artifacts
  2. Cloud Deploy automatically rolls out to the first target (e.g., dev)
  3. Operator promotes the release to subsequent targets (staging, production)
  4. Each promotion creates a new rollout for that target
  5. Targets with requireApproval block until an authorized user approves
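
Putting the approval gate into configuration, a production target might look like this sketch (the project, region, and cluster names are placeholders):

```yaml
apiVersion: deploy.cloud.google.com/v1
kind: Target
metadata:
  name: prod
description: production cluster
requireApproval: true  # promotions to this target block until approved
gke:
  cluster: projects/my-project/locations/us-central1/clusters/prod-cluster
```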

Skaffold integration: Cloud Deploy uses Skaffold under the hood for manifest rendering and deployment execution. A skaffold.yaml is required alongside your Kubernetes manifests or Cloud Run service definitions.
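A minimal skaffold.yaml pairing with per-stage profiles might look like the following sketch (the apiVersion and manifest paths are illustrative and vary by Skaffold version):

```yaml
apiVersion: skaffold/v4beta7
kind: Config
manifests:
  rawYaml:
    - k8s/base/*.yaml      # default manifests
profiles:
  - name: dev              # selected by a pipeline stage's profiles list
    manifests:
      rawYaml:
        - k8s/dev/*.yaml
  - name: prod
    manifests:
      rawYaml:
        - k8s/prod/*.yaml
```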

Rollback: Revert by creating a new rollout against a previously successful release. Cloud Deploy does not perform in-place rollback -- it deploys the older release as a new rollout.

```yaml
# Example delivery pipeline configuration
apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: my-app-pipeline
serialPipeline:
  stages:
    - targetId: dev
      profiles: [dev]
    - targetId: staging
      profiles: [staging]
    - targetId: prod
      profiles: [prod]
      strategy:
        canary:
          runtimeConfig:
            kubernetes:
              serviceNetworking:
                service: my-app-svc
                deployment: my-app-deployment  # serviceNetworking needs both service and deployment
          canaryDeployment:
            percentages: [25, 50, 75]
```

Testing Strategies

The exam tests your understanding of testing types and where they fit in the pipeline:

| Test Type | When | What It Validates | GCP Integration |
|---|---|---|---|
| Unit tests | Build step in Cloud Build | Individual functions/methods in isolation | Run in build step container |
| Integration tests | Post-build, pre-deploy | Component interactions, API contracts | Cloud Build + test databases/services |
| Load/Performance tests | Pre-production | Capacity, latency under load | Cloud Build triggers + tools like Locust, JMeter |
| Smoke tests | Post-deployment | Basic functionality of deployed service | Cloud Deploy verification hooks |
| Canary analysis | During rollout | Production behavior with partial traffic | Cloud Deploy canary strategy |

Exam trap: The exam may present a scenario where a team deploys directly from development to production. The correct recommendation always includes intermediate environments (dev → staging → production) with appropriate testing gates at each stage.

Troubleshooting and Root Cause Analysis

For the exam, know this troubleshooting methodology:

  1. Define the problem clearly (what is the expected vs. actual behavior?)
  2. Gather data -- Cloud Logging, Cloud Monitoring metrics, Cloud Trace spans, Error Reporting
  3. Form hypotheses based on data
  4. Test hypotheses systematically (one change at a time)
  5. Implement fix and verify
  6. Document in a blameless postmortem

Key GCP troubleshooting tools:

| Tool | Purpose |
|---|---|
| Cloud Logging | Centralized log aggregation and search |
| Cloud Monitoring | Metrics, dashboards, uptime checks |
| Cloud Trace | Distributed tracing for latency analysis |
| Cloud Profiler | CPU and memory profiling of production apps |
| Error Reporting | Aggregates and deduplicates application errors |
| Cloud Debugger (deprecated) | Replaced by snapshot debugging in Cloud Code |

4.2 Analyzing and Defining Business Processes

CapEx vs. OpEx in Cloud Migration

This is a fundamental concept the exam tests in migration and cost justification scenarios.

| Attribute | CapEx (Capital Expenditure) | OpEx (Operational Expenditure) |
|---|---|---|
| Definition | Upfront investment in physical assets | Ongoing pay-as-you-go spending |
| Examples | On-premises servers, data center build-out, perpetual licenses | Cloud compute/storage consumption, SaaS subscriptions |
| Accounting | Depreciated over useful life (3-5 years) | Expensed in the period incurred |
| Cash flow | Large upfront outlay | Predictable recurring payments |
| Flexibility | Low -- locked into purchased capacity | High -- scale up/down as needed |
| Cloud model | Rare -- sole-tenant nodes and CUDs can have CapEx characteristics | Most cloud resources are OpEx |

Exam trap: Committed Use Discounts (CUDs) are still OpEx in cloud accounting terms -- you are committing to a spend level, not purchasing hardware. However, the Google Cloud cost optimization framework notes that sole-tenant nodes may be classified as capital expenditure. Know the distinction.

Cost Optimization Strategies

The Google Cloud Architecture Framework cost optimization pillar defines four principles: align spending with business value, foster cost awareness culture, optimize resource usage, and optimize continuously.

Compute Cost Optimization

| Strategy | Discount | Commitment | Best For |
|---|---|---|---|
| On-demand | 0% (baseline) | None | Unpredictable, short-lived workloads |
| Sustained Use Discounts (SUDs) | Up to 30% (N1); up to 20% (N2, N2D) | None (automatic) | Workloads running 25%+ of the month on eligible machine types |
| Committed Use Discounts (CUDs) | Up to 55% (general-purpose), up to 70% (memory-optimized) | 1-year or 3-year | Predictable, steady-state workloads |
| Spot VMs | 60-91% off on-demand | None (can be preempted anytime) | Fault-tolerant batch processing, CI/CD builds |
| Right-sizing | Varies | None | Over-provisioned VMs; use Recommender |
| Autoscaling | Varies | None | Variable-demand workloads |

Key distinctions:

  • Spot VMs replace the older preemptible VMs. Spot VMs have no 24-hour maximum lifetime (preemptible VMs did). Both can be reclaimed with 30-second notice.
  • SUDs are automatic -- no commitment required. They apply to older series such as N1 (up to 30%), N2 and N2D (up to 20%), sole-tenant nodes, and GPUs. E2, Tau, and newer machine families do not receive SUDs; for those, use CUDs instead.
  • CUDs are resource-based (vCPU + memory, applied across projects in a billing account) or spend-based (dollar commitment for specific services). Resource-based CUDs are more common for compute.

Exam trap: E2 machine types do NOT get sustained use discounts -- SUDs apply to older series such as N1 (and N2/N2D at a lower rate). For E2, Tau, and newer families, the only discount mechanism is CUDs. This is a frequently tested distinction.

Storage Cost Optimization

| Storage Class | Min Duration | Access Frequency | Use Case |
|---|---|---|---|
| Standard | None | Frequent | Active data, frequently accessed |
| Nearline | 30 days | Monthly access | Backups, infrequently accessed data |
| Coldline | 90 days | Quarterly access | Disaster recovery, compliance archives |
| Archive | 365 days | Annual access | Long-term regulatory retention |

Lifecycle policies automatically transition objects between storage classes or delete them based on age, creation date, or custom conditions. This is the primary cost optimization tool for Cloud Storage.
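As an illustration, a lifecycle policy in the JSON format accepted by gcloud storage buckets update --lifecycle-file, tiering objects down by age and eventually deleting them (the ages are arbitrary examples):

```json
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 90}},
    {"action": {"type": "Delete"},
     "condition": {"age": 365}}
  ]
}
```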

Exam trap: All storage classes offer the same low first-byte latency (milliseconds) -- Nearline, Coldline, and Archive are not slower to read. The savings come from lower at-rest storage prices in exchange for retrieval costs and minimum storage duration charges (availability SLAs also differ slightly by class). If you delete an Archive object after 30 days, you still pay for the full 365-day minimum.

Network Cost Optimization

  • Ingress is free on GCP. Egress is where costs accumulate.
  • Cloud CDN reduces egress costs by caching content at edge locations.
  • Same-zone traffic is free. Cross-zone traffic within a region incurs charges.
  • Private Google Access avoids egress charges for accessing Google APIs from VMs without external IPs.
  • Cloud Interconnect provides reduced egress rates compared to public internet egress.

Licensing

| Approach | Description | When to Choose |
|---|---|---|
| BYOL (Bring Your Own License) | Use existing on-premises licenses on cloud VMs (sole-tenant nodes for license compliance) | Existing enterprise agreements, license mobility rights |
| Cloud-native licensing | Pay-as-you-go licenses included in VM pricing (e.g., Windows Server, SQL Server on Compute Engine) | No existing licenses, want operational simplicity |
| License Manager | Track and manage license usage across GCP | Compliance and audit requirements |

Change Management and Decision Frameworks

The exam tests organizational aspects of cloud adoption:

  • Stakeholder management: Identify decision-makers, influencers, and affected teams. Cloud migration affects infrastructure, development, security, finance, and operations teams.
  • Communication plans: Regular cadence for migration status, cost reporting, and incident updates.
  • Technology selection framework: Evaluate based on requirements (scalability, cost, compliance, team expertise, vendor lock-in risk), not technology preference.

4.3 Developing Reliability Procedures

Business Continuity vs. Disaster Recovery

| Concept | Scope | Focus |
|---|---|---|
| Business Continuity Planning (BCP) | Entire business | Maintaining all critical business functions during and after disruption |
| Disaster Recovery (DR) | IT systems | Restoring IT infrastructure and data after a disaster |

DR is a subset of BCP. The exam tests both, but DR scenarios are more common.

RTO and RPO

These are the two most critical metrics in DR planning, and the exam tests them heavily.

| Metric | Definition | Drives | Example |
|---|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable time from disruption to service restoration | Architecture choices: active-active, warm standby, cold backup | RTO of 1 hour means the system must be restored within 60 minutes |
| RPO (Recovery Point Objective) | Maximum acceptable amount of data loss measured in time | Backup frequency, replication strategy | RPO of 15 minutes means you can lose at most 15 minutes of data |

DR pattern selection based on RTO/RPO:

| Pattern | RTO | RPO | Cost | GCP Implementation |
|---|---|---|---|---|
| Cold | Hours to days | Hours | Lowest | Snapshots in Cloud Storage; recreate infrastructure from IaC |
| Warm standby | Minutes to hours | Minutes | Medium | Scaled-down replica in secondary region, promote on failover |
| Hot standby | Seconds to minutes | Near-zero | High | Active-passive with Cloud SQL HA, regional GKE, Cloud DNS failover |
| Active-active (multi-region) | Near-zero | Near-zero | Highest | Spanner (multi-region), global load balancer, Cloud DNS routing |

Exam trap: Lower RTO/RPO always costs more. The exam frequently presents scenarios where you must choose the cheapest DR strategy that meets the stated RTO/RPO requirements. Do not over-architect. If the requirement is RTO=24 hours and RPO=4 hours, a cold DR pattern with periodic snapshots is correct -- not an active-active multi-region setup.

SLI / SLO / SLA Workflow

This topic comes directly from Google's SRE Book Chapter 4. Understand the hierarchy:

| Concept | Definition | Owner | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behavior | Engineering | Request latency (p99 < 200ms), availability (successful requests / total requests) |
| SLO (Service Level Objective) | A target value or range for an SLI | Engineering + Product | 99.9% availability over a rolling 30-day window |
| SLA (Service Level Agreement) | A contract with consequences for missing SLOs | Business + Legal | 99.5% availability; service credits if breached |

Key relationships:

  • SLIs are measured. SLOs are targets. SLAs are contracts.
  • SLOs should always be stricter than SLAs (internal target > external commitment).
  • The gap between SLO and SLA is your safety margin.

Error Budgets

The error budget is the complement of the SLO: if your SLO is 99.9% availability, your error budget is 0.1% downtime per measurement period.

Error budget math (30-day month = 43,200 minutes; allowed downtime = (1 − SLO) × 43,200 minutes):

| SLO | Error Budget | Allowed Downtime (30 days) |
|---|---|---|
| 99% | 1% | 7 hours 12 minutes |
| 99.9% | 0.1% | 43 minutes 12 seconds |
| 99.95% | 0.05% | 21 minutes 36 seconds |
| 99.99% | 0.01% | 4 minutes 19 seconds |

Error budget policy: When the budget is exhausted, slow down deployments and prioritize reliability work. When budget is healthy, deploy faster and take more risks. This is the core SRE mechanism for balancing velocity and reliability.

Exam trap: The exam may present a scenario where a team has consumed its error budget. The correct action is to freeze feature deployments and focus on reliability improvements -- not to relax the SLO or ignore the budget.

Deployment Strategies for Reliability

| Strategy | How It Works | Risk Level | Rollback Speed | Best For |
|---|---|---|---|---|
| Rolling update | Incrementally replace old instances with new | Medium | Slow (roll forward or back) | Stateless services, GKE default |
| Blue-green | Run two identical environments; switch traffic at once | Low (full environment tested before switch) | Fast (switch back to blue) | Zero-downtime requirement, stateful apps |
| Canary | Route small percentage of traffic to new version, gradually increase | Lowest | Fast (route 100% back to stable) | Production validation with minimal blast radius |
| A/B testing | Route traffic based on user attributes (not just percentage) | Low | Fast | Feature experimentation, UX testing |
| Recreate | Terminate all old, start all new | Highest (downtime window) | Slow | Dev/test environments, acceptable downtime |

Cloud Deploy natively supports the standard and canary strategies for GKE and Cloud Run targets; blue-green is typically implemented manually, for example by switching a load balancer backend or Kubernetes service selector between environments. Rolling updates are the default for GKE Deployments via kubectl.

Exam trap: Blue-green deployments require double the infrastructure during the transition. If cost is a constraint in the scenario, canary is often the better choice because only a small percentage of capacity runs the new version at any time.

Chaos Engineering and Penetration Testing

Chaos engineering deliberately introduces failures to verify system resilience. On GCP:

  • Test zone failures by draining instances from one zone
  • Test region failures by simulating DNS failover
  • Use Fault Injection in Istio/Anthos Service Mesh for HTTP error injection and delay injection
  • Automate chaos tests in CI/CD pipelines as part of pre-production validation

Penetration testing on GCP:

  • Google's Acceptable Use Policy permits penetration testing of your own resources without prior notification to Google.
  • You may test resources you own within your GCP projects.
  • You may not test Google's underlying infrastructure or other customers' resources.
  • Use Web Security Scanner (part of Security Command Center) for automated web application vulnerability scanning.

Exam trap: Unlike AWS (which used to require pre-approval for pen testing), GCP does not require you to notify Google before conducting penetration tests on your own resources. If the exam asks about pen testing prerequisites, the answer involves scoping to your own resources and following responsible testing practices -- not requesting permission from Google.

Monitoring and Alerting for Reliability

Cloud Monitoring is the observability backbone:

| Component | Purpose |
|---|---|
| Metrics | Time-series data from GCP services, custom metrics, and agents |
| Uptime checks | External probes testing endpoint availability from global locations |
| Alerting policies | Conditions (metric thresholds, absence, rate of change) + notification channels |
| Dashboards | Custom visualizations of metrics and logs |
| Notification channels | Email, SMS, PagerDuty, Slack, Pub/Sub, webhooks |
| SLO monitoring | Native SLO tracking with burn-rate alerts |

Alerting best practices for the exam:

  • Alert on symptoms (user-facing impact), not causes (CPU usage).
  • Use burn-rate alerts for SLO monitoring: alert when error budget is being consumed faster than expected.
  • Configure multiple notification channels for critical alerts (e.g., PagerDuty + email).
  • Set appropriate alignment periods and evaluation windows to reduce alert noise.
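
As a sketch of the symptom-based approach, an alerting policy in the YAML form accepted by gcloud alpha monitoring policies create --policy-from-file (the filter, threshold, and channel ID below are illustrative placeholders):

```yaml
displayName: "High 5xx rate (symptom-based alert)"
combiner: OR
conditions:
  - displayName: "5xx responses exceed 5/s for 5 minutes"
    conditionThreshold:
      filter: >
        metric.type="loadbalancing.googleapis.com/https/request_count" AND
        metric.labels.response_code_class="500"
      aggregations:
        - alignmentPeriod: 60s      # alignment period tuned to reduce noise
          perSeriesAligner: ALIGN_RATE
      comparison: COMPARISON_GT
      thresholdValue: 5
      duration: 300s                # condition must hold for 5 minutes
notificationChannels:
  - projects/my-project/notificationChannels/1234567890  # placeholder
```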

Infrastructure as Code for Reliability

Terraform is the primary IaC tool tested on the PCA exam:

| Concept | Exam Relevance |
|---|---|
| State file | Single source of truth for infrastructure; store in Cloud Storage backend with state locking |
| Drift detection | terraform plan detects differences between state and actual infrastructure |
| Modules | Reusable, versioned infrastructure components; use Cloud Foundation Toolkit modules |
| Workspaces | Manage multiple environments (dev/staging/prod) from same configuration |
| CI/CD integration | Run terraform plan in Cloud Build on PR, terraform apply on merge to main |

Exam trap: Terraform state files contain sensitive data (resource IDs, connection strings). Always store state in a remote backend (Cloud Storage bucket with versioning and encryption) with state locking enabled via Cloud Storage's built-in locking mechanism. Never store state in local files or version control.
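
A remote backend block following that guidance might look like this minimal sketch (the bucket name and prefix are placeholders; Terraform's gcs backend handles state locking automatically):

```hcl
terraform {
  backend "gcs" {
    bucket = "my-org-tf-state"  # versioned, encrypted Cloud Storage bucket
    prefix = "env/prod"         # separate state path per environment
  }
}
```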


Key Exam Patterns for Domain 4

  1. Pipeline design questions: Given a development workflow, select the correct combination of Cloud Build, Artifact Registry, and Cloud Deploy services.
  2. Cost optimization scenarios: Given a workload profile, select the cheapest viable compute/storage option that meets the requirements.
  3. DR strategy selection: Given RTO/RPO requirements, select the appropriate DR pattern (cold/warm/hot/active-active).
  4. SLO/error budget decisions: Given an SLO breach or budget exhaustion scenario, select the correct organizational response.
  5. Deployment strategy selection: Given reliability requirements and constraints (cost, downtime tolerance), select the appropriate deployment strategy.

References