Domain 2: Managing and Provisioning a Solution Infrastructure (~17.5%)
Domain 2 accounts for approximately 17.5% of the Professional Cloud Architect exam (v6.1, October 2025), translating to roughly 9-10 questions. This domain tests your ability to configure networking, storage, compute, and -- new in v6.1 -- Vertex AI infrastructure. Unlike Domain 1 (which is about design), Domain 2 is about execution: knowing which knobs to turn and which services to provision for a given architecture. Expect scenario-based questions that present a workload and ask you to select the correct network topology, storage tier, compute configuration, or ML pipeline component.
2.1 Configuring Network Topologies
Hybrid Connectivity Options
The exam frequently tests the choice between Cloud VPN, Dedicated Interconnect, and Partner Interconnect. The decision hinges on bandwidth, latency, cost, and whether traffic must avoid the public internet.
| Feature | HA VPN | Classic VPN | Dedicated Interconnect | Partner Interconnect |
|---|---|---|---|---|
| Bandwidth | ~1-3 Gbps per tunnel | ~1-3 Gbps per tunnel | 10 Gbps or 100 Gbps per link | 50 Mbps - 50 Gbps per attachment |
| Max capacity | Multiple tunnels | Multiple tunnels | 80 Gbps (8x10G) or 200 Gbps (2x100G) | 50 Gbps per attachment |
| SLA | 99.99% (proper config) | 99.9% | 99.99% (redundant) or 99.9% | Depends on partner |
| Encryption | IPsec (built-in) | IPsec (built-in) | None by default (MACsec optional) | None by default |
| Routing | BGP (dynamic only) | Static only | BGP (dynamic) | BGP (dynamic) |
| Traffic path | Public internet | Public internet | Private (Google edge) | Private (via partner) |
| IPv6 | Yes (dual-stack) | No | Yes | Provider-dependent |
| Setup time | Minutes | Minutes | Weeks (physical cross-connect) | Days (partner provisioning) |
Cross-Cloud Interconnect: Provides direct physical connections to AWS, Azure, OCI, and Alibaba Cloud through Google's network. Available in 10 Gbps and 100 Gbps circuits. This is the answer when the exam describes a multicloud architecture that needs private, low-latency connectivity between clouds without traversing the public internet.
Key decision points for the exam:
- Need >3 Gbps sustained bandwidth? Interconnect. VPN tunnels top out around 3 Gbps each.
- Data must not traverse the public internet? Dedicated Interconnect or Partner Interconnect. Cloud VPN uses IPsec over the public internet.
- Quick setup, moderate bandwidth? HA VPN. It deploys in minutes and provides 99.99% SLA.
- No colocation facility? Partner Interconnect. Dedicated Interconnect requires physical presence at a Google peering edge.
- Need encryption on Interconnect? Use HA VPN over Cloud Interconnect or enable MACsec.
Exam trap: Classic VPN only supports static routing and has a 99.9% SLA. HA VPN requires BGP (Cloud Router) and achieves 99.99%. If the exam describes a need for dynamic routing or high availability, Classic VPN is the wrong answer.
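The decision points above can be sketched as a small helper. This is a hypothetical function for study purposes, with thresholds taken from the table (VPN tunnels topping out around 3 Gbps, Dedicated Interconnect requiring colocation presence):

```python
def choose_connectivity(bandwidth_gbps: float,
                        private_path_required: bool,
                        has_colocation: bool,
                        target_is_another_cloud: bool = False) -> str:
    """Encode the exam decision points above (hypothetical helper)."""
    if target_is_another_cloud:
        return "Cross-Cloud Interconnect"
    if not private_path_required and bandwidth_gbps <= 3:
        return "HA VPN"  # minutes to set up, 99.99% SLA with proper config
    if has_colocation:
        return "Dedicated Interconnect"  # 10/100 Gbps links, up to 200 Gbps
    return "Partner Interconnect"  # 50 Mbps - 50 Gbps per attachment

# Example: 20 Gbps sustained, traffic must stay off the public internet,
# no presence in a colocation facility:
print(choose_connectivity(20, private_path_required=True, has_colocation=False))
# -> Partner Interconnect
```

Real architectures weigh more factors (cost, existing carrier relationships, IPv6 needs), but this captures the branching logic the exam rewards.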
Cloud Router and BGP
Cloud Router provides dynamic routing via BGP for VPN tunnels and Interconnect VLAN attachments. Key facts:
- Regional resource: Each Cloud Router advertises routes for subnets in its region (regional routing) or all subnets in the VPC (global routing, if global dynamic routing mode is enabled).
- BGP ASN: You assign a private ASN (64512-65534 or 4200000000-4294967294) to each Cloud Router.
- Graceful restart: Maintains forwarding during Cloud Router maintenance.
- Custom route advertisements: Override default subnet advertisements to advertise specific IP ranges to on-premises.
Exam trap: If you need on-premises to reach subnets in multiple regions through a single VPN gateway, you must enable global dynamic routing on the VPC. The default is regional, meaning Cloud Router only advertises routes for subnets in its own region.
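A minimal CLI sketch of the two settings that trap describes (resource names are hypothetical; flags shown are the standard gcloud ones):

```shell
# Switch the VPC to global dynamic routing so a single Cloud Router
# advertises subnets from every region, not just its own:
gcloud compute networks update my-vpc --bgp-routing-mode=global

# Create a Cloud Router with a private ASN for the BGP session:
gcloud compute routers create my-router \
    --network=my-vpc --region=us-central1 --asn=64512
```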
VPC Design Patterns
Shared VPC
Shared VPC allows a host project to share its VPC network with service projects in the same organization. This is the standard pattern for centralized network administration.
- Host project: Owns the VPC, subnets, firewall rules, routes, and VPN/Interconnect connections.
- Service projects: Deploy resources (VMs, GKE clusters, Cloud SQL) into subnets from the host project.
- IAM control: Network admins manage the host project; application teams work in service projects with the compute.networkUser role on specific subnets.
- Billing: Each service project retains its own billing account for compute resources.
When to use: Organizations that need centralized network control with decentralized project management. This is the default answer for enterprise network architecture on the exam.
VPC Peering
VPC Network Peering connects two VPC networks (same or different organizations) so they can communicate via internal IPs.
- Non-transitive: If VPC-A peers with VPC-B and VPC-B peers with VPC-C, VPC-A cannot reach VPC-C through VPC-B. Each pair needs its own peering.
- Subnet route exchange: Peered VPCs automatically exchange subnet routes.
- Default quota of 25 peers: Each VPC can peer with up to 25 other VPCs by default (adjustable via quota increase request).
- Cross-organization: Peering works across different organizations.
Exam trap: VPC Peering is non-transitive. If the question describes three or more networks that all need to communicate, Shared VPC or a hub-and-spoke topology is the answer, not chaining peerings.
Hub-and-Spoke Topology
For complex multinetwork architectures, use Network Connectivity Center as a hub connecting multiple spoke VPCs, VPN tunnels, and Interconnect attachments. This provides transitive routing between spokes through the hub.
Private Access Options
| Access Type | Purpose | How It Works |
|---|---|---|
| Private Google Access | VMs without external IPs access Google APIs/services | Enabled per subnet; traffic uses internal IPs to reach *.googleapis.com |
| Private Services Access | Connect to Google-managed services (Cloud SQL, Memorystore) via internal IPs | Uses VPC peering with a Google-managed VPC; allocate an IP range in your VPC |
| Private Service Connect | Access Google APIs or published services via a consumer endpoint in your VPC | Creates an internal IP endpoint; supports custom DNS and granular control |
| Restricted VIP | Access Google APIs while enforcing VPC Service Controls | Route to restricted.googleapis.com (199.36.153.4/30); only services in the perimeter are reachable |
Exam trap: Private Google Access must be enabled on the subnet. It does not apply to VMs that have external IPs (they already reach Google APIs directly). If the scenario says "VMs with only internal IPs need to reach Cloud Storage," Private Google Access is the answer.
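Enabling Private Google Access is a one-flag subnet update. A sketch (subnet and region names hypothetical):

```shell
# Enable Private Google Access per subnet so internal-only VMs can reach
# *.googleapis.com without external IPs:
gcloud compute networks subnets update my-subnet \
    --region=us-central1 \
    --enable-private-ip-google-access
```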
VPC Service Controls
VPC Service Controls create security perimeters around Google Cloud resources to prevent data exfiltration.
- Service perimeter: Defines which projects and services are protected. Free communication inside; blocked across the boundary by default.
- Access levels: Allow external access based on IP range, device posture, identity, or geolocation.
- Ingress/egress rules: Granular policies controlling which identities and services can cross the perimeter.
- Perimeter bridges: Bidirectional connections between two perimeters for controlled data sharing.
- Dry-run mode: Test perimeter configurations before enforcement.
The restricted VIP (restricted.googleapis.com, 199.36.153.4/30) ensures that traffic from VMs to Google APIs stays within the perimeter.
Hierarchical Firewall Policies
Firewall rules are evaluated in the following order:
- Organization-level hierarchical firewall policies
- Folder-level hierarchical firewall policies
- VPC firewall rules (including network firewall policies)
Each hierarchical rule can allow, deny, or goto_next (delegate to the next level). This enables central security teams to enforce baseline rules (e.g., block all SSH from the internet) while letting project teams add more specific rules.
2.2 Configuring Storage Systems
Cloud Storage Classes
Cloud Storage offers four storage classes, all with 99.999999999% (eleven 9s) durability. The exam tests class selection based on access frequency and cost optimization.
| Class | Min Storage Duration | Retrieval Cost | Availability SLA (multi-region / single-region) | Use Case |
|---|---|---|---|---|
| Standard | None | None | 99.95% / 99.9% | Frequently accessed (hot) data |
| Nearline | 30 days | Per-GB fee | 99.9% / 99.0% | Accessed once a month or less |
| Coldline | 90 days | Higher per-GB fee | 99.9% / 99.0% | Accessed once a quarter or less |
| Archive | 365 days | Highest per-GB fee | 99.9% / 99.0% | Long-term archival, compliance |
Critical facts:
- Archive is not slow: Data is available within milliseconds, unlike AWS Glacier. The cost is in retrieval fees and the 365-day minimum storage charge.
- Autoclass: Automatically transitions objects between classes based on access patterns. Eliminates the need for manual lifecycle rules.
- Object Lifecycle Management: Rules can automatically delete objects, change storage class, or abort incomplete multipart uploads, triggered by conditions such as age, creation date, or version state.
- Retention policies: Lock objects from deletion for a specified period. Combined with bucket lock, this becomes irreversible (compliance use cases).
- Object versioning: Maintains historical versions of objects; non-current versions can be lifecycle-managed separately.
Exam trap: Minimum storage duration is a billing concept, not an access restriction. You can delete a Nearline object after 1 day, but you are billed for the remaining 29 days. The exam tests whether you understand this distinction.
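The billing distinction is just a max() over actual and minimum days. A worked example (hypothetical helper; durations from the table above):

```python
MIN_DURATION_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}

def billed_storage_days(storage_class: str, days_stored: int) -> int:
    """Days of storage you pay for: actual days, topped up to the class
    minimum if the object is deleted early. Billing concept only --
    deletion itself is never blocked."""
    return max(days_stored, MIN_DURATION_DAYS[storage_class])

# Delete a Nearline object after 1 day: billed as if stored 30 days.
print(billed_storage_days("NEARLINE", 1))   # -> 30
print(billed_storage_days("STANDARD", 1))   # -> 1
print(billed_storage_days("ARCHIVE", 400))  # -> 400 (past the minimum)
```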
Database Selection
The exam heavily tests database selection. Know the decision matrix cold.
| Requirement | Best Service | Why |
|---|---|---|
| Global transactions with strong consistency | Cloud Spanner | Only globally distributed relational DB with external consistency |
| Managed MySQL/PostgreSQL/SQL Server | Cloud SQL | Fully managed, vertical scaling, HA with failover replicas |
| High-performance PostgreSQL | AlloyDB | PostgreSQL-compatible with columnar engine; high OLTP throughput and accelerated analytical queries |
| Petabyte-scale analytics (OLAP) | BigQuery | Serverless, columnar, SQL interface, built-in ML |
| High-throughput NoSQL, time-series, IoT | Bigtable | Wide-column, sub-ms latency, HBase-compatible API |
| Document DB, mobile/web real-time sync | Firestore | Serverless, ACID transactions, offline-first, real-time listeners |
| In-memory caching | Memorystore | Managed Redis/Memcached, sub-ms latency |
Key differentiators for exam scenarios:
- Cloud SQL vs. Spanner: Cloud SQL scales vertically (bigger machine) and supports read replicas. Spanner scales horizontally with automatic sharding. If the scenario mentions "global users" or "horizontal scaling of writes," the answer is Spanner. If the scenario is a standard web app with moderate scale, the answer is Cloud SQL.
- Cloud SQL vs. AlloyDB: AlloyDB is PostgreSQL-compatible with a columnar engine for mixed OLTP/OLAP. If the scenario describes PostgreSQL with analytical queries or very high transaction throughput, AlloyDB is the answer.
- Bigtable vs. Firestore: Bigtable is for massive throughput on flat, wide-column data (IoT, time-series, financial tickers). Firestore is for document-oriented data with real-time sync (mobile apps, user profiles). Firestore is serverless; Bigtable requires provisioned nodes.
- BigQuery: Not a transactional database. If the scenario involves OLTP workloads, BigQuery is the wrong answer. It is the answer for analytics, reporting, data warehousing, and ML on structured data.
Exam trap: Firestore has two modes -- Native mode (full document DB with real-time sync) and Datastore mode (backward-compatible with legacy Datastore). A project can only use one mode. Native mode is the default for new projects. If the exam mentions "legacy Datastore application," Firestore in Datastore mode is the migration path.
Filestore Service Tiers
Filestore provides managed NFS file shares. Know the tier differences:
| Tier | Capacity | Availability | Key Feature |
|---|---|---|---|
| Basic HDD | 1 - 63.9 TiB | Zonal | Lowest cost, general file sharing |
| Basic SSD | 2.5 - 63.9 TiB | Zonal | Higher IOPS, no CMEK support |
| Zonal | 1 - 100 TiB | Single zone | Configurable performance, snapshots, CMEK |
| Regional | 1 - 100 TiB | Multi-zone | Zone-resilient, data replicated across zones |
| Enterprise | 1 - 10 TiB | Regional | Multishare (up to 80 shares), GKE-optimized |
When to use Filestore: Shared file storage for GKE pods, legacy applications expecting NFS, HPC scratch space, content management. If the scenario describes shared file access across multiple VMs or pods, Filestore is the answer.
2.3 Configuring Compute Systems
Compute Engine Machine Types
Compute Engine offers five machine families. The exam tests selecting the right family for a workload.
| Family | Series | Use Case | Key Characteristics |
|---|---|---|---|
| General-purpose | E2, N2, N2D, N4, C4, Tau T2D | Web servers, dev/test, small-medium DBs | Balanced price/performance; E2 is cheapest |
| Compute-optimized | C2, C2D, H3, H4D | HPC, batch, gaming, single-threaded apps | Highest per-core performance |
| Memory-optimized | M2, M3, X4 | SAP HANA, in-memory DBs, large caches | M2: up to 12 TB RAM; X4: up to 32 TB RAM |
| Storage-optimized | Z3 | Local databases, data warehouses, distributed file systems | Up to 72 TiB local SSD, sub-ms latency |
| Accelerator-optimized | A2, A3, A4, G2 | ML training, inference, rendering, CUDA | Attached GPUs (NVIDIA) |
Custom machine types: Available on N-series and E2. Allow specifying exact vCPU and memory (in 256 MB increments). Cost 5% more than equivalent predefined types but eliminate waste from over-provisioning.
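The 5%-premium-versus-waste tradeoff is easy to see with arithmetic. The per-hour rates below are hypothetical placeholders for illustration only (real prices vary by region and series; check the pricing page):

```python
# Hypothetical component rates, for illustration only.
VCPU_RATE, GB_RATE = 0.033, 0.0044   # predefined per-vCPU / per-GB-hour
CUSTOM_PREMIUM = 1.05                # custom types cost ~5% more

def hourly_cost(vcpus: int, memory_gb: float, custom: bool) -> float:
    base = vcpus * VCPU_RATE + memory_gb * GB_RATE
    return round(base * (CUSTOM_PREMIUM if custom else 1.0), 4)

# A workload needing 4 vCPU / 8 GB: the predefined n2-standard-4 ships
# with 16 GB, so the custom shape wins despite the 5% premium.
predefined = hourly_cost(4, 16, custom=False)
custom = hourly_cost(4, 8, custom=True)
print(predefined, custom, custom < predefined)
```

The pattern generalizes: custom types pay off whenever the avoided over-provisioning exceeds the premium.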
Shared-core machines: e2-micro, e2-small, e2-medium provide burstable CPU at lowest cost. Use for lightweight workloads (microservices, dev instances).
Preemptible VMs vs. Spot VMs
| Feature | Preemptible VMs (Legacy) | Spot VMs |
|---|---|---|
| Pricing | 60-91% discount | 60-91% discount |
| Max lifetime | 24 hours | No maximum |
| Preemption | Google can reclaim with 30s notice | Google can reclaim with 30s notice |
| Availability | Not guaranteed | Not guaranteed |
| Live migration | No | No |
| Use cases | Batch processing, fault-tolerant workloads | Same, but preferred over preemptible |
Exam trap: Spot VMs replace preemptible VMs. The key difference is Spot VMs have no 24-hour maximum lifetime. For new deployments, always choose Spot VMs. If the exam mentions cost optimization for fault-tolerant batch workloads, Spot VMs is the answer.
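A provisioning sketch (instance name, zone, and machine type hypothetical). The --provisioning-model=SPOT flag replaces the legacy --preemptible flag:

```shell
# Create a Spot VM for fault-tolerant batch work; on preemption the
# instance is deleted rather than stopped:
gcloud compute instances create batch-worker-1 \
    --zone=us-central1-a \
    --machine-type=e2-standard-4 \
    --provisioning-model=SPOT \
    --instance-termination-action=DELETE
```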
Sole-Tenant Nodes, Shielded VMs, Confidential VMs
| Feature | Purpose | When to Use |
|---|---|---|
| Sole-tenant nodes | Dedicate physical server hardware to your VMs exclusively | Licensing requirements (per-core/per-processor), compliance mandating physical isolation |
| Shielded VMs | Verified boot integrity (Secure Boot, vTPM, Integrity Monitoring) | Default security hardening, protection against rootkits and bootkits |
| Confidential VMs | Encrypt data in use (memory encryption via AMD SEV or Intel TDX) | Processing sensitive data that must remain encrypted even in RAM |
Exam trap: Sole-tenant nodes provide physical isolation but do not encrypt memory. Confidential VMs encrypt memory but may share physical hardware with other tenants (they are isolated at the hardware encryption level). If the scenario requires both physical isolation and memory encryption, you need Confidential VMs on sole-tenant nodes.
Google Kubernetes Engine (GKE)
Autopilot vs. Standard
| Aspect | GKE Autopilot | GKE Standard |
|---|---|---|
| Node management | Google-managed | You manage node pools |
| Scaling | Automatic (pods trigger node provisioning) | You configure cluster autoscaler, node auto-provisioning |
| Security | Hardened by default (GKE Dataplane V2, Workload Identity, Shielded GKE Nodes) | You enable security features manually |
| Billing | Per-pod resource requests | Per-node (you pay for the whole node) |
| SLA | Covers control plane + compute capacity | Covers control plane only |
| GPU/TPU support | Yes (via ComputeClass) | Yes (via node pools) |
| Best for | Teams prioritizing agility, reduced ops | Teams needing full control over node configuration |
Key GKE concepts for the exam:
- Node pools: Groups of nodes with the same configuration (machine type, disk, labels). Standard mode only.
- Cluster autoscaler: Adds/removes nodes based on pending pod resource requests. Standard mode.
- Node auto-provisioning: Automatically creates new node pools when existing pools cannot schedule pods. Standard mode.
- Horizontal Pod Autoscaler (HPA): Scales pod replicas based on CPU, memory, or custom metrics.
- Vertical Pod Autoscaler (VPA): Adjusts resource requests/limits per pod. Cannot run simultaneously with HPA on the same metric.
- Workload Identity: Maps Kubernetes service accounts to Google Cloud IAM service accounts. Eliminates the need for exported service account keys. This is the recommended way to access Google Cloud services from GKE pods.
- GKE Dataplane V2: Built on Cilium/eBPF. Provides network policy enforcement, improved observability, and is default in Autopilot.
Exam trap: VPA and HPA cannot target the same metric on the same deployment. If you need both horizontal and vertical scaling, use HPA for CPU and VPA for memory (or use Multidimensional Pod Autoscaling in Autopilot). The exam may present a scenario where both are configured on the same metric -- that is incorrect.
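A sketch of the non-conflicting split (resource names hypothetical): the HPA scales replicas on CPU only, so a VPA on the same Deployment can own memory requests without contention.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu               # CPU only; leave memory to the VPA
      target:
        type: Utilization
        averageUtilization: 70
```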
Cloud Run and App Engine
| Feature | Cloud Run | App Engine Flexible | App Engine Standard |
|---|---|---|---|
| Unit of deployment | Container image | Container image | Application code (runtime-specific) |
| Scaling | 0 to 1000+ instances | 1 to N instances | 0 to N instances |
| Scale to zero | Yes (default) | No (min 1 instance) | Yes |
| Pricing | Per-request or per-instance | Per-instance-hour | Per-instance-hour (free tier) |
| VPC connectivity | Direct VPC, VPC connectors | VPC-native | VPC connectors |
| Custom runtimes | Any container | Any container | Limited runtimes (Python, Java, Go, Node.js, PHP, Ruby) |
| Request timeout | Up to 60 minutes | 60 minutes | 10 minutes (auto), 24h (manual/basic) |
| WebSockets | Yes | Yes | No |
Cloud Run also supports Jobs (run-to-completion tasks, no HTTP endpoint) and Worker Pools (pull-based background processing).
Exam trap: App Engine Flexible cannot scale to zero. If the scenario requires zero-cost when idle, Cloud Run or App Engine Standard is the answer. If the scenario requires custom container runtimes with scale-to-zero, Cloud Run is the only option.
Infrastructure as Code
Terraform / OpenTofu: The primary IaC tool for Google Cloud. Key exam concepts:
- State management: Use a remote backend (Cloud Storage bucket) for team collaboration. Enable state locking to prevent concurrent modifications.
- Modules: Reuse infrastructure patterns. Google publishes official Terraform modules for common patterns.
- terraform plan: Preview changes before applying. Critical for production safety.
- Deletion protection: Enable on stateful resources (databases, disks) to prevent accidental destruction.
Infrastructure Manager: Google Cloud's managed Terraform service. It runs Terraform in a serverless environment, stores state in Google Cloud, and integrates with IAM for access control. Use when you want managed Terraform without maintaining your own CI/CD pipeline for IaC.
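A Terraform sketch tying the two bullets above together (bucket, prefix, and instance names hypothetical):

```hcl
# Remote state in a GCS bucket gives the team shared, locked state:
terraform {
  backend "gcs" {
    bucket = "my-org-tf-state"
    prefix = "prod/networking"
  }
}

# Deletion protection on a stateful resource blocks an accidental
# terraform destroy of the database:
resource "google_sql_database_instance" "main" {
  name                = "prod-db"
  database_version    = "POSTGRES_16"
  region              = "us-central1"
  deletion_protection = true

  settings {
    tier = "db-custom-2-8192"
  }
}
```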
CI/CD with Cloud Build and Cloud Deploy
- Cloud Build: Serverless CI/CD platform. Executes build steps defined in cloudbuild.yaml. Supports triggers from Cloud Source Repositories, GitHub, and Bitbucket. Can build container images, run tests, and deploy to any Google Cloud service.
- Cloud Deploy: Managed continuous delivery service specifically for GKE, Cloud Run, and Anthos. Supports canary deployments (gradual traffic shift), blue-green deployments (instant traffic swap), and approval gates between environments.
- Artifact Registry: Stores container images, language packages (Maven, npm, Python), and OS packages. Replaces Container Registry (deprecated).
Exam trap: Cloud Build is for building and testing (CI). Cloud Deploy is for promotion across environments (CD). If the scenario describes promoting a release from dev to staging to production with approval gates, Cloud Deploy is the answer, not Cloud Build alone.
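The CI/CD handoff looks like this in a cloudbuild.yaml sketch (repository path, pipeline, and service names hypothetical): Cloud Build builds and pushes the image, then hands it to Cloud Deploy, which owns promotion and approval gates.

```yaml
steps:
# CI: build and push the container image to Artifact Registry.
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'us-docker.pkg.dev/$PROJECT_ID/app/web:$SHORT_SHA', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'us-docker.pkg.dev/$PROJECT_ID/app/web:$SHORT_SHA']
# CD handoff: create a Cloud Deploy release; promotion through staging
# and production (with approvals) happens in the delivery pipeline.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: gcloud
  args: ['deploy', 'releases', 'create', 'rel-$SHORT_SHA',
         '--delivery-pipeline=web-pipeline', '--region=us-central1',
         '--images=web=us-docker.pkg.dev/$PROJECT_ID/app/web:$SHORT_SHA']
images:
- 'us-docker.pkg.dev/$PROJECT_ID/app/web:$SHORT_SHA'
```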
2.4 Leveraging Vertex AI for End-to-End ML Workflows
This section is new in the v6.1 exam (October 2025). It tests your ability to architect ML infrastructure, not build models.
Vertex AI Platform Overview
Vertex AI is a unified platform for the full ML lifecycle: data preparation, training, evaluation, deployment, and monitoring.
Core ML workflow components:
| Stage | Service | Purpose |
|---|---|---|
| Data prep | Workbench notebooks, BigQuery, Dataproc Serverless | Explore, clean, and transform data |
| Training | AutoML or Custom Training | Build models without code (AutoML) or with full control (Custom Training) |
| Evaluation | Model Evaluation | Compare model metrics within pipelines |
| Deployment | Prediction endpoints | Online (real-time) or batch inference |
| Monitoring | Model Monitoring | Detect training-serving skew and data drift |
| Feature management | Feature Store | Centralized feature repository for training and serving consistency |
Vertex AI Pipelines (MLOps)
Vertex AI Pipelines orchestrate ML workflows as DAGs (directed acyclic graphs). They automate the sequence of data preprocessing, training, evaluation, and deployment.
- Built on Kubeflow Pipelines or TFX (TensorFlow Extended).
- Serverless execution: No cluster management required.
- Experiments: Track hyperparameters, architectures, and metrics across training runs.
- Model Registry: Version and manage models through their lifecycle.
AI Hypercomputer and Accelerators
For large-scale training, the exam tests your knowledge of compute options:
| Accelerator | Type | Use Case |
|---|---|---|
| NVIDIA GPUs (A100, H100, L4) | GPU | General ML training, inference, fine-tuning |
| Cloud TPUs (v4, v5e, v5p) | TPU | Large language model training, high-throughput inference |
| AI Hypercomputer | Integrated stack | Combines TPU/GPU hardware with software optimizations (Multislice, Pathways) for distributed training at scale |
Key decision points:
- TPUs: Best for large-scale training of models built with JAX or TensorFlow. Purpose-built for matrix operations. Not suitable for PyTorch (though support is improving).
- GPUs: More flexible. Support PyTorch, TensorFlow, JAX, and CUDA workloads. Better for inference and smaller training jobs.
- AI Hypercomputer: The answer when the scenario describes training very large models (hundreds of billions of parameters) that exceed single-accelerator capacity. Provides Multislice training across TPU pods.
Custom Model Runtimes
Vertex AI supports deploying models with:
- Pre-built containers: For TensorFlow, PyTorch, scikit-learn, XGBoost. No custom Docker image needed.
- Custom containers: Bring your own serving container for any framework. Must expose an HTTP endpoint for predictions.
- Optimized TensorFlow Runtime: Google-optimized TensorFlow serving with better latency and throughput.
Exam trap: If the scenario describes deploying a model built with an uncommon framework, the answer is a custom container on a Vertex AI prediction endpoint -- not rewriting the model in TensorFlow.
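What "expose an HTTP endpoint" means in practice: the container must answer health checks and prediction requests on the port and routes Vertex AI passes in via AIP_* environment variables. A stdlib-only sketch, assuming those conventions; the "model" here is a stub you would replace with real inference:

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Vertex AI injects these into custom serving containers; the defaults
# below follow the documented conventions.
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")

class PredictionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health checks must return 200 before the endpoint takes traffic.
        if self.path == HEALTH_ROUTE:
            self._reply(200, {"status": "healthy"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != PREDICT_ROUTE:
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Stub inference (string length per instance); replace with your
        # framework's predict call.
        predictions = [len(str(x)) for x in body.get("instances", [])]
        self._reply(200, {"predictions": predictions})

    def _reply(self, code, payload):
        data = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep the sketch's logs quiet
        pass

def serve():
    """Container entrypoint: listen on the Vertex AI-assigned port."""
    HTTPServer(("0.0.0.0", PORT), PredictionHandler).serve_forever()
```

The request/response shape ({"instances": [...]} in, {"predictions": [...]} out) matches the Vertex AI prediction interface, which is what lets any framework sit behind the same endpoint.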
2.5 Configuring Pre-Built Solutions or APIs with Vertex AI
Also new in v6.1. This section tests your ability to select and configure pre-built AI capabilities rather than training from scratch.
Gemini LLMs
Gemini is Google's multimodal foundation model family. Key facts for the exam:
- Multimodal: Processes text, images, video, and audio in a single model.
- Model variants: Gemini Ultra (highest capability), Gemini Pro (balanced), Gemini Flash (fastest/cheapest).
- Enterprise features: Grounding (connect model responses to your data or Google Search), safety filters, context caching (reduce costs for repeated prompts), and system instructions.
- Fine-tuning: Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT/LoRA) for domain adaptation.
- Model Armor: Runtime defense against prompt injection, harmful content, and data leakage.
Vertex AI Agent Builder
Agent Builder enables building conversational AI agents that can:
- Ground responses in enterprise data (Cloud Storage, BigQuery, websites).
- Execute multi-step tasks using tool calling (function calling).
- Integrate with search (Vertex AI Search) for retrieval-augmented generation (RAG).
- Deploy via Agent Engine for managed hosting and scaling.
The Agent Development Kit (ADK) provides the orchestration framework for building agents that combine LLM reasoning with tool execution.
Model Garden
Model Garden provides access to 200+ enterprise-ready models:
- Google models: Gemini, Imagen (image generation), Chirp (speech-to-text), Codey (code generation).
- Partner models: Anthropic Claude, Meta Llama, Mistral.
- Open-source models: Deploy and fine-tune community models on Vertex AI infrastructure.
Models can be deployed to Vertex AI endpoints with one click or customized before deployment.
Colab Enterprise
Colab Enterprise is a managed Jupyter notebook environment within Google Cloud:
- VPC-native: Runs inside your VPC with Private Google Access.
- IAM-integrated: Controlled access via Google Cloud IAM.
- Compute options: Configurable machine types, GPUs, and TPUs for notebook runtimes.
- Collaboration: Real-time shared editing with audit logging.
Use Colab Enterprise when the scenario requires a secure, governed notebook environment (not the public Colab).
Exam Strategy: Domain 2 Decision Frameworks
Network Connectivity Decision Tree
Need to connect on-premises to Google Cloud?
├── Bandwidth < 3 Gbps → HA VPN
├── Bandwidth 3-50 Gbps, no colocation → Partner Interconnect
├── Bandwidth 10-200 Gbps, have colocation → Dedicated Interconnect
└── Connecting to another cloud provider → Cross-Cloud Interconnect
Need VPC-to-VPC connectivity?
├── Same org, centralized admin → Shared VPC
├── Different orgs or same org with peer autonomy → VPC Peering
└── Multiple VPCs with transitive routing → Network Connectivity Center (hub-and-spoke)
Storage Decision Tree
Structured data?
├── Relational, single-region → Cloud SQL (or AlloyDB for PostgreSQL)
├── Relational, global scale → Cloud Spanner
├── Key-value, massive throughput → Bigtable
├── Document/JSON, real-time sync → Firestore
├── Analytics/OLAP → BigQuery
└── Caching layer → Memorystore
Unstructured data?
├── Object storage → Cloud Storage (choose class by access frequency)
└── Shared file system (NFS) → Filestore (choose tier by performance/availability)
Compute Decision Tree
Containers?
├── Managed Kubernetes → GKE (Autopilot for simplicity, Standard for control)
├── Serverless containers → Cloud Run
└── Serverless + scale to zero + no containers → App Engine Standard
VMs?
├── General workloads → N-series or E2
├── HPC/batch → C-series (compute-optimized)
├── In-memory databases → M-series (memory-optimized)
├── Local SSD-intensive → Z3 (storage-optimized)
└── ML training/inference → A-series (accelerator-optimized)
Cost optimization?
├── Fault-tolerant batch → Spot VMs (60-91% discount)
├── Steady-state workloads → Committed Use Discounts (1yr or 3yr)
└── Variable workloads → Sustained Use Discounts (automatic)
Common Exam Traps Summary
| Trap | Correct Understanding |
|---|---|
| Classic VPN has 99.99% SLA | No. Classic VPN is 99.9%. HA VPN is 99.99%. |
| VPC Peering is transitive | No. A-B peering + B-C peering does not give A-C connectivity. |
| Archive storage has slow retrieval | No. Retrieval is milliseconds. The cost is in retrieval fees and 365-day minimum billing. |
| Spot VMs have a 24-hour limit | No. That was preemptible VMs. Spot VMs have no maximum lifetime. |
| VPA and HPA can scale on the same metric | No. They conflict. Use different metrics or Multidimensional Pod Autoscaling. |
| App Engine Flexible scales to zero | No. Minimum 1 instance. Cloud Run and App Engine Standard scale to zero. |
| Cloud SQL scales horizontally for writes | No. Cloud SQL scales vertically. Read replicas help with reads only. Spanner scales writes horizontally. |
| Private Google Access works on VMs with external IPs | It is unnecessary. VMs with external IPs already access Google APIs directly. PGA is for internal-only VMs. |
| Confidential VMs provide physical isolation | No. They provide memory encryption. Sole-tenant nodes provide physical isolation. |
References
- VPC Network Overview
- Cloud Interconnect Overview
- Cloud VPN Overview
- Shared VPC Documentation
- VPC Service Controls Overview
- Cloud Storage Classes
- Google Cloud Databases
- Compute Engine Machine Resource
- GKE Autopilot Overview
- Cloud Run Overview
- Vertex AI Introduction
- Filestore Service Tiers
- Terraform on Google Cloud Best Practices
- Cloud Deploy Overview
- Cloud Build Documentation