
Domain 2: Exploring Data Transformation with Google Cloud (~16%)

Domain 2 of the Google Cloud Digital Leader exam tests your understanding of how data drives business value and which Google Cloud products manage, process, and analyze that data. At approximately 16% of the exam, expect 8-10 questions across three objectives: the value of data, Google Cloud data management solutions, and making data useful and accessible.

This domain is product-heavy. You must know what each service does, when to choose it over alternatives, and how the services connect to form end-to-end data pipelines.

1. The Value of Data (Section 2.1)

1.1 Data as a Business Driver

Organizations use data to generate business insights, drive decision-making, and create new value. The exam expects you to understand that digital transformation fundamentally depends on an organization's ability to collect, manage, and act on data. Data is not just a byproduct of operations; it is a strategic asset. (Cloud Digital Leader Exam Guide)

Key concepts:

  • Descriptive analytics -- what happened (dashboards, reports)
  • Diagnostic analytics -- why it happened (drill-down analysis)
  • Predictive analytics -- what will happen (ML models, forecasting)
  • Prescriptive analytics -- what should we do (optimization, recommendations)

Cloud makes advanced analytics accessible because organizations no longer need to provision massive on-premises infrastructure to store and process data. They can scale compute and storage independently, paying only for what they use.

1.2 Databases vs. Data Warehouses vs. Data Lakes

This is a high-probability exam topic. Know the definitions and the differences cold.

Characteristic | Database | Data Warehouse | Data Lake
Primary purpose | Transactional operations (OLTP) | Analytical queries (OLAP) | Raw data storage for future processing
Data structure | Structured (schema-on-write) | Structured/semi-structured (schema-on-write) | Structured, semi-structured, and unstructured (schema-on-read)
Data freshness | Real-time current state | Periodically loaded (ETL) | Raw ingestion, processed later
Typical users | Applications, developers | Business analysts, data scientists | Data engineers, data scientists
Query pattern | Many small reads/writes | Few complex analytical queries | Varies by downstream consumer
Google Cloud example | Cloud SQL, Firestore | BigQuery | Cloud Storage (with analytics tools)

Exam trap: A data lake is not a database. A data lake stores raw, unprocessed data in its native format. The processing and schema are applied when the data is read, not when it is written. BigQuery blurs this line because it now functions as both a data warehouse and a data lakehouse, supporting structured and unstructured data with open table formats like Apache Iceberg. (BigQuery Introduction)
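
The schema-on-write vs. schema-on-read distinction can be made concrete with a small sketch. This is illustrative Python, not a Google Cloud API: the "warehouse" validates a schema at load time, while the "lake" stores raw bytes and applies structure only when the data is read.

```python
# Illustrative sketch (not a Google Cloud API): schema-on-write validates at
# load time (warehouse-style); schema-on-read stores raw data and parses it
# only at query time (lake-style).
import json

WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}

def warehouse_load(row: dict) -> dict:
    """Schema-on-write: reject rows that do not match the schema up front."""
    for field, ftype in WAREHOUSE_SCHEMA.items():
        if not isinstance(row.get(field), ftype):
            raise ValueError(f"schema violation on field {field!r}")
    return row

def lake_store(raw: str) -> str:
    """Schema-on-read: accept anything; the bytes are stored as-is."""
    return raw

def lake_read(raw: str) -> dict:
    """The schema is applied only when the data is read."""
    return json.loads(raw)

# The warehouse enforces structure at write time...
warehouse_load({"order_id": 7, "amount": 19.99})
# ...while the lake stores the raw payload and defers parsing to read time.
blob = lake_store('{"order_id": 7, "amount": 19.99}')
print(lake_read(blob)["amount"])  # schema applied here, at read time
```

The practical consequence: a lake never rejects data at ingest, so quality problems surface later, at read time.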

1.3 Data Types

Type | Definition | Examples | Where to Store on Google Cloud
Structured | Predefined schema, rows/columns | Financial transactions, inventory records | Cloud SQL, Cloud Spanner, BigQuery
Semi-structured | Flexible schema, self-describing | JSON, XML, Avro, Parquet | Firestore, BigQuery, Cloud Storage
Unstructured | No predefined schema | Images, video, audio, PDFs, logs | Cloud Storage

1.4 Data Governance

Data governance is the set of policies, processes, and standards that ensure data is managed properly throughout its lifecycle. The exam tests awareness of governance, not deep implementation details. (Cloud Digital Leader Exam Guide)

Core principles:

  • Data quality -- accuracy, completeness, consistency, timeliness
  • Data security -- access controls, encryption, audit trails
  • Data privacy -- compliance with regulations (GDPR, HIPAA, CCPA)
  • Data lineage -- tracking where data came from and how it was transformed
  • Data cataloging -- metadata management so users can discover and understand available data

Google Cloud's governance tool is Dataplex Universal Catalog, which provides unified data governance across data lakes, warehouses, and databases. It integrates with BigQuery to manage metadata, data quality, and lineage across your entire data landscape. (Dataplex Universal Catalog)

1.5 Real-Time vs. Batch Processing

Aspect | Batch Processing | Real-Time (Stream) Processing
Data handling | Collects data over time, processes in bulk | Processes data as it arrives
Latency | Minutes to hours | Milliseconds to seconds
Use cases | Monthly reports, ETL jobs, billing | Fraud detection, live dashboards, IoT monitoring
Google Cloud tools | Dataflow (batch mode), BigQuery batch loads | Dataflow (streaming mode), Pub/Sub, BigQuery streaming
Trade-off | Higher throughput, lower cost per record | Lower latency, higher cost per record
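
A toy sketch makes the batch/stream distinction tangible (this is an analogy, not Dataflow code): batch waits for the full dataset before producing one answer; streaming emits an updated result as each event arrives.

```python
# Toy sketch contrasting batch and streaming aggregation (not Dataflow code).

def batch_total(events):
    """Batch: collect everything first, then process in bulk."""
    return sum(events)

def stream_totals(events):
    """Streaming: emit an updated running total as each event arrives."""
    total = 0
    for e in events:
        total += e
        yield total  # a live dashboard could render each intermediate value

events = [5, 3, 7]
print(batch_total(events))          # one answer, after all data is in: 15
print(list(stream_totals(events)))  # continuous answers: [5, 8, 15]
```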

2. Google Cloud Data Management Solutions (Section 2.2)

This is the core of Domain 2. You need to match each service to its correct use case. The exam tests whether you can pick the right tool for a given scenario.

2.1 Cloud SQL

Cloud SQL is a fully managed relational database service supporting MySQL, PostgreSQL, and SQL Server.

Feature | Detail
Type | Relational (SQL), ACID-compliant
Managed engines | MySQL, PostgreSQL, SQL Server
Scale | Vertical scaling (larger machines); up to 64 TB of storage
Scope | Regional (single region)
High availability | Regional HA with automatic failover across zones
Read scaling | Read replicas (same region or cross-region)
Use cases | Web apps, CMS, CRM, ERP, e-commerce, SaaS
When to choose | Traditional relational workloads that fit in one region and do not need horizontal scaling

Exam trap: Cloud SQL scales vertically, not horizontally. If the question describes a globally distributed application requiring unlimited horizontal scaling with strong consistency, the answer is Cloud Spanner, not Cloud SQL.

2.2 Cloud Spanner

Cloud Spanner is a globally distributed, horizontally scalable, strongly consistent relational database.

Feature | Detail
Type | Relational (SQL), ACID-compliant
Scale | Horizontal scaling across regions; petabyte-scale
Scope | Regional or multi-regional (global)
Availability | Up to 99.999% (five nines) for multi-region configurations
Consistency | Strong external consistency (strongest possible)
Use cases | Global financial ledgers, gaming leaderboards, supply chain, payment processing, inventory management
When to choose | You need relational semantics (SQL, joins, ACID) at global scale with strong consistency

Exam trap: Spanner is significantly more expensive than Cloud SQL. If a question describes a simple regional web application, Cloud SQL is the correct answer. Spanner is overkill for workloads that do not require global distribution or massive horizontal scaling.

2.3 Cloud Bigtable

Cloud Bigtable is a fully managed, wide-column NoSQL database designed for large analytical and operational workloads.

Feature | Detail
Type | NoSQL, wide-column (HBase-compatible API)
Scale | Petabytes of data, millions of reads/writes per second
Latency | Single-digit millisecond
Consistency | Strongly consistent within a single cluster; eventually consistent across replicated clusters
Query support | No SQL, no joins, no multi-row transactions
Use cases | IoT time-series data, financial tick data, ad tech, personalization/recommendations, monitoring, geospatial
When to choose | Massive throughput of simple key-value lookups or range scans; data > 10 TB

Exam trap: Bigtable does not support SQL queries, joins, or multi-row transactions. If a question requires complex queries with joins, Bigtable is wrong. If the question emphasizes low-latency reads/writes on terabytes to petabytes of time-series or IoT data, Bigtable is correct.
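
Bigtable's access pattern can be sketched with a plain sorted dictionary. This is a hypothetical model, not the google-cloud-bigtable client: rows are keyed by a designed row key (here, device id plus a reversed timestamp so the newest reading sorts first), and reads are key lookups or prefix range scans, with no joins.

```python
# Hypothetical sketch of the Bigtable data model using a plain dict -- not the
# google-cloud-bigtable client. The row key design does the work: prefix scans
# replace queries, and a reversed timestamp puts the newest reading first.

MAX_TS = 10**10  # reversed-timestamp trick: newer rows get a smaller suffix

def row_key(device_id: str, ts: int) -> str:
    return f"{device_id}#{MAX_TS - ts:010d}"

table = {}  # row key -> {column_family:qualifier: value}

def write(device_id, ts, temp):
    table[row_key(device_id, ts)] = {"sensor:temp": temp}

def scan_prefix(prefix):
    """Range scan: all rows for one device, newest first (lexicographic order)."""
    return [table[k] for k in sorted(table) if k.startswith(prefix)]

write("sensor-42", 1000, 21.5)
write("sensor-42", 2000, 22.1)
print(scan_prefix("sensor-42#")[0])  # the newest reading comes back first
```

The design choice to note: in Bigtable, you model queries into the row key up front, because there is no query planner to rescue a bad schema later.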

2.4 Firestore

Firestore is a serverless, NoSQL document database designed for mobile, web, and IoT applications.

Feature | Detail
Type | NoSQL, document-oriented (collections/documents)
Scale | Automatic scaling, suitable for 0 to a few TB
Consistency | Strong consistency
Real-time | Built-in real-time listeners (data syncs to clients instantly)
Offline support | Client SDKs support offline data access and sync
Use cases | Mobile/web apps, real-time collaboration, user profiles, game state, shopping carts
When to choose | Mobile/web apps needing real-time sync, offline support, and flexible document schemas

Exam trap: Firestore is the successor to Cloud Datastore. If you see "Datastore" in an exam question, understand it is the legacy name. For new applications, Firestore (in Native mode) is the recommended choice. Firestore is strongly consistent; Bigtable is eventually consistent (except single-row reads).

2.5 Cloud Storage

Cloud Storage is a unified object storage service for any amount of data. It is not a database -- it stores files (objects) in buckets.

Storage Classes

All four classes use the same API and tools. The difference is cost structure: cheaper storage costs are offset by higher retrieval costs and minimum storage durations. (Storage Classes)

Storage Class | Min Duration | Access Pattern | Availability SLA (multi-region) | Use Cases
Standard | None | Frequently accessed ("hot" data) | 99.95% | Website content, streaming media, active data
Nearline | 30 days | ~Once per month | 99.9% | Backups, long-tail content, monthly analytics
Coldline | 90 days | ~Once per quarter | 99.9% | Disaster recovery, quarterly reporting
Archive | 365 days | Less than once per year | 99.9% | Regulatory compliance, long-term retention

Key facts for the exam:

  • Durability: 99.999999999% (eleven 9s) annual durability across all classes
  • Retrieval: Archive data is available within milliseconds, not hours or days (unlike AWS Glacier's restore delay)
  • Autoclass: Automatically transitions objects between storage classes based on access patterns to optimize cost
  • Unified API: Same tools and API regardless of storage class; no need to change application code
  • No minimum object size: unlike some competing object stores

Exam trap: Archive storage has a 365-day minimum storage duration. If you delete or overwrite an object before 365 days, you are charged for the full 365 days. The same principle applies to Nearline (30 days) and Coldline (90 days). Standard has no minimum.

2.6 BigQuery

BigQuery is Google Cloud's fully managed, serverless, enterprise data warehouse and analytics platform. It is arguably the most important service in Domain 2.

Feature | Detail
Type | Serverless data warehouse / data lakehouse
Architecture | Separated storage and compute (scale independently)
Query language | Standard SQL (ANSI SQL:2011 compliant)
Performance | Terabytes in seconds, petabytes in minutes
Data formats | Structured and semi-structured; supports Apache Iceberg, Delta, Hudi
ML integration | BigQuery ML -- create and run ML models using SQL
Multicloud | BigQuery Omni allows querying data in AWS and Azure without moving it
Streaming | Supports real-time streaming ingestion
BI integration | Native integration with Looker, Looker Studio, Google Sheets, and third-party tools (Tableau, Power BI)
Pricing models | On-demand (pay per TB scanned) or capacity-based (reserved slots)
Use cases | Enterprise analytics, reporting, data lakehouse, ML model training, multicloud analytics

(BigQuery Introduction)

Key concepts to know:

  • Storage-compute separation: BigQuery stores data in its distributed storage layer (Colossus) and uses a separate compute engine (Dremel) for queries. This means storage and compute scale independently, and you are not paying for idle compute.
  • BigQuery ML: Lets you create, train, and predict with ML models using SQL. No need to export data or learn a separate ML framework. Supports linear regression, logistic regression, k-means clustering, time-series forecasting, and more.
  • Federated queries: Query data in Cloud Storage, Bigtable, Spanner, or Google Sheets without loading it into BigQuery.
  • BigQuery Omni: Run BigQuery analytics on data stored in AWS S3 or Azure Blob Storage without copying data to Google Cloud.
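
To make the BigQuery ML point concrete, here is the general shape of a CREATE MODEL statement and a prediction query. The dataset, table, and column names are hypothetical; the statements are shown as Python strings so they can be inspected here -- in practice you would run them directly in BigQuery.

```python
# Sketch of BigQuery ML statements (dataset/model/column names are made up).
# Training and prediction are plain SQL -- no data export, no separate ML
# framework.

create_model_sql = """
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `mydataset.customers`;
""".strip()

predict_sql = """
SELECT * FROM ML.PREDICT(MODEL `mydataset.churn_model`,
                         (SELECT * FROM `mydataset.new_customers`));
""".strip()

print(create_model_sql)
```

The point for the exam: the entire train/predict loop stays inside the warehouse and is expressed in SQL.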

Exam trap: BigQuery is serverless -- you do not manage servers, clusters, or infrastructure. If a question asks about a managed analytics solution that requires no infrastructure management, BigQuery is likely the answer. Also, BigQuery is not a transactional database. Do not confuse it with Cloud SQL or Cloud Spanner.

2.7 Database Selection Decision Guide

Use this table to match scenario keywords to the correct service:

Scenario Keywords | Correct Service
MySQL, PostgreSQL, SQL Server, regional, lift-and-shift | Cloud SQL
Global, horizontal scaling, relational, 99.999% availability, financial | Cloud Spanner
IoT, time-series, low-latency, wide-column, petabytes, HBase | Cloud Bigtable
Mobile app, web app, real-time sync, offline, document database | Firestore
Object storage, images, videos, backups, archive, unstructured files | Cloud Storage
Analytics, data warehouse, serverless, SQL on petabytes, ML, dashboards | BigQuery
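
The decision table can be drilled as a tiny keyword matcher. This is a study aid under our own assumptions, not an official selection algorithm: the keyword sets come from the table above, and the first matching rule wins.

```python
# Study-aid sketch: the database decision guide as a keyword matcher.
# Rule order matters -- more specific signals (global + relational) are
# checked before generic ones.

RULES = [
    ({"global", "horizontal", "99.999%"}, "Cloud Spanner"),
    ({"mysql", "postgresql", "sql server", "lift-and-shift"}, "Cloud SQL"),
    ({"iot", "time-series", "wide-column", "hbase"}, "Cloud Bigtable"),
    ({"mobile", "offline", "real-time sync", "document"}, "Firestore"),
    ({"object", "images", "videos", "archive"}, "Cloud Storage"),
    ({"warehouse", "analytics", "serverless", "dashboards"}, "BigQuery"),
]

def pick_service(scenario: str) -> str:
    words = scenario.lower()
    for keywords, service in RULES:
        if any(k in words for k in keywords):
            return service
    return "insufficient keywords -- re-read the scenario"

print(pick_service("Globally distributed ledger needing horizontal scaling"))
print(pick_service("Serverless analytics over petabytes with dashboards"))
```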

3. Making Data Useful and Accessible (Section 2.3)

3.1 Looker

Looker is Google Cloud's enterprise business intelligence (BI) platform. It makes data accessible to non-technical users through self-service analytics.

Key capabilities:

  • LookML: A modeling language that defines data relationships, business logic, and metrics in a reusable, version-controlled layer. This ensures everyone in the organization works from a single source of truth.
  • Data democratization: Enables business users to explore data and build their own reports without writing SQL.
  • Embedded analytics: Looker can embed dashboards and visualizations directly into applications and portals.
  • BigQuery integration: Native, optimized connection to BigQuery for real-time analytics.
  • Governed metrics: Centralized metric definitions prevent conflicting interpretations of business KPIs.

Looker vs. Looker Studio: Looker is the enterprise BI platform (LookML modeling, governed metrics, embedded analytics, API access). Looker Studio (formerly Data Studio) is a free, self-service dashboarding and reporting tool. Looker Studio is simpler and focused on visualization; Looker adds the governance and semantic modeling layer.

Exam trap: When a question mentions "data democratization," "self-service analytics," or "single source of truth for metrics," the answer is Looker. When the question just needs a simple dashboard or report, Looker Studio may be sufficient.

3.2 Pub/Sub

Pub/Sub is a fully managed, real-time messaging service for event-driven architectures.

Feature | Detail
Model | Publish-subscribe (producers publish messages to topics; subscribers receive messages)
Delivery | At-least-once delivery; supports push and pull subscriptions
Scale | Billions of messages per day, globally
Ordering | Optional message ordering per key
Retention | Configurable message retention (up to 31 days)
Serverless | No provisioning or capacity planning required

(Pub/Sub Overview)

Use cases:

  • Event ingestion: Capture user interactions, IoT sensor data, application logs
  • Decoupling microservices: Publishers and subscribers operate independently; neither needs to know about the other
  • Streaming data pipelines: Feed real-time data into Dataflow, BigQuery, or Cloud Storage
  • Enterprise event bus: Distribute business events across teams and applications

How Pub/Sub fits into pipelines: In a typical streaming architecture, data producers publish events to Pub/Sub topics. Dataflow subscribes to those topics, transforms the data in real time, and writes results to BigQuery or Cloud Storage for analysis in Looker.
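
The decoupling described above can be sketched with a minimal in-memory topic (an analogy, not the google-cloud-pubsub client): the publisher knows nothing about its subscribers, and every subscription receives its own copy of each message.

```python
# Minimal in-memory sketch of publish-subscribe fan-out -- not the
# google-cloud-pubsub client. Publisher and subscribers are fully decoupled.

class Topic:
    def __init__(self):
        self.subscriptions = []

    def subscribe(self, callback):
        self.subscriptions.append(callback)

    def publish(self, message):
        for deliver in self.subscriptions:  # fan-out: each subscriber gets a copy
            deliver(message)

events = Topic()
warehouse_rows, audit_log = [], []
events.subscribe(warehouse_rows.append)  # e.g. a Dataflow-style consumer
events.subscribe(audit_log.append)       # an independent consumer

events.publish({"user": "ada", "action": "checkout"})
print(warehouse_rows == audit_log)  # both received the event independently
```

Adding a third consumer requires no change to the publisher -- that is the decoupling the exam asks about.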

3.3 Dataflow

Dataflow is a fully managed service for stream and batch data processing, built on the open-source Apache Beam SDK.

Feature | Detail
Programming model | Apache Beam (unified batch and stream processing)
Execution | Fully managed, serverless (auto-scaling workers)
Scale | Up to 4,000 workers per job; routinely processes petabytes
Languages | Java, Python, Go
Use cases | ETL pipelines, real-time analytics, data enrichment, log processing

(Dataflow Overview)

Key concepts:

  • Unified model: The same Apache Beam pipeline code processes both batch (bounded) and streaming (unbounded) data. You write once and run either way.
  • PCollections and PTransforms: Data in Beam is a PCollection (a dataset). Transformations are PTransforms (operations applied to PCollections).
  • Windowing: Groups unbounded streaming data into finite windows (fixed, sliding, session) for aggregation.
  • Autoscaling: Dataflow automatically adds or removes workers based on pipeline throughput, optimizing cost and performance.
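
Windowing is the concept most worth internalizing. Here is a toy sketch (plain Python, not Apache Beam code): unbounded events are bucketed into fixed 60-second windows by timestamp, then aggregated per window.

```python
# Toy sketch of fixed windowing -- not Apache Beam code. Each event's
# timestamp is floored to its 60-second window start, then counted per window.
from collections import defaultdict

WINDOW_SECONDS = 60

def fixed_window_counts(events):
    """events: iterable of (timestamp_seconds, value) -> event count per window."""
    windows = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start] += 1
    return dict(windows)

clicks = [(5, "a"), (42, "b"), (61, "c"), (130, "d")]
print(fixed_window_counts(clicks))  # {0: 2, 60: 1, 120: 1}
```

Sliding and session windows refine the same idea: they change how window boundaries are drawn, not the principle of turning an infinite stream into finite groups.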

Exam trap: Dataflow is the processing engine; Pub/Sub is the messaging layer. They are complementary, not competing. A common exam scenario: "ingest real-time events" (Pub/Sub) then "transform and load into BigQuery" (Dataflow).

3.4 End-to-End Streaming Architecture

The exam may test your understanding of how these services connect. Here is the canonical Google Cloud streaming analytics pipeline:

Data Sources → Pub/Sub → Dataflow → BigQuery → Looker
(IoT, apps,    (ingest)   (transform)  (store/    (visualize/
 logs, events)                          analyze)    report)
Stage | Service | Role
Ingest | Pub/Sub | Collect events from any number of producers
Process | Dataflow | Transform, enrich, aggregate data in real time
Store/Analyze | BigQuery | Warehouse the processed data; run SQL analytics
Visualize | Looker / Looker Studio | Build dashboards and reports; share insights
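
The four stages above can be sketched as composed plain functions. This is an analogy for study purposes, not real client code: each stage hands its output to the next, mirroring Pub/Sub → Dataflow → BigQuery → Looker.

```python
# Analogy sketch of the canonical pipeline as composed functions -- not real
# Google Cloud client code. Each function stands in for one stage.

def ingest(raw_events):           # Pub/Sub: collect events from producers
    return list(raw_events)

def transform(events):            # Dataflow: clean and enrich in flight
    return [{"user": e["user"], "spend": round(e["spend"], 2)} for e in events]

def store_and_aggregate(rows):    # BigQuery: SQL-style aggregation
    return sum(r["spend"] for r in rows)

def visualize(total):             # Looker: render the insight for humans
    return f"Total spend today: ${total:.2f}"

raw = [{"user": "ada", "spend": 12.499}, {"user": "lin", "spend": 7.5}]
print(visualize(store_and_aggregate(transform(ingest(raw)))))
```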

3.5 Database Migration and Modernization

The exam covers awareness of migration strategies, not deep implementation details.

Database Migration Service (DMS): A fully managed service that migrates databases to Google Cloud with minimal downtime. Supports migrations to Cloud SQL and AlloyDB from sources like MySQL, PostgreSQL, SQL Server, and Oracle. (Database Migration Service)

Common migration strategies:

Strategy | Description | Example
Lift and shift | Move database as-is to a managed service | On-prem MySQL to Cloud SQL for MySQL
Migrate and modernize | Move and upgrade to a cloud-native service | Oracle DB to Cloud Spanner or AlloyDB
Replatform | Change the underlying engine | SQL Server to PostgreSQL on Cloud SQL

Exam trap: "Migration" questions on this exam are high-level. They want you to know that Google Cloud offers managed migration tools and that organizations can move to fully managed databases to reduce operational overhead. You will not be asked to configure DMS step by step.

4. Exam Preparation Tips for Domain 2

High-Frequency Topics

  1. BigQuery -- most-tested service in this domain. Know it is serverless, separates storage/compute, supports ML via SQL, and works as a data lakehouse.
  2. Storage class selection -- memorize the four classes, their minimum durations (none, 30, 90, 365 days), and access patterns.
  3. Database selection -- given a scenario, pick the right database. Cloud SQL for regional relational, Spanner for global relational, Bigtable for massive NoSQL throughput, Firestore for mobile/web with real-time sync.
  4. Pub/Sub + Dataflow pipeline -- understand the canonical streaming architecture and each component's role.
  5. Data types (structured/semi-structured/unstructured) and where each is stored.

Common Exam Traps

Trap | Correct Answer
Using Bigtable for complex SQL queries with joins | Bigtable has no SQL, no joins -- use BigQuery or Cloud SQL
Choosing Spanner for a simple regional web app database | Cloud SQL is cheaper and sufficient for regional workloads
Confusing BigQuery (analytics) with Cloud SQL (transactions) | BigQuery is for analytics/OLAP; Cloud SQL is for transactions/OLTP
Thinking Archive storage takes hours to retrieve | Cloud Storage Archive retrieves in milliseconds (unlike AWS Glacier)
Mixing up Pub/Sub (messaging) and Dataflow (processing) | Pub/Sub ingests; Dataflow transforms. They work together, not interchangeably.
Choosing Firestore when data exceeds 10+ TB | Firestore is for 0 to a few TB. For massive NoSQL, use Bigtable.

References