
Domain 2: Exploring Data Transformation with Google Cloud (~16%)

Domain 2 of the Google Cloud Digital Leader exam tests your understanding of how data drives business value and which Google Cloud products manage, process, and analyze that data. At approximately 16% of the exam, expect 8-10 questions across three objectives: the value of data, Google Cloud data management solutions, and making data useful and accessible.

This domain is product-heavy. You must know what each service does, when to choose it over alternatives, and how the services connect to form end-to-end data pipelines.

1. The Value of Data (Section 2.1)

1.1 Data as a Business Driver

Organizations use data to generate business insights, drive decision-making, and create new value. The exam expects you to understand that digital transformation fundamentally depends on an organization's ability to collect, manage, and act on data. Data is not just a byproduct of operations; it is a strategic asset. (Cloud Digital Leader Exam Guide)

Key concepts:

  • Descriptive analytics -- what happened (dashboards, reports)
  • Diagnostic analytics -- why it happened (drill-down analysis)
  • Predictive analytics -- what will happen (ML models, forecasting)
  • Prescriptive analytics -- what should we do (optimization, recommendations)

Cloud makes advanced analytics accessible because organizations no longer need to provision massive on-premises infrastructure to store and process data. They can scale compute and storage independently, paying only for what they use.

1.2 Databases vs. Data Warehouses vs. Data Lakes

This is a high-probability exam topic. Know the definitions and the differences cold.

Characteristic | Database | Data Warehouse | Data Lake
Primary purpose | Transactional operations (OLTP) | Analytical queries (OLAP) | Raw data storage for future processing
Data structure | Structured (schema-on-write) | Structured/semi-structured (schema-on-write) | Structured, semi-structured, and unstructured (schema-on-read)
Data freshness | Real-time current state | Periodically loaded (ETL) | Raw ingestion, processed later
Typical users | Applications, developers | Business analysts, data scientists | Data engineers, data scientists
Query pattern | Many small reads/writes | Few complex analytical queries | Varies by downstream consumer
Google Cloud example | Cloud SQL, Firestore | BigQuery | Cloud Storage (with analytics tools)

Exam trap: A data lake is not a database. A data lake stores raw, unprocessed data in its native format. The processing and schema are applied when the data is read, not when it is written. BigQuery blurs this line because it now functions as both a data warehouse and a data lakehouse, supporting structured and unstructured data with open table formats like Apache Iceberg. (BigQuery Introduction)
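
The schema-on-write vs. schema-on-read distinction can be made concrete with a small sketch. This is illustrative Python, not a Google Cloud API: the "warehouse" validates a schema at load time, while the "lake" stores raw bytes and applies structure only when the data is read.

```python
# Illustrative sketch (not a Google Cloud API): schema-on-write validates at
# load time (warehouse-style); schema-on-read stores raw data and parses it
# only at query time (lake-style).
import json

WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}

def warehouse_load(row: dict) -> dict:
    """Schema-on-write: reject rows that do not match the schema up front."""
    for field, ftype in WAREHOUSE_SCHEMA.items():
        if not isinstance(row.get(field), ftype):
            raise ValueError(f"schema violation on field {field!r}")
    return row

def lake_store(raw: str) -> str:
    """Schema-on-read: accept anything; the bytes are stored as-is."""
    return raw

def lake_read(raw: str) -> dict:
    """The schema is applied only when the data is read."""
    return json.loads(raw)

# The warehouse enforces structure at write time...
warehouse_load({"order_id": 7, "amount": 19.99})
# ...while the lake stores the raw payload and defers parsing to read time.
blob = lake_store('{"order_id": 7, "amount": 19.99}')
print(lake_read(blob)["amount"])  # schema applied here, at read time
```

The practical consequence: a lake never rejects data at ingest, so quality problems surface later, at read time.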

1.3 Data Types

Type | Definition | Examples | Where to Store on Google Cloud
Structured | Predefined schema, rows/columns | Financial transactions, inventory records | Cloud SQL, Cloud Spanner, BigQuery
Semi-structured | Flexible schema, self-describing | JSON, XML, Avro, Parquet | Firestore, BigQuery, Cloud Storage
Unstructured | No predefined schema | Images, video, audio, PDFs, logs | Cloud Storage

1.4 Data Governance

Data governance is the set of policies, processes, and standards that ensure data is managed properly throughout its lifecycle. The exam tests awareness of governance, not deep implementation details. (Cloud Digital Leader Exam Guide)

Core principles:

  • Data quality -- accuracy, completeness, consistency, timeliness
  • Data security -- access controls, encryption, audit trails
  • Data privacy -- compliance with regulations (GDPR, HIPAA, CCPA)
  • Data lineage -- tracking where data came from and how it was transformed
  • Data cataloging -- metadata management so users can discover and understand available data

Google Cloud's governance tool is Dataplex Universal Catalog, which provides unified data governance across data lakes, warehouses, and databases. It integrates with BigQuery to manage metadata, data quality, and lineage across your entire data landscape. (Dataplex Universal Catalog)

1.5 Real-Time vs. Batch Processing

Aspect | Batch Processing | Real-Time (Stream) Processing
Data handling | Collects data over time, processes in bulk | Processes data as it arrives
Latency | Minutes to hours | Milliseconds to seconds
Use cases | Monthly reports, ETL jobs, billing | Fraud detection, live dashboards, IoT monitoring
Google Cloud tools | Dataflow (batch mode), BigQuery batch loads | Dataflow (streaming mode), Pub/Sub, BigQuery streaming
Trade-off | Higher throughput, lower cost per record | Lower latency, higher cost per record
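
A toy sketch makes the batch/stream distinction tangible (this is an analogy, not Dataflow code): batch waits for the full dataset before producing one answer; streaming emits an updated result as each event arrives.

```python
# Toy sketch contrasting batch and streaming aggregation (not Dataflow code).

def batch_total(events):
    """Batch: collect everything first, then process in bulk."""
    return sum(events)

def stream_totals(events):
    """Streaming: emit an updated running total as each event arrives."""
    total = 0
    for e in events:
        total += e
        yield total  # a live dashboard could render each intermediate value

events = [5, 3, 7]
print(batch_total(events))          # one answer, after all data is in: 15
print(list(stream_totals(events)))  # continuous answers: [5, 8, 15]
```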

2. Google Cloud Data Management Solutions (Section 2.2)

This is the core of Domain 2. You need to match each service to its correct use case. The exam tests whether you can pick the right tool for a given scenario.

2.1 Cloud SQL

Cloud SQL is a fully managed relational database service supporting MySQL, PostgreSQL, and SQL Server.

Feature | Detail
Type | Relational (SQL), ACID-compliant
Managed engines | MySQL, PostgreSQL, SQL Server
Scale | Vertical scaling (larger machines); up to 64 TB of storage
Scope | Regional (single region)
High availability | Regional HA with automatic failover across zones
Read scaling | Read replicas (same region or cross-region)
Use cases | Web apps, CMS, CRM, ERP, e-commerce, SaaS
When to choose | Traditional relational workloads that fit in one region and do not need horizontal scaling

Exam trap: Cloud SQL scales vertically, not horizontally. If the question describes a globally distributed application requiring unlimited horizontal scaling with strong consistency, the answer is Cloud Spanner, not Cloud SQL.

2.2 Cloud Spanner

Cloud Spanner is a globally distributed, horizontally scalable, strongly consistent relational database.

Feature | Detail
Type | Relational (SQL), ACID-compliant
Scale | Horizontal scaling across regions; petabyte-scale
Scope | Regional or multi-regional (global)
Availability | Up to 99.999% (five nines) for multi-region configurations
Consistency | Strong external consistency (strongest possible)
Use cases | Global financial ledgers, gaming leaderboards, supply chain, payment processing, inventory management
When to choose | You need relational semantics (SQL, joins, ACID) at global scale with strong consistency

Exam trap: Spanner is significantly more expensive than Cloud SQL. If a question describes a simple regional web application, Cloud SQL is the correct answer. Spanner is overkill for workloads that do not require global distribution or massive horizontal scaling.

2.3 Cloud Bigtable

Cloud Bigtable is a fully managed, wide-column NoSQL database designed for large analytical and operational workloads.

Feature | Detail
Type | NoSQL, wide-column (HBase-compatible API)
Scale | Petabytes of data, millions of reads/writes per second
Latency | Single-digit millisecond
Consistency | Strongly consistent within a single cluster; eventually consistent across replicated clusters
Query support | No SQL, no joins, no multi-row transactions
Use cases | IoT time-series data, financial tick data, ad tech, personalization/recommendations, monitoring, geospatial
When to choose | Massive throughput of simple key-value lookups or range scans; data > 10 TB

Exam trap: Bigtable does not support SQL queries, joins, or multi-row transactions. If a question requires complex queries with joins, Bigtable is wrong. If the question emphasizes low-latency reads/writes on terabytes to petabytes of time-series or IoT data, Bigtable is correct.
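
Bigtable's access pattern can be sketched with a plain sorted dictionary. This is a hypothetical model, not the google-cloud-bigtable client: rows are keyed by a designed row key (here, device id plus a reversed timestamp so the newest reading sorts first), and reads are key lookups or prefix range scans, with no joins.

```python
# Hypothetical sketch of the Bigtable data model using a plain dict -- not the
# google-cloud-bigtable client. The row key design does the work: prefix scans
# replace queries, and a reversed timestamp puts the newest reading first.

MAX_TS = 10**10  # reversed-timestamp trick: newer rows get a smaller suffix

def row_key(device_id: str, ts: int) -> str:
    return f"{device_id}#{MAX_TS - ts:010d}"

table = {}  # row key -> {column_family:qualifier: value}

def write(device_id, ts, temp):
    table[row_key(device_id, ts)] = {"sensor:temp": temp}

def scan_prefix(prefix):
    """Range scan: all rows for one device, newest first (lexicographic order)."""
    return [table[k] for k in sorted(table) if k.startswith(prefix)]

write("sensor-42", 1000, 21.5)
write("sensor-42", 2000, 22.1)
print(scan_prefix("sensor-42#")[0])  # the newest reading comes back first
```

The design choice to note: in Bigtable, you model queries into the row key up front, because there is no query planner to rescue a bad schema later.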

2.4 Firestore

Firestore is a serverless, NoSQL document database designed for mobile, web, and IoT applications.

Feature | Detail
Type | NoSQL, document-oriented (collections/documents)
Scale | Automatic scaling, suitable for 0 to a few TB
Consistency | Strong consistency
Real-time | Built-in real-time listeners (data syncs to clients instantly)
Offline support | Client SDKs support offline data access and sync
Use cases | Mobile/web apps, real-time collaboration, user profiles, game state, shopping carts
When to choose | Mobile/web apps needing real-time sync, offline support, and flexible document schemas

Exam trap: Firestore is the successor to Cloud Datastore. If you see "Datastore" in an exam question, understand it is the legacy name. For new applications, Firestore (in Native mode) is the recommended choice. Firestore is strongly consistent; Bigtable is eventually consistent (except single-row reads).

2.5 Cloud Storage

Cloud Storage is a unified object storage service for any amount of data. It is not a database -- it stores files (objects) in buckets.

Storage Classes

All four classes use the same API and tools. The difference is cost structure: cheaper storage costs are offset by higher retrieval costs and minimum storage durations. (Storage Classes)

Storage Class | Min Duration | Access Pattern | Availability SLA (multi-region) | Use Cases
Standard | None | Frequently accessed ("hot" data) | 99.95% | Website content, streaming media, active data
Nearline | 30 days | ~Once per month | 99.9% | Backups, long-tail content, monthly analytics
Coldline | 90 days | ~Once per quarter | 99.9% | Disaster recovery, quarterly reporting
Archive | 365 days | Less than once per year | 99.9% | Regulatory compliance, long-term retention

Key facts for the exam:

  • Durability: 99.999999999% (eleven 9s) annual durability across all classes
  • Retrieval: Archive data is available within milliseconds, not hours or days (unlike AWS Glacier's restore delay)
  • Autoclass: Automatically transitions objects between storage classes based on access patterns to optimize cost
  • Unified API: Same tools and API regardless of storage class; no need to change application code
  • No minimum object size: unlike some competing object stores

Exam trap: Archive storage has a 365-day minimum storage duration. If you delete or overwrite an object before 365 days, you are charged for the full 365 days. The same principle applies to Nearline (30 days) and Coldline (90 days). Standard has no minimum.

2.6 BigQuery

BigQuery is Google Cloud's fully managed, serverless, enterprise data warehouse and analytics platform. It is arguably the most important service in Domain 2.

Feature | Detail
Type | Serverless data warehouse / data lakehouse
Architecture | Separated storage and compute (scale independently)
Query language | Standard SQL (ANSI SQL:2011 compliant)
Performance | Terabytes in seconds, petabytes in minutes
Data formats | Structured and semi-structured; supports Apache Iceberg, Delta, Hudi
ML integration | BigQuery ML -- create and run ML models using SQL
Multicloud | BigQuery Omni allows querying data in AWS and Azure without moving it
Streaming | Supports real-time streaming ingestion
BI integration | Native integration with Looker, Looker Studio, Google Sheets, and third-party tools (Tableau, Power BI)
Pricing models | On-demand (pay per TB scanned) or capacity-based (reserved slots)
Use cases | Enterprise analytics, reporting, data lakehouse, ML model training, multicloud analytics

(BigQuery Introduction)

Key concepts to know:

  • Storage-compute separation: BigQuery stores data in its distributed storage layer (Colossus) and uses a separate compute engine (Dremel) for queries. This means storage and compute scale independently, and you are not paying for idle compute.
  • BigQuery ML: Lets you create, train, and predict with ML models using SQL. No need to export data or learn a separate ML framework. Supports linear regression, logistic regression, k-means clustering, time-series forecasting, and more.
  • Federated queries: Query data in Cloud Storage, Bigtable, Spanner, or Google Sheets without loading it into BigQuery.
  • BigQuery Omni: Run BigQuery analytics on data stored in AWS S3 or Azure Blob Storage without copying data to Google Cloud.
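
To make the BigQuery ML point concrete, here is the general shape of a CREATE MODEL statement and a prediction query. The dataset, table, and column names are hypothetical; the statements are shown as Python strings so they can be inspected here -- in practice you would run them directly in BigQuery.

```python
# Sketch of BigQuery ML statements (dataset/model/column names are made up).
# Training and prediction are plain SQL -- no data export, no separate ML
# framework.

create_model_sql = """
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `mydataset.customers`;
""".strip()

predict_sql = """
SELECT * FROM ML.PREDICT(MODEL `mydataset.churn_model`,
                         (SELECT * FROM `mydataset.new_customers`));
""".strip()

print(create_model_sql)
```

The point for the exam: the entire train/predict loop stays inside the warehouse and is expressed in SQL.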

Exam trap: BigQuery is serverless -- you do not manage servers, clusters, or infrastructure. If a question asks about a managed analytics solution that requires no infrastructure management, BigQuery is likely the answer. Also, BigQuery is not a transactional database. Do not confuse it with Cloud SQL or Cloud Spanner.

2.7 Database Selection Decision Guide

Use this table to match scenario keywords to the correct service:

Scenario Keywords | Correct Service
MySQL, PostgreSQL, SQL Server, regional, lift-and-shift | Cloud SQL
Global, horizontal scaling, relational, 99.999% availability, financial | Cloud Spanner
IoT, time-series, low-latency, wide-column, petabytes, HBase | Cloud Bigtable
Mobile app, web app, real-time sync, offline, document database | Firestore
Object storage, images, videos, backups, archive, unstructured files | Cloud Storage
Analytics, data warehouse, serverless, SQL on petabytes, ML, dashboards | BigQuery
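
The decision table can be drilled as a tiny keyword matcher. This is a study aid under our own assumptions, not an official selection algorithm: the keyword sets come from the table above, and the first matching rule wins.

```python
# Study-aid sketch: the database decision guide as a keyword matcher.
# Rule order matters -- more specific signals (global + relational) are
# checked before generic ones.

RULES = [
    ({"global", "horizontal", "99.999%"}, "Cloud Spanner"),
    ({"mysql", "postgresql", "sql server", "lift-and-shift"}, "Cloud SQL"),
    ({"iot", "time-series", "wide-column", "hbase"}, "Cloud Bigtable"),
    ({"mobile", "offline", "real-time sync", "document"}, "Firestore"),
    ({"object", "images", "videos", "archive"}, "Cloud Storage"),
    ({"warehouse", "analytics", "serverless", "dashboards"}, "BigQuery"),
]

def pick_service(scenario: str) -> str:
    words = scenario.lower()
    for keywords, service in RULES:
        if any(k in words for k in keywords):
            return service
    return "insufficient keywords -- re-read the scenario"

print(pick_service("Globally distributed ledger needing horizontal scaling"))
print(pick_service("Serverless analytics over petabytes with dashboards"))
```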

3. Making Data Useful and Accessible (Section 2.3)

3.1 Looker

Looker is Google Cloud's enterprise business intelligence (BI) platform. It makes data accessible to non-technical users through self-service analytics.

Key capabilities:

  • LookML: A modeling language that defines data relationships, business logic, and metrics in a reusable, version-controlled layer. This ensures everyone in the organization works from a single source of truth.
  • Data democratization: Enables business users to explore data and build their own reports without writing SQL.
  • Embedded analytics: Looker can embed dashboards and visualizations directly into applications and portals.
  • BigQuery integration: Native, optimized connection to BigQuery for real-time analytics.
  • Governed metrics: Centralized metric definitions prevent conflicting interpretations of business KPIs.

Looker vs. Looker Studio: Looker is the enterprise BI platform (LookML modeling, governed metrics, embedded analytics, API access). Looker Studio (formerly Data Studio) is a free, self-service dashboarding and reporting tool. Looker Studio is simpler and focused on visualization; Looker adds the governance and semantic modeling layer.

Exam trap: When a question mentions "data democratization," "self-service analytics," or "single source of truth for metrics," the answer is Looker. When the question just needs a simple dashboard or report, Looker Studio may be sufficient.

3.2 Pub/Sub

Pub/Sub is a fully managed, real-time messaging service for event-driven architectures.

Feature | Detail
Model | Publish-subscribe (producers publish messages to topics; subscribers receive messages)
Delivery | At-least-once delivery; supports push and pull subscriptions
Scale | Billions of messages per day, globally
Ordering | Optional message ordering per key
Retention | Configurable message retention (up to 31 days)
Serverless | No provisioning or capacity planning required

(Pub/Sub Overview)

Use cases:

  • Event ingestion: Capture user interactions, IoT sensor data, application logs
  • Decoupling microservices: Publishers and subscribers operate independently; neither needs to know about the other
  • Streaming data pipelines: Feed real-time data into Dataflow, BigQuery, or Cloud Storage
  • Enterprise event bus: Distribute business events across teams and applications

How Pub/Sub fits into pipelines: In a typical streaming architecture, data producers publish events to Pub/Sub topics. Dataflow subscribes to those topics, transforms the data in real time, and writes results to BigQuery or Cloud Storage for analysis in Looker.
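
The decoupling described above can be sketched with a minimal in-memory topic (an analogy, not the google-cloud-pubsub client): the publisher knows nothing about its subscribers, and every subscription receives its own copy of each message.

```python
# Minimal in-memory sketch of publish-subscribe fan-out -- not the
# google-cloud-pubsub client. Publisher and subscribers are fully decoupled.

class Topic:
    def __init__(self):
        self.subscriptions = []

    def subscribe(self, callback):
        self.subscriptions.append(callback)

    def publish(self, message):
        for deliver in self.subscriptions:  # fan-out: each subscriber gets a copy
            deliver(message)

events = Topic()
warehouse_rows, audit_log = [], []
events.subscribe(warehouse_rows.append)  # e.g. a Dataflow-style consumer
events.subscribe(audit_log.append)       # an independent consumer

events.publish({"user": "ada", "action": "checkout"})
print(warehouse_rows == audit_log)  # both received the event independently
```

Adding a third consumer requires no change to the publisher -- that is the decoupling the exam asks about.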

3.3 Dataflow

Dataflow is a fully managed service for stream and batch data processing, built on the open-source Apache Beam SDK.

Feature | Detail
Programming model | Apache Beam (unified batch and stream processing)
Execution | Fully managed, serverless (auto-scaling workers)
Scale | Up to 4,000 workers per job; routinely processes petabytes
Languages | Java, Python, Go
Use cases | ETL pipelines, real-time analytics, data enrichment, log processing

(Dataflow Overview)

Key concepts:

  • Unified model: The same Apache Beam pipeline code processes both batch (bounded) and streaming (unbounded) data. You write once and run either way.
  • PCollections and PTransforms: Data in Beam is a PCollection (a dataset). Transformations are PTransforms (operations applied to PCollections).
  • Windowing: Groups unbounded streaming data into finite windows (fixed, sliding, session) for aggregation.
  • Autoscaling: Dataflow automatically adds or removes workers based on pipeline throughput, optimizing cost and performance.
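
Windowing is the concept most worth internalizing. Here is a toy sketch (plain Python, not Apache Beam code): unbounded events are bucketed into fixed 60-second windows by timestamp, then aggregated per window.

```python
# Toy sketch of fixed windowing -- not Apache Beam code. Each event's
# timestamp is floored to its 60-second window start, then counted per window.
from collections import defaultdict

WINDOW_SECONDS = 60

def fixed_window_counts(events):
    """events: iterable of (timestamp_seconds, value) -> event count per window."""
    windows = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start] += 1
    return dict(windows)

clicks = [(5, "a"), (42, "b"), (61, "c"), (130, "d")]
print(fixed_window_counts(clicks))  # {0: 2, 60: 1, 120: 1}
```

Sliding and session windows refine the same idea: they change how window boundaries are drawn, not the principle of turning an infinite stream into finite groups.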

Exam trap: Dataflow is the processing engine; Pub/Sub is the messaging layer. They are complementary, not competing. A common exam scenario: "ingest real-time events" (Pub/Sub) then "transform and load into BigQuery" (Dataflow).

3.4 End-to-End Streaming Architecture

The exam may test your understanding of how these services connect. Here is the canonical Google Cloud streaming analytics pipeline:

Data Sources → Pub/Sub → Dataflow → BigQuery → Looker
(IoT, apps,    (ingest)   (transform)  (store/    (visualize/
 logs, events)                          analyze)    report)
Stage | Service | Role
Ingest | Pub/Sub | Collect events from any number of producers
Process | Dataflow | Transform, enrich, aggregate data in real time
Store/Analyze | BigQuery | Warehouse the processed data; run SQL analytics
Visualize | Looker / Looker Studio | Build dashboards and reports; share insights
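
The four stages above can be sketched as composed plain functions. This is an analogy for study purposes, not real client code: each stage hands its output to the next, mirroring Pub/Sub → Dataflow → BigQuery → Looker.

```python
# Analogy sketch of the canonical pipeline as composed functions -- not real
# Google Cloud client code. Each function stands in for one stage.

def ingest(raw_events):           # Pub/Sub: collect events from producers
    return list(raw_events)

def transform(events):            # Dataflow: clean and enrich in flight
    return [{"user": e["user"], "spend": round(e["spend"], 2)} for e in events]

def store_and_aggregate(rows):    # BigQuery: SQL-style aggregation
    return sum(r["spend"] for r in rows)

def visualize(total):             # Looker: render the insight for humans
    return f"Total spend today: ${total:.2f}"

raw = [{"user": "ada", "spend": 12.499}, {"user": "lin", "spend": 7.5}]
print(visualize(store_and_aggregate(transform(ingest(raw)))))
```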

3.5 Database Migration and Modernization

The exam covers awareness of migration strategies, not deep implementation details.

Database Migration Service (DMS): A fully managed service that migrates databases to Google Cloud with minimal downtime. Supports migrations to Cloud SQL and AlloyDB from sources like MySQL, PostgreSQL, SQL Server, and Oracle. (Database Migration Service)

Common migration strategies:

Strategy | Description | Example
Lift and shift | Move database as-is to a managed service | On-prem MySQL to Cloud SQL for MySQL
Migrate and modernize | Move and upgrade to a cloud-native service | Oracle DB to Cloud Spanner or AlloyDB
Replatform | Change the underlying engine | SQL Server to PostgreSQL on Cloud SQL

Exam trap: "Migration" questions on this exam are high-level. They want you to know that Google Cloud offers managed migration tools and that organizations can move to fully managed databases to reduce operational overhead. You will not be asked to configure DMS step by step.

4. Exam Preparation Tips for Domain 2

High-Frequency Topics

  1. BigQuery -- most-tested service in this domain. Know it is serverless, separates storage/compute, supports ML via SQL, and works as a data lakehouse.
  2. Storage class selection -- memorize the four classes, their minimum durations (none, 30, 90, 365 days), and access patterns.
  3. Database selection -- given a scenario, pick the right database. Cloud SQL for regional relational, Spanner for global relational, Bigtable for massive NoSQL throughput, Firestore for mobile/web with real-time sync.
  4. Pub/Sub + Dataflow pipeline -- understand the canonical streaming architecture and each component's role.
  5. Data types (structured/semi-structured/unstructured) and where each is stored.

Common Exam Traps

Trap | Correct Answer
Using Bigtable for complex SQL queries with joins | Bigtable has no SQL, no joins -- use BigQuery or Cloud SQL
Choosing Spanner for a simple regional web app database | Cloud SQL is cheaper and sufficient for regional workloads
Confusing BigQuery (analytics) with Cloud SQL (transactions) | BigQuery is for analytics/OLAP; Cloud SQL is for transactions/OLTP
Thinking Archive storage takes hours to retrieve | Cloud Storage Archive retrieves in milliseconds (unlike AWS Glacier)
Mixing up Pub/Sub (messaging) and Dataflow (processing) | Pub/Sub ingests; Dataflow transforms. They work together, not interchangeably.
Choosing Firestore when data exceeds 10+ TB | Firestore is for 0 to a few TB. For massive NoSQL, use Bigtable.

References