Domain 2: Exploring Data Transformation with Google Cloud (~16%)
Domain 2 of the Google Cloud Digital Leader exam tests your understanding of how data drives business value and which Google Cloud products manage, process, and analyze that data. At approximately 16% of the exam, expect 8-10 questions across three objectives: the value of data, Google Cloud data management solutions, and making data useful and accessible.
This domain is product-heavy. You must know what each service does, when to choose it over alternatives, and how the services connect to form end-to-end data pipelines.
1. The Value of Data (Section 2.1)
1.1 Data as a Business Driver
Organizations use data to generate business insights, drive decision-making, and create new value. The exam expects you to understand that digital transformation fundamentally depends on an organization's ability to collect, manage, and act on data. Data is not just a byproduct of operations; it is a strategic asset. (Cloud Digital Leader Exam Guide)
Key concepts:
- Descriptive analytics -- what happened (dashboards, reports)
- Diagnostic analytics -- why it happened (drill-down analysis)
- Predictive analytics -- what will happen (ML models, forecasting)
- Prescriptive analytics -- what should we do (optimization, recommendations)
Cloud makes advanced analytics accessible because organizations no longer need to provision massive on-premises infrastructure to store and process data. They can scale compute and storage independently, paying only for what they use.
1.2 Databases vs. Data Warehouses vs. Data Lakes
This is a high-probability exam topic. Know the definitions and the differences cold.
| Characteristic | Database | Data Warehouse | Data Lake |
|---|---|---|---|
| Primary purpose | Transactional operations (OLTP) | Analytical queries (OLAP) | Raw data storage for future processing |
| Data structure | Structured (schema-on-write) | Structured/semi-structured (schema-on-write) | Structured, semi-structured, and unstructured (schema-on-read) |
| Data freshness | Real-time current state | Periodically loaded (ETL) | Raw ingestion, processed later |
| Typical users | Applications, developers | Business analysts, data scientists | Data engineers, data scientists |
| Query pattern | Many small reads/writes | Few complex analytical queries | Varies by downstream consumer |
| Google Cloud example | Cloud SQL, Firestore | BigQuery | Cloud Storage (with analytics tools) |
Exam trap: A data lake is not a database. A data lake stores raw, unprocessed data in its native format. The processing and schema are applied when the data is read, not when it is written. BigQuery blurs this line because it now functions as both a data warehouse and a data lakehouse, supporting structured and unstructured data with open table formats like Apache Iceberg. (BigQuery Introduction)
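The schema-on-write vs. schema-on-read distinction can be shown in a few lines of plain Python. This is a toy illustration only, using sqlite3 as a stand-in relational database and JSON lines as a stand-in "lake" -- none of it is Google Cloud API code:

```python
import json
import sqlite3

# Schema-on-write: the database rejects data that violates the schema at insert time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER NOT NULL, amount REAL NOT NULL)")
db.execute("INSERT INTO orders VALUES (1, 19.99)")
try:
    db.execute("INSERT INTO orders VALUES (2, NULL)")  # violates NOT NULL
except sqlite3.IntegrityError as e:
    print("rejected at write time:", e)

# Schema-on-read: the "lake" accepts anything; structure is imposed when read.
raw_lines = [
    '{"id": 3, "amount": 42.5}',
    '{"id": 4, "note": "no amount field"}',  # still stored, no validation
]
for line in raw_lines:
    record = json.loads(line)                # schema applied here, at read
    amount = record.get("amount", 0.0)       # reader decides how to handle gaps
    print(record["id"], amount)
```

The database enforces structure before data lands; the lake defers every such decision to whoever reads the file later.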
1.3 Data Types
| Type | Definition | Examples | Google Cloud options |
|---|---|---|---|
| Structured | Predefined schema, rows/columns | Financial transactions, inventory records | Cloud SQL, Cloud Spanner, BigQuery |
| Semi-structured | Flexible schema, self-describing | JSON, XML, Avro, Parquet | Firestore, BigQuery, Cloud Storage |
| Unstructured | No predefined schema | Images, video, audio, PDFs, logs | Cloud Storage |
1.4 Data Governance
Data governance is the set of policies, processes, and standards that ensure data is managed properly throughout its lifecycle. The exam tests awareness of governance, not deep implementation details. (Cloud Digital Leader Exam Guide)
Core principles:
- Data quality -- accuracy, completeness, consistency, timeliness
- Data security -- access controls, encryption, audit trails
- Data privacy -- compliance with regulations (GDPR, HIPAA, CCPA)
- Data lineage -- tracking where data came from and how it was transformed
- Data cataloging -- metadata management so users can discover and understand available data
Google Cloud's governance tool is Dataplex Universal Catalog, which provides unified data governance across data lakes, warehouses, and databases. It integrates with BigQuery to manage metadata, data quality, and lineage across your entire data landscape. (Dataplex Universal Catalog)
1.5 Real-Time vs. Batch Processing
| Aspect | Batch Processing | Real-Time (Stream) Processing |
|---|---|---|
| Data handling | Collects data over time, processes in bulk | Processes data as it arrives |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use cases | Monthly reports, ETL jobs, billing | Fraud detection, live dashboards, IoT monitoring |
| Google Cloud tools | Dataflow (batch mode), BigQuery batch loads | Dataflow (streaming mode), Pub/Sub, BigQuery streaming |
| Trade-off | Higher throughput, lower cost per record | Lower latency, higher cost per record |
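The batch/stream trade-off in the table comes down to when state is updated. A minimal sketch in plain Python (a list stands in for a bounded batch, an iterator for an unbounded stream):

```python
events = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 2), ("sensor-b", 1)]

# Batch: collect everything first, then process in one bulk pass.
def batch_totals(all_events):
    totals = {}
    for key, value in all_events:
        totals[key] = totals.get(key, 0) + value
    return totals

# Stream: update state incrementally as each event arrives.
def stream_totals(event_iter):
    totals = {}
    for key, value in event_iter:   # could be an unbounded source
        totals[key] = totals.get(key, 0) + value
        yield dict(totals)          # emit a fresh result after every event

print(batch_totals(events))         # one result, available only at the end
final = None
for snapshot in stream_totals(iter(events)):
    final = snapshot                # results continuously up to date
print(final)
```

Both arrive at the same totals; streaming pays extra work per record to keep the answer current at every moment, which is exactly the latency-for-cost trade listed above.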
2. Google Cloud Data Management Solutions (Section 2.2)
This is the core of Domain 2. You need to match each service to its correct use case. The exam tests whether you can pick the right tool for a given scenario.
2.1 Cloud SQL
Cloud SQL is a fully managed relational database service supporting MySQL, PostgreSQL, and SQL Server.
| Feature | Detail |
|---|---|
| Type | Relational (SQL), ACID-compliant |
| Managed engines | MySQL, PostgreSQL, SQL Server |
| Scale | Vertical scaling (larger machines); storage up to 64 TB per instance |
| Scope | Regional (single region) |
| High availability | Regional HA with automatic failover across zones |
| Read scaling | Read replicas (same region or cross-region) |
| Use cases | Web apps, CMS, CRM, ERP, e-commerce, SaaS |
| When to choose | Traditional relational workloads that fit in one region and do not need horizontal scaling |
Exam trap: Cloud SQL scales vertically, not horizontally. If the question describes a globally distributed application requiring unlimited horizontal scaling with strong consistency, the answer is Cloud Spanner, not Cloud SQL.
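What "relational, ACID-compliant" buys you is easiest to see in a transaction. A sketch using sqlite3 purely as a local stand-in for a relational OLTP database (a real Cloud SQL instance would be reached through a MySQL or PostgreSQL driver instead):

```python
import sqlite3

# sqlite3 is only a stand-in here to demonstrate ACID transaction semantics.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100), ("bob", 50)])

# A classic OLTP transaction: both updates commit atomically, or neither does.
try:
    with db:  # context manager commits on success, rolls back on any error
        db.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        db.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure the rollback leaves both balances untouched

balances = dict(db.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 70, 'bob': 80}
```

This all-or-nothing guarantee across rows is what the NoSQL options below (Bigtable especially) deliberately give up in exchange for scale.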
2.2 Cloud Spanner
Cloud Spanner is a globally distributed, horizontally scalable, strongly consistent relational database.
| Feature | Detail |
|---|---|
| Type | Relational (SQL), ACID-compliant |
| Scale | Horizontal scaling across regions; petabyte-scale |
| Scope | Regional or multi-regional (global) |
| Availability | Up to 99.999% (five nines) for multi-region configurations |
| Consistency | Strong external consistency (strongest possible) |
| Use cases | Global financial ledgers, gaming leaderboards, supply chain, payment processing, inventory management |
| When to choose | You need relational semantics (SQL, joins, ACID) at global scale with strong consistency |
Exam trap: Spanner is significantly more expensive than Cloud SQL. If a question describes a simple regional web application, Cloud SQL is the correct answer. Spanner is overkill for workloads that do not require global distribution or massive horizontal scaling.
2.3 Cloud Bigtable
Cloud Bigtable is a fully managed, wide-column NoSQL database designed for large analytical and operational workloads.
| Feature | Detail |
|---|---|
| Type | NoSQL, wide-column (HBase-compatible API) |
| Scale | Petabytes of data, millions of reads/writes per second |
| Latency | Single-digit millisecond |
| Consistency | Strongly consistent within a single cluster; eventually consistent across replicated clusters |
| Query support | No SQL, no joins, no multi-row transactions |
| Use cases | IoT time-series data, financial tick data, ad tech, personalization/recommendations, monitoring, geospatial |
| When to choose | Massive throughput of simple key-value lookups or range scans; data > 10 TB |
Exam trap: Bigtable does not support SQL queries, joins, or multi-row transactions. If a question requires complex queries with joins, Bigtable is wrong. If the question emphasizes low-latency reads/writes on terabytes to petabytes of time-series or IoT data, Bigtable is correct.
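Bigtable's two access patterns -- point lookup and contiguous range scan over sorted row keys -- can be modeled with a sorted list and binary search. A toy sketch (nothing here is the Bigtable API) showing why a row key like "sensor#timestamp" makes time-series queries cheap:

```python
import bisect

# Rows sorted by row key; keys like "sensor#timestamp" keep one device's
# readings adjacent, so a time range becomes a single contiguous scan.
rows = sorted([
    ("sensor-a#2024-01-01T00:00", 21.5),
    ("sensor-a#2024-01-01T00:05", 21.7),
    ("sensor-b#2024-01-01T00:00", 18.2),
    ("sensor-a#2024-01-01T00:10", 21.9),
])
keys = [k for k, _ in rows]

def range_scan(prefix):
    """Return all rows whose key starts with `prefix` (one contiguous slice)."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")  # just past the prefix range
    return rows[lo:hi]

print(range_scan("sensor-a#"))  # all of sensor-a's readings, in time order
```

Note what is absent: no joins, no secondary indexes, no multi-row transactions. Everything hinges on designing the row key so the data you want is already adjacent.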
2.4 Firestore
Firestore is a serverless, NoSQL document database designed for mobile, web, and IoT applications.
| Feature | Detail |
|---|---|
| Type | NoSQL, document-oriented (collections/documents) |
| Scale | Automatic scaling, suitable for 0 to a few TB |
| Consistency | Strong consistency |
| Real-time | Built-in real-time listeners (data syncs to clients instantly) |
| Offline support | Client SDKs support offline data access and sync |
| Use cases | Mobile/web apps, real-time collaboration, user profiles, game state, shopping carts |
| When to choose | Mobile/web apps needing real-time sync, offline support, and flexible document schemas |
Exam trap: Firestore is the successor to Cloud Datastore. If you see "Datastore" in an exam question, understand it is the legacy name. For new applications, Firestore (in Native mode) is the recommended choice. Firestore is strongly consistent; Bigtable is eventually consistent (except single-row reads).
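The document model plus real-time listeners can be sketched in-process. The class and method names below are illustrative inventions, not the Firestore SDK (which exposes this pattern as snapshot listeners):

```python
# Toy document database: schemaless dicts grouped into collections, with
# registered callbacks fired on every write -- the shape of Firestore's
# real-time sync. Purely illustrative; not the Firestore client API.
class DocumentStore:
    def __init__(self):
        self.collections = {}
        self.listeners = []

    def on_snapshot(self, callback):
        self.listeners.append(callback)

    def set(self, collection, doc_id, data):
        self.collections.setdefault(collection, {})[doc_id] = data
        for cb in self.listeners:       # push the change to every subscriber
            cb(collection, doc_id, data)

store = DocumentStore()
seen = []
store.on_snapshot(lambda col, doc, data: seen.append((col, doc)))
store.set("users", "u1", {"name": "Ada", "cart": ["book"]})
store.set("users", "u2", {"name": "Grace"})  # no fixed schema required
print(seen)  # both writes pushed to the listener as they happened
```

The push-on-write flow is what keeps every connected client's view current without polling.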
2.5 Cloud Storage
Cloud Storage is a unified object storage service for any amount of data. It is not a database -- it stores files (objects) in buckets.
Storage Classes
All four classes use the same API and tools. The difference is cost structure: cheaper storage costs are offset by higher retrieval costs and minimum storage durations. (Storage Classes)
| Storage Class | Min Duration | Access Pattern | Availability SLA (multi-region) | Use Cases |
|---|---|---|---|---|
| Standard | None | Frequently accessed ("hot" data) | 99.95% | Website content, streaming media, active data |
| Nearline | 30 days | ~Once per month | 99.9% | Backups, long-tail content, monthly analytics |
| Coldline | 90 days | ~Once per quarter | 99.9% | Disaster recovery, quarterly reporting |
| Archive | 365 days | Less than once per year | 99.9% | Regulatory compliance, long-term retention |
Key facts for the exam:
- Durability: 99.999999999% (eleven 9s) annual durability across all classes
- Retrieval: Archive data is available within milliseconds, not hours or days (unlike AWS Glacier's restore delay)
- Autoclass: Automatically transitions objects between storage classes based on access patterns to optimize cost
- Unified API: Same tools and API regardless of storage class; no need to change application code
- No minimum billable object size, unlike some competitors' infrequent-access tiers
Exam trap: Archive storage has a 365-day minimum storage duration. If you delete or overwrite an object before 365 days, you are charged for the full 365 days. The same principle applies to Nearline (30 days) and Coldline (90 days). Standard has no minimum.
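The minimum-duration rule is simple arithmetic worth internalizing: billed days is the greater of actual days stored and the class minimum.

```python
# Early-deletion arithmetic for the minimum storage durations above.
MIN_DAYS = {"standard": 0, "nearline": 30, "coldline": 90, "archive": 365}

def billable_days(storage_class, days_stored):
    """An object deleted early is still billed for the class's full minimum."""
    return max(days_stored, MIN_DAYS[storage_class])

# Delete a Coldline object after 10 days: billed as if stored 90 days.
print(billable_days("coldline", 10))   # 90
# Keep an Archive object two years: billed for the actual 730 days.
print(billable_days("archive", 730))   # 730
```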
2.6 BigQuery
BigQuery is Google Cloud's fully managed, serverless, enterprise data warehouse and analytics platform. It is arguably the most important service in Domain 2.
| Feature | Detail |
|---|---|
| Type | Serverless data warehouse / data lakehouse |
| Architecture | Separated storage and compute (scale independently) |
| Query language | Standard SQL (ANSI SQL:2011 compliant) |
| Performance | Terabytes in seconds, petabytes in minutes |
| Data formats | Structured and semi-structured; supports Apache Iceberg, Delta, Hudi |
| ML integration | BigQuery ML -- create and run ML models using SQL |
| Multicloud | BigQuery Omni allows querying data in AWS and Azure without moving it |
| Streaming | Supports real-time streaming ingestion |
| BI integration | Native integration with Looker, Looker Studio, Google Sheets, and third-party tools (Tableau, Power BI) |
| Pricing models | On-demand (pay per TiB of data scanned) or capacity-based (reserved slots) |
| Use cases | Enterprise analytics, reporting, data lakehouse, ML model training, multicloud analytics |
Key concepts to know:
- Storage-compute separation: BigQuery stores data in its distributed storage layer (Colossus) and uses a separate compute engine (Dremel) for queries. This means storage and compute scale independently, and you are not paying for idle compute.
- BigQuery ML: Lets you create, train, and predict with ML models using SQL. No need to export data or learn a separate ML framework. Supports linear regression, logistic regression, k-means clustering, time-series forecasting, and more.
- Federated queries: Query data in Cloud Storage, Bigtable, Spanner, or Google Sheets without loading it into BigQuery.
- BigQuery Omni: Run BigQuery analytics on data stored in AWS S3 or Azure Blob Storage without copying data to Google Cloud.
Exam trap: BigQuery is serverless -- you do not manage servers, clusters, or infrastructure. If a question asks about a managed analytics solution that requires no infrastructure management, BigQuery is likely the answer. Also, BigQuery is not a transactional database. Do not confuse it with Cloud SQL or Cloud Spanner.
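On-demand pricing has a property worth understanding: cost is driven by bytes scanned, not rows returned. The arithmetic is a one-liner; the $6.25/TiB rate below is an assumed example figure, so check current pricing before relying on it:

```python
# On-demand BigQuery queries bill by bytes scanned, not by rows returned.
# The $6.25/TiB rate is an assumed example, not a guaranteed current price.
PRICE_PER_TIB = 6.25
TIB = 2 ** 40

def query_cost(bytes_scanned):
    return bytes_scanned / TIB * PRICE_PER_TIB

# Scanning a 512 GiB column costs the same whether it returns 1 row or 1M.
print(round(query_cost(512 * 2 ** 30), 4))   # 3.125
```

This is why column selection and partition pruning, not result size, are the levers for controlling on-demand query cost.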
2.7 Database Selection Decision Guide
Use this table to match scenario keywords to the correct service:
| Scenario Keywords | Correct Service |
|---|---|
| MySQL, PostgreSQL, SQL Server, regional, lift-and-shift | Cloud SQL |
| Global, horizontal scaling, relational, 99.999% availability, financial | Cloud Spanner |
| IoT, time-series, low-latency, wide-column, petabytes, HBase | Cloud Bigtable |
| Mobile app, web app, real-time sync, offline, document database | Firestore |
| Object storage, images, videos, backups, archive, unstructured files | Cloud Storage |
| Analytics, data warehouse, serverless, SQL on petabytes, ML, dashboards | BigQuery |
3. Making Data Useful and Accessible (Section 2.3)
3.1 Looker
Looker is Google Cloud's enterprise business intelligence (BI) platform. It makes data accessible to non-technical users through self-service analytics.
Key capabilities:
- LookML: A modeling language that defines data relationships, business logic, and metrics in a reusable, version-controlled layer. This ensures everyone in the organization works from a single source of truth.
- Data democratization: Enables business users to explore data and build their own reports without writing SQL.
- Embedded analytics: Looker can embed dashboards and visualizations directly into applications and portals.
- BigQuery integration: Native, optimized connection to BigQuery for real-time analytics.
- Governed metrics: Centralized metric definitions prevent conflicting interpretations of business KPIs.
Looker vs. Looker Studio: Looker is the enterprise BI platform (LookML modeling, governed metrics, embedded analytics, API access). Looker Studio (formerly Data Studio) is a free, self-service dashboarding and reporting tool. Looker Studio is simpler and focused on visualization; Looker adds the governance and semantic modeling layer.
Exam trap: When a question mentions "data democratization," "self-service analytics," or "single source of truth for metrics," the answer is Looker. When the question just needs a simple dashboard or report, Looker Studio may be sufficient.
3.2 Pub/Sub
Pub/Sub is a fully managed, real-time messaging service for event-driven architectures.
| Feature | Detail |
|---|---|
| Model | Publish-subscribe (producers publish messages to topics; subscribers receive messages) |
| Delivery | At-least-once delivery; supports push and pull subscriptions |
| Scale | Billions of messages per day, globally |
| Ordering | Optional message ordering per key |
| Retention | Configurable message retention (up to 31 days) |
| Serverless | No provisioning or capacity planning required |
Use cases:
- Event ingestion: Capture user interactions, IoT sensor data, application logs
- Decoupling microservices: Publishers and subscribers operate independently; neither needs to know about the other
- Streaming data pipelines: Feed real-time data into Dataflow, BigQuery, or Cloud Storage
- Enterprise event bus: Distribute business events across teams and applications
How Pub/Sub fits into pipelines: In a typical streaming architecture, data producers publish events to Pub/Sub topics. Dataflow subscribes to those topics, transforms the data in real time, and writes results to BigQuery or Cloud Storage for analysis in Looker.
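The decoupling idea behind topics and subscriptions fits in a few lines. This is an in-process sketch of the publish-subscribe pattern only -- real Pub/Sub adds durability, acknowledgments, and global scale that a local broker cannot model:

```python
from collections import defaultdict

# Minimal in-process publish-subscribe: publishers write to a named topic
# without knowing who (if anyone) is listening. Illustrative sketch only.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:  # fan-out to every subscriber
            handler(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)      # e.g. a Dataflow pipeline
broker.subscribe("orders", lambda m: None)       # e.g. an audit logger
broker.publish("orders", {"order_id": 42, "total": 19.99})
print(received)  # [{'order_id': 42, 'total': 19.99}]
```

Adding a third consumer requires no change to the publisher, which is the whole point of the pattern for decoupling microservices.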
3.3 Dataflow
Dataflow is a fully managed service for stream and batch data processing, built on the open-source Apache Beam SDK.
| Feature | Detail |
|---|---|
| Programming model | Apache Beam (unified batch and stream processing) |
| Execution | Fully managed, serverless (auto-scaling workers) |
| Scale | Up to 4,000 workers per job; routinely processes petabytes |
| Languages | Java, Python, Go |
| Use cases | ETL pipelines, real-time analytics, data enrichment, log processing |
Key concepts:
- Unified model: The same Apache Beam pipeline code processes both batch (bounded) and streaming (unbounded) data. You write once and run either way.
- PCollections and PTransforms: Data in Beam is a PCollection (a dataset). Transformations are PTransforms (operations applied to PCollections).
- Windowing: Groups unbounded streaming data into finite windows (fixed, sliding, session) for aggregation.
- Autoscaling: Dataflow automatically adds or removes workers based on pipeline throughput, optimizing cost and performance.
Exam trap: Dataflow is the processing engine; Pub/Sub is the messaging layer. They are complementary, not competing. A common exam scenario: "ingest real-time events" (Pub/Sub) then "transform and load into BigQuery" (Dataflow).
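Windowing is the least intuitive concept in the list above, so here is the simplest case, fixed (tumbling) windows, in plain Python. This models the idea only; in Beam you would express it with the SDK's windowing transforms:

```python
from collections import defaultdict

# Fixed (tumbling) windowing over timestamped events: each event lands in
# exactly one window, bucketed by the window's start time.
def fixed_windows(events, width_secs=60):
    windows = defaultdict(int)
    for ts, value in events:
        window_start = ts - (ts % width_secs)   # e.g. ts=75 -> window [60, 120)
        windows[window_start] += value
    return dict(windows)

# (timestamp_seconds, value) pairs from an unbounded stream
events = [(5, 1), (42, 2), (61, 3), (119, 4), (120, 5)]
print(fixed_windows(events))  # {0: 3, 60: 7, 120: 5}
```

Sliding and session windows follow the same idea with overlapping or gap-based boundaries; all three exist to make aggregation over an endless stream finite.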
3.4 End-to-End Streaming Architecture
The exam may test your understanding of how these services connect. Here is the canonical Google Cloud streaming analytics pipeline:
```
Data Sources  →  Pub/Sub  →  Dataflow   →  BigQuery  →  Looker
(IoT, apps,      (ingest)    (transform)    (store/      (visualize/
 logs, events)                               analyze)     report)
```
| Stage | Service | Role |
|---|---|---|
| Ingest | Pub/Sub | Collect events from any number of producers |
| Process | Dataflow | Transform, enrich, aggregate data in real time |
| Store/Analyze | BigQuery | Warehouse the processed data; run SQL analytics |
| Visualize | Looker / Looker Studio | Build dashboards and reports; share insights |
3.5 Database Migration and Modernization
The exam covers awareness of migration strategies, not deep implementation details.
Database Migration Service (DMS): A fully managed service that migrates databases to Google Cloud with minimal downtime. Supports migrations to Cloud SQL and AlloyDB from sources like MySQL, PostgreSQL, SQL Server, and Oracle. (Database Migration Service)
Common migration strategies:
| Strategy | Description | Example |
|---|---|---|
| Lift and shift | Move database as-is to a managed service | On-prem MySQL to Cloud SQL for MySQL |
| Migrate and modernize | Move and upgrade to a cloud-native service | Oracle DB to Cloud Spanner or AlloyDB |
| Replatform | Change the underlying engine | SQL Server to PostgreSQL on Cloud SQL |
Exam trap: "Migration" questions on this exam are high-level. They want you to know that Google Cloud offers managed migration tools and that organizations can move to fully managed databases to reduce operational overhead. You will not be asked to configure DMS step by step.
4. Exam Preparation Tips for Domain 2
High-Frequency Topics
- BigQuery -- most-tested service in this domain. Know it is serverless, separates storage/compute, supports ML via SQL, and works as a data lakehouse.
- Storage class selection -- memorize the four classes, their minimum durations (none, 30, 90, 365 days), and access patterns.
- Database selection -- given a scenario, pick the right database. Cloud SQL for regional relational, Spanner for global relational, Bigtable for massive NoSQL throughput, Firestore for mobile/web with real-time sync.
- Pub/Sub + Dataflow pipeline -- understand the canonical streaming architecture and each component's role.
- Data types (structured/semi-structured/unstructured) and where each is stored.
Common Exam Traps
| Trap | Correct Answer |
|---|---|
| Using Bigtable for complex SQL queries with joins | Bigtable has no SQL, no joins -- use BigQuery or Cloud SQL |
| Choosing Spanner for a simple regional web app database | Cloud SQL is cheaper and sufficient for regional workloads |
| Confusing BigQuery (analytics) with Cloud SQL (transactions) | BigQuery is for analytics/OLAP; Cloud SQL is for transactions/OLTP |
| Thinking Archive storage takes hours to retrieve | Cloud Storage Archive retrieves in milliseconds (unlike AWS Glacier) |
| Mixing up Pub/Sub (messaging) and Dataflow (processing) | Pub/Sub ingests; Dataflow transforms. They work together, not interchangeably. |
| Choosing Firestore when data exceeds 10+ TB | Firestore is for 0 to a few TB. For massive NoSQL, use Bigtable. |
References
- Cloud Digital Leader Exam Guide
- BigQuery Introduction
- Cloud Storage - Storage Classes
- Pub/Sub Overview
- Dataflow Overview
- Cloud SQL Documentation
- Cloud Spanner Documentation
- Cloud Bigtable Documentation
- Firestore Documentation
- Looker Documentation
- Dataplex Universal Catalog
- Database Migration Service
- Your Google Cloud Database Options, Explained