Reference

Cloud Services Cross-Reference: Analytics & Big Data

This document maps analytics and big data services across AWS, Azure, Oracle Cloud Infrastructure (OCI), and Google Cloud Platform (GCP). Coverage spans data warehousing, streaming, ETL/data integration, data lakes, business intelligence, data catalog and governance, distributed compute (Hadoop/Spark), and batch processing. All four providers offer competing managed services in each category, but their architectural models, pricing structures, and native ecosystem integrations differ substantially. GCP's BigQuery stands alone as a fully serverless, storage-compute-separated warehouse with no cluster management. Azure has shifted its strategic emphasis from Azure Synapse Analytics toward Microsoft Fabric, a SaaS-unified analytics platform. AWS continues to expand the open lakehouse pattern around Apache Iceberg across all analytics services. OCI differentiates on price and deep Oracle Database integration across the stack.


1. Data Warehousing

Data warehouse services provide managed, columnar-storage SQL query engines optimized for analytical workloads against large structured datasets.

AWS - Amazon Redshift

Redshift offers two deployment modes: provisioned clusters (billed per node-hour) and Redshift Serverless (billed per RPU-hour, quoted here at $0.36/RPU-hour with a 60-second minimum charge; rates vary by region). As of 2025, Redshift writes directly to Apache Iceberg tables, enabling open lakehouse patterns in which Redshift, EMR, Athena, and Glue all share data in S3 without copying. Redshift Spectrum allows querying external S3 data from Redshift SQL. Redshift remains capacity-oriented even in serverless mode: a workgroup is configured with a base RPU capacity and scales automatically from there, so billing follows RPU time consumed rather than the data scanned by each query.
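
As a rough illustration of the RPU-hour billing described above, the sketch below estimates the charge for one burst of activity. The function name `serverless_cost` and its parameters are hypothetical; the $0.36 rate and the 60-second minimum are taken from the text rather than from a current price list, and real bills depend on region and on how the workgroup scales.

```python
# Figures quoted in the text above; treat them as assumptions, not a price list.
RPU_HOUR_PRICE_USD = 0.36
MIN_BILLED_SECONDS = 60


def serverless_cost(active_seconds: float, rpus: int,
                    price_per_rpu_hour: float = RPU_HOUR_PRICE_USD) -> float:
    """Estimate the cost of one burst of activity at a fixed RPU level."""
    # Short bursts are rounded up to the minimum billing period.
    billed_seconds = max(active_seconds, MIN_BILLED_SECONDS)
    rpu_hours = rpus * billed_seconds / 3600
    return round(rpu_hours * price_per_rpu_hour, 4)


if __name__ == "__main__":
    print(serverless_cost(10, 8))    # 10-second query, billed as 60 seconds
    print(serverless_cost(3600, 8))  # one full hour at 8 RPUs
```

Under these assumptions a 10-second query on 8 RPUs is billed as a full minute, which is why many very short, widely spaced queries can cost more than their run time suggests.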

Azure - Azure Synapse Analytics / Microsoft Fabric

Azure Synapse Analytics is a PaaS unified analytics workspace combining dedicated SQL pools (provisioned data warehouse), serverless SQL pools (pay per TB processed), and Apache Spark pools. Microsoft Fabric, generally available since late 2023, is the strategic successor: a SaaS platform that unifies data engineering, warehousing, real-time analytics, and BI under OneLake, a single data lake that stores tables in Delta Lake format (Parquet files plus a transaction log). Synapse remains supported for existing workloads; new deployments should consider Fabric. Microsoft explicitly positions Fabric's Data Warehouse as "purpose-built for the demands of the 2020s." Synapse dedicated SQL pools store data in a proprietary columnar format; Fabric warehouses store data as Delta Lake tables.

OCI - Oracle Autonomous Data Warehouse (ADW)

ADW is a fully autonomous cloud data warehouse that automates provisioning, patching, tuning, backup, and scaling without DBA intervention. In 2025 Oracle renamed the offering "Autonomous AI Lakehouse," reflecting native Apache Iceberg support and integration with object storage for open-format data. ADW runs on Oracle Exadata infrastructure, which provides high I/O throughput. Auto-indexing and auto-partitioning reduce the need for manual tuning. ADW integrates closely with Oracle Analytics Cloud for end-to-end pipelines, and it is available in both serverless and dedicated deployment options.

GCP - BigQuery

BigQuery is fundamentally different from the other three: it is architecturally serverless, with complete storage-compute separation over Google's petabit-scale internal network. There are no clusters, no node provisioning decisions, and, with on-demand pricing, no capacity management. Compute scales automatically per query; storage scales independently. Pricing is either on-demand (list price $6.25 per TiB scanned since mid-2023, up from the earlier $5 per TB) or capacity-based (slots). BigQuery Serverless Spark (GA 2025) allows running Spark workloads directly inside BigQuery Studio without separate cluster provisioning. BigQuery supports open table formats (Iceberg, Delta, Hudi) via BigLake.

Feature | AWS Redshift | Azure Synapse / Fabric | OCI ADW | GCP BigQuery
Deployment model | Provisioned clusters + Serverless | PaaS (Synapse) / SaaS (Fabric) | Serverless + Dedicated (Exadata) | Fully serverless, no clusters
Serverless pricing | $0.36/RPU-hour (min 60s) | Per TB queried (serverless pools) | Pay per OCPU/storage | $6.25/TiB scanned (on-demand)
Storage format | Proprietary + Iceberg (S3) | Proprietary (Synapse) / Delta Lake (Fabric) | Proprietary + Iceberg (Object Storage) | Capacitor (native) + open formats via BigLake
Open table support | Apache Iceberg | Delta Lake (Fabric native) | Apache Iceberg | Iceberg, Delta, Hudi via BigLake
Auto-scaling | Yes (serverless RPU auto-scale) | Yes (serverless pools) | Yes (autonomous) | Always automatic
Cluster management required | Yes (provisioned) / No (serverless) | Yes (Synapse pools) / No (Fabric) | No (fully autonomous) | Never
Integrated ML | Redshift ML (SageMaker) | Synapse Analytics + Azure ML | AutoML, OML in-database | BigQuery ML (native SQL)
Unique differentiator | Deep Iceberg lakehouse ecosystem | Fabric OneLake unified SaaS | Oracle Exadata performance, autonomous ops | No-infrastructure serverless; per-query pricing

Key differentiators:

  • BigQuery's serverless model eliminates infrastructure decisions: with on-demand pricing there is no cluster or capacity setting to manage, although capacity-based (slot) pricing reintroduces a capacity choice.
  • ADW's autonomous operations (auto-indexing, auto-patching, auto-scaling) differentiate OCI for organizations running Oracle workloads who want zero DBA overhead.
  • Microsoft Fabric's OneLake and Delta Lake foundation unifies BI, engineering, and warehousing in one SaaS subscription, but represents a migration from existing Synapse investments.
  • Redshift's tight Iceberg integration across the entire AWS analytics ecosystem (EMR, Athena, Glue) provides the broadest multi-engine open lakehouse on a single cloud.

2. Streaming / Real-Time Data

Streaming services ingest, transport, and process high-throughput event data in real time, supporting event-driven architectures, IoT pipelines, and real-time analytics.

AWS - Amazon Kinesis + Amazon MSK

The Kinesis family covers four distinct services: Kinesis Data Streams (managed real-time data streaming with millisecond latency and retention of up to 365 days), Amazon Data Firehose (formerly Kinesis Data Firehose; loads streaming data directly into S3, Redshift, OpenSearch, and third-party destinations without custom consumer code), Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics; stream processing in SQL, Java, or Python), and Kinesis Video Streams (video ingest and playback). Amazon MSK (Managed Streaming for Apache Kafka) is the fully managed Kafka offering for teams that need Kafka API compatibility. MSK Serverless provides Kafka without cluster provisioning.

Azure - Azure Event Hubs + Azure Stream Analytics

Azure Event Hubs is a fully managed big data streaming platform and event ingestion service with Apache Kafka protocol compatibility (Kafka producers and consumers connect without code changes). Event Hubs Capture writes raw streams directly to Azure Blob Storage or ADLS Gen2. Azure Stream Analytics is the managed real-time query engine: it reads from Event Hubs (or IoT Hub), applies SQL-like queries with windowing functions, and writes results to storage, databases, or Power BI. Microsoft Fabric Real-Time Intelligence (formerly Real-Time Analytics) provides an integrated streaming analytics workload within the Fabric SaaS platform.
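
To make the windowing idea concrete, here is a minimal plain-Python sketch of a tumbling-window count, the simplest of the window types that streaming SQL engines offer. It is not Stream Analytics query syntax; the function name `tumbling_window_counts` and the `(timestamp, key)` event shape are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(events: Iterable[Tuple[float, str]],
                           window_seconds: int) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping time windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0.5, "sensor-a"), (3.0, "sensor-a"), (9.9, "sensor-b"),
              (10.0, "sensor-a"), (14.2, "sensor-a")]
    for (start, key), n in sorted(tumbling_window_counts(sample, 10).items()):
        print(f"window starting at {start}s, {key}: {n} events")
```

A managed engine adds what this sketch leaves out: handling of late or out-of-order events, state that survives restarts, and continuous output as each window closes.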

OCI - OCI Streaming + OCI Streaming with Apache Kafka

OCI Streaming is the native managed event streaming service and is compatible with Apache Kafka APIs. OCI also offers a separate fully managed Kafka service, "Streaming with Apache Kafka," which is 100% API-compatible with Apache Kafka, allowing existing Kafka applications to migrate without code changes. Oracle claims up to 31% lower cost than Amazon MSK and up to 73% lower cost than Confluent. OCI GoldenGate provides real-time change data capture (CDC) from databases (Oracle, MySQL, PostgreSQL, SQL Server) for continuous replication and event generation.

GCP - Pub/Sub + Dataflow

Cloud Pub/Sub is Google's fully managed, serverless real-time messaging service: global, durable, with at-least-once delivery, and reachable from Kafka through a Kafka connector. It underpins Google's internal messaging at scale. Dataflow is Google's managed runner for Apache Beam pipelines, providing unified batch and streaming data processing in which the same pipeline code handles both. Pub/Sub feeds Dataflow, which processes and writes results to BigQuery or Cloud Storage. Datastream is a serverless CDC and replication service that reads changes from operational databases (Oracle, MySQL, PostgreSQL, AlloyDB) into BigQuery or Cloud Storage in real time.
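
At-least-once delivery means a subscriber may see the same message more than once, so consumers are usually written to be idempotent. The sketch below shows one common approach, deduplicating on a message ID, in plain Python. It does not use the Pub/Sub client library; `process_at_least_once`, the `(message_id, payload)` tuples, and the in-memory `seen_ids` set are hypothetical stand-ins for a real subscriber and a durable deduplication store.

```python
from typing import Callable, Iterable, Optional, Set, Tuple


def process_at_least_once(messages: Iterable[Tuple[str, object]],
                          handler: Callable[[object], None],
                          seen_ids: Optional[Set[str]] = None) -> int:
    """Apply handler once per distinct message ID; return how many were applied."""
    seen = set() if seen_ids is None else seen_ids
    applied = 0
    for message_id, payload in messages:
        if message_id in seen:
            # Redelivery of a message that was already handled: skip it.
            continue
        handler(payload)
        # Record the ID only after the handler succeeds, so a failure
        # leaves the message eligible for reprocessing.
        seen.add(message_id)
        applied += 1
    return applied


if __name__ == "__main__":
    totals = []
    deliveries = [("m1", 5), ("m2", 7), ("m1", 5)]  # "m1" is delivered twice
    print(process_at_least_once(deliveries, totals.append), totals)
```

In production the set of seen IDs would need to live in durable storage, or the downstream write itself would need to be idempotent (for example, an upsert keyed on the message ID).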

Feature | AWS | Azure | OCI | GCP
Managed Kafka | Amazon MSK / MSK Serverless | Event Hubs (Kafka-compatible) | OCI Streaming with Apache Kafka | Pub/Sub (Kafka connector available)
Native streaming ingest | Kinesis Data Streams | Event Hubs | OCI Streaming | Pub/Sub
Stream processing engine | Managed Service for Apache Flink | Stream Analytics / Fabric Real-Time Intelligence | OCI GoldenGate (CDC/streaming) | Dataflow (Apache Beam)
Serverless stream ingest | Kinesis Data Streams (serverless scaling) | Event Hubs (Kafka-compatible, auto-inflate) | OCI Streaming (serverless) | Pub/Sub (always serverless)
CDC / database replication | AWS DMS + Kinesis | Event Hubs + Azure DMS | OCI GoldenGate | Datastream
Delivery to data warehouse | Data Firehose → Redshift / S3 | Event Hubs Capture → ADLS / Synapse | OCI GoldenGate → ADW | Pub/Sub → Dataflow → BigQuery
Video streaming | Kinesis Video Streams | Azure Video Indexer | None native | None native

Key differentiators:

  • AWS has the most granular streaming service decomposition: separate managed services for ingest (Kinesis Data Streams), delivery (Firehose), stream SQL (Flink), and Kafka (MSK).
  • OCI GoldenGate is uniquely positioned for Oracle-to-Oracle CDC scenarios and supports ZeroETL Mirror for direct, low-latency replication into ADW.
  • GCP Dataflow's Apache Beam model is built so that the same pipeline code runs as either a batch or a streaming job, which removes the need to maintain separate batch and streaming codebases.
  • Azure Event Hubs' Kafka protocol endpoint lets existing Kafka producers and consumers connect with configuration changes rather than code changes, giving teams a low-friction path for moving Kafka workloads to Azure without operating Kafka clusters.

3. ETL / Data Integration

ETL and data integration services extract data from source systems, transform it, and load it into analytical targets. Modern services increasingly offer visual no-code/low-code pipeline design alongside code-based options.

AWS - AWS Glue

AWS Glue is the primary serverless ETL service: it auto-generates ETL code (PySpark in Python, or Spark in Scala), provides a Data Catalog for schema discovery and metadata management, and includes Glue DataBrew for visual no-code data preparation. Glue crawlers auto-discover schemas from S3, databases, and other sources. Glue Studio provides a visual drag-and-drop job authoring interface. AWS Data Pipeline is an older orchestration service for moving data between AWS services; new workloads should prefer Glue or Step Functions. AWS AppFlow handles SaaS-to-AWS data integration (Salesforce, SAP, etc.).
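
The schema discovery that crawlers perform can be pictured with a small sketch: scan sample records, note the type seen for each field, and widen the type when records disagree. This is a simplified illustration in plain Python, not the Glue crawler's actual algorithm or API; `infer_schema` and its widening rules (int plus float becomes float, any other conflict becomes str) are assumptions.

```python
from typing import Dict, List


def infer_schema(records: List[dict]) -> Dict[str, str]:
    """Infer one type name per field from a sample of records."""
    schema: Dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            observed = type(value).__name__
            previous = schema.get(field)
            if previous is None or previous == observed:
                schema[field] = observed
            elif {previous, observed} == {"int", "float"}:
                # Mixed numeric values widen to float.
                schema[field] = "float"
            else:
                # Any other disagreement falls back to string.
                schema[field] = "str"
    return schema


if __name__ == "__main__":
    sample = [
        {"id": 1, "amount": 2},
        {"id": 2, "amount": 3.5, "note": "rush order"},
        {"id": "A3", "amount": 1},
    ]
    print(infer_schema(sample))
```

A real crawler additionally detects file formats and partitions, and writes the result to a catalog rather than returning a dictionary.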

Azure - Azure Data Factory + Microsoft Fabric Data Engineering

Azure Data Factory (ADF) is the cloud-scale ETL and data orchestration service with 90+ built-in connectors, a visual pipeline canvas, and support for mapping data flows (code-free Spark-based transformations). ADF integrates natively with Azure Synapse and Azure Databricks. Within Microsoft Fabric, the Data Engineering workload (notebooks, Spark jobs, and data pipelines built on a Fabric-native version of ADF pipelines) is the preferred path for new Fabric deployments. Azure Databricks, a first-party Azure service based on Databricks, is widely used for large-scale PySpark ETL.

OCI - OCI Data Integration + Oracle GoldenGate

OCI Data Integration is Oracle's fully managed serverless ETL/ELT service: it offers visual no-code pipeline design, supports bulk data loading into ADW and Object Storage, and includes data quality and lineage capabilities. OCI GoldenGate's Data Transforms feature (added 2023) extends GoldenGate with batch ETL/ELT processing via drag-and-drop pipelines, enabling a single GoldenGate deployment to handle both real-time CDC and batch data integration. Oracle Data Integrator (ODI) is the on-premises ETL product, available as a cloud edition for lift-and-shift scenarios.
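
Conceptually, a CDC pipeline turns a stream of insert, update, and delete events into an up-to-date copy of the source table. The sketch below applies such events to an in-memory replica. It is a generic illustration, not GoldenGate's trail format or API; the event shape (`op`, `key`, `row`) and the function name `apply_change_events` are hypothetical.

```python
from typing import Dict, List


def apply_change_events(replica: Dict[object, dict], events: List[dict]) -> Dict[object, dict]:
    """Apply ordered change events to a replica keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Merge changed columns over whatever the replica already holds.
            replica[key] = {**replica.get(key, {}), **event["row"]}
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


if __name__ == "__main__":
    changes = [
        {"op": "insert", "key": 1, "row": {"name": "widget", "qty": 1}},
        {"op": "update", "key": 1, "row": {"qty": 2}},
        {"op": "insert", "key": 2, "row": {"name": "gadget"}},
        {"op": "delete", "key": 2},
    ]
    print(apply_change_events({}, changes))
```

Event order matters here: applying the same events out of sequence can produce a different replica, which is why CDC tools preserve the source's commit order.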

GCP - Cloud Data Fusion + Dataform + Dataprep

Cloud Data Fusion is GCP's fully managed, code-free ETL service based on the open-source CDAP framework. It supports 150+ pre-built connectors and visual pipeline design, and it generates Dataproc (Spark) or Dataflow jobs for execution. Dataprep by Trifacta (integrated into GCP as Google Cloud Dataprep) provides interactive visual data wrangling and generates Dataflow jobs; it is best suited to small and medium datasets and ad hoc data preparation. Dataform is a SQL-based transformation tool for building and testing data models in BigQuery (similar to dbt), integrated into BigQuery Studio as of 2023.

Feature | AWS Glue | Azure Data Factory / Fabric | OCI Data Integration / GoldenGate | GCP Data Fusion / Dataform
Visual pipeline design | Glue Studio | ADF Canvas / Fabric Pipelines | OCI DI Visual Designer / GoldenGate Data Transforms | Cloud Data Fusion Canvas
Serverless execution | Yes (Glue serverless Spark) | Yes (Mapping Data Flows on serverless Spark) | Yes (OCI DI serverless) | Yes (Dataflow backend)
Code-based ETL | PySpark / Scala | PySpark (Databricks) | PySpark (OCI Data Flow) | Python / SQL (Dataflow / Dataform)
Built-in connectors | 100+ | 90+ | Oracle-focused + common databases | 150+ (Data Fusion)
SaaS integration | AWS AppFlow | ADF + Power Platform connectors | Oracle Integration Cloud (OIC) | Cloud Data Fusion connectors
Data wrangling / prep | Glue DataBrew | ADF Wrangling Data Flows / Power Query | OCI GoldenGate Data Transforms | Cloud Dataprep (Trifacta)
SQL transformation | dbt (third-party) | dbt (third-party) or Synapse pipelines | OCI GoldenGate ELT | Dataform (native, BigQuery-integrated)
Schema crawling | Glue Crawlers → Data Catalog | ADF with Purview integration | OCI Data Catalog integration | Dataplex auto-discovery

Key differentiators:

  • AWS Glue Crawlers + Data Catalog provide the most automated schema discovery and metadata management, tightly coupled to the overall AWS analytics ecosystem.
  • OCI's combination of OCI Data Integration (batch ETL) and OCI GoldenGate (real-time CDC + batch ELT in one service) is uniquely suited for Oracle Database source systems.
  • GCP's Dataform is native to BigQuery Studio, making SQL-based transformation and dbt-style data modeling a first-class citizen with no additional tooling.
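  • Azure Data Factory's 90+ connectors, mapping data flows, and Fabric pipeline integration make it a strong option for heterogeneous SaaS and on-premises sources, particularly in Microsoft-centric estates; by raw connector count, the figures in the table above put Cloud Data Fusion (150+) and AWS Glue (100+) ahead.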

4. Data Lakes

Data lake services provide scalable object storage with metadata, access control, and governance features for storing raw and processed data in open formats.

AWS - Amazon S3 + AWS Lake Formation

Amazon S3 is the underlying storage for all AWS data lake architectures. AWS Lake Formation is the governance and security layer on top of S3: it provides centralized data lake setup, column- and row-level fine-grained access control via Lake Formation permissions (extending IAM), tag-based access control, cross-account data sharing, and integration with AWS Glue Data Catalog for schema management. Lake Formation does not replace S3; it governs access to data stored in S3. Lake Formation governed tables were introduced to provide ACID transactions on S3; AWS has since steered transactional data lake workloads toward open table formats such as Apache Iceberg.
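
The effect of column- and row-level permissions can be sketched in a few lines: a grant names the columns a principal may read and a filter that selects the rows it may see. The code below is a conceptual illustration only, not the Lake Formation API; `apply_grant`, `allowed_columns`, and `row_filter` are hypothetical names.

```python
from typing import Callable, Dict, List


def apply_grant(rows: List[Dict[str, object]],
                allowed_columns: List[str],
                row_filter: Callable[[Dict[str, object]], bool]) -> List[Dict[str, object]]:
    """Return only permitted rows, projected onto permitted columns."""
    visible = []
    for row in rows:
        if row_filter(row):  # row-level restriction
            # Column-level restriction: drop everything not explicitly allowed.
            visible.append({col: row[col] for col in allowed_columns if col in row})
    return visible


if __name__ == "__main__":
    employees = [
        {"id": 1, "region": "EU", "salary": 100},
        {"id": 2, "region": "US", "salary": 200},
    ]
    # An analyst who may see only EU rows and never the salary column.
    print(apply_grant(employees, ["id", "region"], lambda r: r["region"] == "EU"))
```

In a governed lake the equivalent filtering is enforced by the service when an engine requests data, so every engine sees the same restricted view.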

Azure - Azure Data Lake Storage Gen2 (ADLS Gen2) + Microsoft Fabric OneLake

ADLS Gen2 extends Azure Blob Storage with a hierarchical namespace (directory/file semantics), POSIX ACLs, Azure RBAC, AES-256 encryption, and petabyte-scale throughput optimized for analytics engines (Spark, Hive, Presto). It is the standard storage layer for Azure analytics workloads (Synapse, Databricks, HDInsight). Microsoft Fabric OneLake is the next-generation data lake built on top of ADLS Gen2: a single, unified data lake per Fabric tenant, where all Fabric workloads (warehouses, lakehouses, notebooks) automatically store tabular data in Delta Lake format. OneLake eliminates data silos by providing one storage location accessible to all Fabric engines.

OCI - OCI Object Storage + Oracle Intelligent Data Lake (preview)

OCI Object Storage is the S3-compatible object store underpinning OCI data lake architectures. It supports the Apache Iceberg table format, AES-256 encryption, IAM-based access control, and lifecycle management. Data transfer within OCI is free and the first 10 TB/month of external egress is included, making it notably cheaper than Amazon S3 or ADLS Gen2 for data-intensive workloads. Oracle Intelligent Data Lake (announced at Oracle CloudWorld 2024, limited availability 2025) is an emerging managed data lake platform that combines a unified catalog, Apache Spark and Flink processing, and Jupyter notebooks for data science in one experience on top of Object Storage.

GCP - Cloud Storage + BigLake + Dataplex

Google Cloud Storage (GCS) is the object storage layer. BigLake is GCP's unified storage engine spanning structured data in BigQuery and open-format data in GCS; it provides consistent fine-grained access control (column- and row-level), caching, and query acceleration regardless of whether data sits in BigQuery native storage or in external files (Iceberg, Delta, Parquet, ORC). Dataplex is the data mesh and data lake governance service: it auto-discovers data across GCS and BigQuery, applies unified policies, creates logical data zones (raw, curated, production), and enables cross-project data quality and lineage.

Feature | AWS S3 + Lake Formation | Azure ADLS Gen2 + OneLake | OCI Object Storage + IDL | GCP GCS + BigLake + Dataplex
Underlying storage | Amazon S3 | Azure Blob Storage (ADLS Gen2) | OCI Object Storage | Google Cloud Storage
Hierarchical namespace | No (flat S3 prefix) | Yes (POSIX ACLs, directory model) | No (flat object model) | No (flat bucket model)
Fine-grained access control | Lake Formation (column/row/tag) | Azure RBAC + POSIX ACLs | OCI IAM Policies | BigLake (column/row via policy tags)
Open table format | Apache Iceberg (Lake Formation governed) | Delta Lake (Fabric OneLake) | Apache Iceberg | Iceberg, Delta, Hudi (BigLake)
Unified multi-engine access | Via Glue Catalog + Lake Formation | OneLake (all Fabric engines) | OCI Data Catalog + Object Storage | BigLake (BigQuery + GCS)
Data mesh / governance | Glue + Lake Formation + DataZone | Fabric OneLake + Purview | OCI Data Catalog + IDL (preview) | Dataplex
Egress costs | Standard S3 egress rates | Standard Azure egress rates | 10 TB/month free external egress | Standard GCS egress rates
Managed data lake SaaS | Lake Formation (governance only) | Fabric OneLake (full SaaS) | IDL (preview) | Dataplex (governance)

Key differentiators:

  • Microsoft Fabric OneLake is the most opinionated data lake solution: a single unified SaaS lake per tenant, automatically shared across all Fabric workloads with zero copy, eliminating the integration overhead of separate storage accounts.
  • AWS Lake Formation provides the most mature and granular column- and row-level access control for S3-based data lakes, with tag-based policies and cross-account sharing.
  • OCI Object Storage's egress pricing advantage (effectively free within OCI) is significant for large-scale data movement workloads common in analytics pipelines.
  • GCP BigLake's ability to enforce the same column- and row-level policies on both BigQuery native tables and external GCS files (Iceberg/Parquet) is architecturally unique and simplifies governance across the storage boundary.

5. Business Intelligence

Business intelligence (BI) services provide managed dashboarding, reporting, and self-service analytics with governed semantic layers and embedded analytics capabilities.

AWS - Amazon QuickSight

QuickSight is AWS's serverless, pay-per-session cloud BI service. It uses the SPICE engine (Super-fast, Parallel, In-memory Calculation Engine) to cache and accelerate queries, enabling sub-second dashboard load times for large datasets. QuickSight connects natively to Redshift, S3, Athena, RDS, and third-party sources. The Q feature provides natural language querying against datasets. QuickSight Embedded enables BI embedding in custom applications. Pricing is per author (fixed monthly fee) or per reader session ($0.30/session), which keeps costs low when most readers open dashboards only occasionally.
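
The per-session reader model is easiest to judge with a quick calculation. The sketch below totals a month of reader charges from a list of session counts. The function name `monthly_reader_cost` is hypothetical, the $0.30 default comes from the text above, and the optional `monthly_cap` parameter is an assumption added for illustration (the text does not state a cap), so check current pricing before relying on it.

```python
from typing import Iterable, Optional


def monthly_reader_cost(sessions_per_reader: Iterable[int],
                        price_per_session: float = 0.30,
                        monthly_cap: Optional[float] = None) -> float:
    """Total monthly reader charges given each reader's session count."""
    total = 0.0
    for sessions in sessions_per_reader:
        cost = sessions * price_per_session
        if monthly_cap is not None:
            # Assumed per-reader ceiling; pass None to model uncapped billing.
            cost = min(cost, monthly_cap)
        total += cost
    return round(total, 2)


if __name__ == "__main__":
    # Three readers: occasional, inactive, and heavy.
    usage = [2, 0, 40]
    print(monthly_reader_cost(usage))
    print(monthly_reader_cost(usage, monthly_cap=5.0))
```

The inactive reader contributes nothing to the bill, which is the property that makes session pricing attractive for large audiences of occasional viewers.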

Azure - Microsoft Power BI

Power BI is Microsoft's BI platform and has the broadest enterprise adoption among the four providers' BI offerings. It is a SaaS service integrated into Microsoft 365 and natively connected to all Azure data sources, Dynamics 365, and hundreds of third-party connectors. Power BI Premium provides dedicated capacity for large organizations and enables paginated reports, deployment pipelines, and XMLA endpoint access for third-party tools. Within Microsoft Fabric, Power BI is a first-class workload: Direct Lake mode connects Power BI semantic models directly to OneLake Delta tables without data import, eliminating the traditional refresh cycle.

OCI - Oracle Analytics Cloud (OAC)

Oracle Analytics Cloud is Oracle's managed BI and analytics platform on OCI. It provides self-service data discovery, dashboards, pixel-perfect reports, augmented analytics (AI-generated insights), and integration with ADW and Oracle Database. OAC supports Essbase for multidimensional OLAP analysis. The Analytics AI Assistant (GA 2025) enables natural language queries and AI-generated visualizations. OAC can be provisioned alongside ADW through OCI Resource Analytics for an integrated BI and data warehouse deployment.

GCP - Looker + Looker Studio

Looker is Google's enterprise BI platform, acquired in 2020 and centered on LookML, a declarative, YAML-like semantic modeling language that defines business metrics, dimensions, and relationships centrally. LookML ensures consistent metric definitions across all dashboards and self-service reports. Looker integrates natively with BigQuery as its primary data source and connects to all major databases. Looker Studio (formerly Data Studio) is Google's free, lightweight reporting tool for building shareable dashboards from a wide range of sources; it lacks the governed semantic layer of Looker proper. Looker Studio Pro adds organizational sharing and SLA support.
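
The idea behind a governed semantic layer is that metrics are defined once and every query is generated from those definitions. The sketch below mimics that idea in plain Python. It is not LookML and does not reflect Looker's SQL generation; the `METRICS` and `DIMENSIONS` dictionaries, the `orders` table, and `compile_query` are all hypothetical.

```python
# Central definitions: the only place where business logic is written down.
METRICS = {
    "total_revenue": "SUM(order_amount)",
    "order_count": "COUNT(*)",
}
DIMENSIONS = {
    "region": "region",
    "order_month": "DATE_TRUNC(order_date, MONTH)",
}


def compile_query(table: str, metrics: list, dimensions: list) -> str:
    """Build a SQL string from named metrics and dimensions.

    Unknown names raise KeyError, so consumers cannot invent ad hoc metrics.
    """
    select_parts = [f"{DIMENSIONS[d]} AS {d}" for d in dimensions]
    select_parts += [f"{METRICS[m]} AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        # Group by the positional index of each dimension column.
        positions = ", ".join(str(i) for i in range(1, len(dimensions) + 1))
        sql += f" GROUP BY {positions}"
    return sql


if __name__ == "__main__":
    print(compile_query("orders", ["total_revenue"], ["region"]))
```

Because every dashboard would call the same definitions, changing how `total_revenue` is calculated means editing one entry, and that change can be reviewed and versioned like any other code.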

Feature | AWS QuickSight | Azure Power BI | OCI Analytics Cloud | GCP Looker / Looker Studio
In-memory caching | SPICE engine | Import mode (VertiPaq) | OAC in-memory caching | LookML-defined caching / BigQuery materialization
Semantic / metric layer | QuickSight datasets | Power BI semantic models (formerly datasets) | OAC subject areas | LookML (Looker)
Natural language query | QuickSight Q | Power BI Q&A | AI Assistant (GA 2025) | Looker conversational analytics
Self-service / free tier | Reader session pricing | Free Desktop / Power BI Free | None (OAC is paid) | Looker Studio (free)
Embedded analytics | QuickSight Embedded | Power BI Embedded | OAC embedded | Looker Embedded
OLAP / multidimensional | Limited | Analysis Services | Oracle Essbase | None native
Native warehouse connection | Redshift, Athena, S3 | Azure Synapse, ADX, Fabric OneLake | ADW, Oracle Database | BigQuery (primary), all via connectors
Pricing model | Per author/reader-session | Per user or capacity (Premium) | OCPU-hour (fixed capacity) | Looker: per user; Looker Studio: free

Key differentiators:

  • Power BI has the largest installed base and deepest Microsoft 365 integration; Fabric Direct Lake mode eliminates the data import latency that traditionally made large datasets slow in Power BI.
  • Looker's LookML semantic layer is architecturally unique among cloud-native BI tools: business logic is defined once in code, versioned in Git, and enforced consistently across all consumers.
  • OAC with Essbase provides multidimensional OLAP analysis that AWS and GCP lack natively (Azure's closest equivalent is Analysis Services), which is relevant for financial planning and enterprise performance management use cases.
  • QuickSight's per-session reader pricing ($0.30/session) is the most cost-effective model for large numbers of occasional report consumers.

6. Data Catalog / Governance

Data catalog and governance services provide metadata management, data discovery, lineage tracking, data quality monitoring, and access policy enforcement across data estates.

AWS - AWS Glue Data Catalog + Amazon DataZone + AWS Lake Formation

AWS Glue Data Catalog is the central metadata repository for all AWS analytics services (Glue, Athena, Redshift Spectrum, EMR). Crawlers populate the catalog automatically from S3, databases, and streaming sources. Lake Formation uses the Glue Data Catalog as its underlying schema store and adds fine-grained access control, data governance policies, and cross-account sharing. Amazon DataZone (GA 2023) is the data mesh governance service: it provides a business catalog with a personalized search portal, data products, subscription-based access workflows, and lineage across AWS, on-premises, and SaaS sources.

Azure - Microsoft Purview

Microsoft Purview (formerly Azure Purview) is the unified data governance and compliance platform covering the entire Microsoft data estate: Azure data services, Microsoft 365, Power BI, on-premises SQL Server, and multi-cloud sources. It provides automated data discovery and classification, end-to-end data lineage visualization, sensitivity labels, a business glossary, and integration with Microsoft Defender for data security. Purview integrates directly into Fabric workspaces for unified governance of OneLake data.

OCI - OCI Data Catalog

OCI Data Catalog is Oracle's managed metadata repository for OCI data assets (Object Storage, ADW, databases, OCI Big Data). It provides a searchable inventory of enterprise data assets, automated schema extraction through harvesting jobs (analogous to crawlers), a business glossary, data lineage, and tag-based discovery. OCI Data Catalog integrates with OCI Data Integration and OCI Data Flow for pipeline-level lineage. As of 2025, OCI Data Catalog capabilities are being integrated into the Oracle Intelligent Data Lake platform.

GCP - Dataplex + Analytics Hub

Dataplex is GCP's intelligent data fabric and governance service: it discovers data across GCS buckets and BigQuery datasets, organizes data into Lakes, Zones, and Assets, applies unified IAM and VPC Service Controls policies, enables cross-project data quality rules, and tracks lineage. Dataplex Universal Catalog (previously Cloud Data Catalog) provides searchable metadata for BigQuery, GCS, Pub/Sub, and external sources. Analytics Hub is GCP's data exchange marketplace: organizations publish BigQuery datasets as listings, and subscribers query the shared data in place through read-only linked datasets rather than copies, enabling large-scale governed data sharing and monetization.

Feature | AWS Glue Data Catalog + DataZone | Azure Purview | OCI Data Catalog | GCP Dataplex + Analytics Hub
Auto-discovery / crawling | Glue Crawlers (S3, databases, streaming) | Automated scanning (Azure + multi-cloud) | Harvesting jobs (OCI assets) | Dataplex auto-discovery
Lineage tracking | DataZone lineage | Purview end-to-end lineage | OCI DI / Data Flow lineage | Dataplex lineage (BigQuery, Dataflow, Spark)
Business glossary | DataZone glossary | Purview business glossary | OCI Data Catalog glossary | Dataplex business glossary
Data quality | AWS Glue Data Quality | Purview Data Health | OCI Data Catalog quality rules | Dataplex data quality tasks
Multi-cloud / hybrid | DataZone (AWS + on-prem + SaaS) | Purview (Azure + AWS + GCP + on-prem) | OCI assets only | Dataplex (GCP + BigQuery connector)
Data sharing / exchange | AWS Data Exchange | Purview + Azure Data Share | No native exchange | Analytics Hub (BigQuery data sharing)
Access governance | Lake Formation policies | Purview + Azure RBAC | OCI IAM + Data Catalog tags | BigLake + Dataplex policies
Sensitivity classification | Amazon Macie (S3) | Purview sensitivity labels (Microsoft 365 labels) | OCI Cloud Guard | Sensitive Data Protection (DLP)

Key differentiators:

  • Microsoft Purview has the broadest multi-cloud and hybrid coverage, extending governance to AWS, GCP, SAP, and on-premises sources from a single pane, and is uniquely integrated with Microsoft 365 compliance and sensitivity labels.
  • GCP Analytics Hub enables zero-copy, cross-organization sharing of BigQuery datasets: subscribers query shared data in place through linked datasets, which supports both internal data products and commercial data monetization without moving data out of BigQuery.
  • AWS DataZone's data product and subscription model is the most structured approach to data mesh governance, enforcing access workflows and subscriptions across organizational boundaries.
  • OCI Data Catalog's integration scope is narrower than the other three, primarily optimized for OCI-native assets; multi-cloud governance scenarios require third-party tooling.

7. Hadoop / Spark (Distributed Compute)

Managed big data processing services provide Hadoop and Apache Spark clusters for large-scale distributed data processing, eliminating the need to manually provision and configure worker nodes.

AWS - Amazon EMR

Amazon EMR (Elastic MapReduce) is the long-standing managed Hadoop and Spark platform supporting Hive, Presto, HBase, Flink, Hudi, and dozens of other open-source frameworks. It has three deployment options: EMR on EC2 (full cluster control), EMR on EKS (Spark jobs on EKS pods), and EMR Serverless (submit Spark or Hive jobs without managing clusters; worker capacity is provisioned automatically per job and billed per vCPU-second and GB-second). EMR Serverless now supports serverless storage (eliminating local disk provisioning), reducing costs by up to 20% and preventing disk-related job failures. Persistent clusters, transient job clusters, and spot-optimized instance fleets are all supported.

Azure - Azure HDInsight + Azure Databricks

Azure HDInsight is Microsoft's managed Hadoop service (Hadoop, Spark, Kafka, HBase, and Storm clusters). HDInsight 5.0 was retired on March 31, 2025; HDInsight 5.1 is the current supported version, and Microsoft strongly recommends Azure Databricks or Microsoft Fabric for new workloads. Azure Databricks (a first-party Azure service, jointly engineered with Databricks) is the strategic platform for large-scale PySpark, Delta Lake, and ML workloads on Azure. Databricks offers serverless compute, job clusters, SQL warehouses, and Unity Catalog for data governance.

OCI - OCI Big Data Service + OCI Data Flow

OCI Big Data Service is the managed Hadoop cluster service: it provisions Oracle Distribution including Apache Hadoop (ODAH) clusters with Spark, Hive, Kafka, Hue, and other components, and it supports auto-scaling and mixed-shape clusters (general compute plus high-performance storage). OCI Data Flow is OCI's managed Apache Spark service: users submit Spark applications (Python, Java, Scala, SQL) without managing cluster infrastructure. Data Flow provisions compute per job, auto-scales, and shuts down after job completion, with pricing per OCPU-minute. It integrates natively with Object Storage and ADW.

GCP - Cloud Dataproc

Cloud Dataproc is Google's managed Hadoop and Spark service, supporting Hadoop, Spark, Flink, Hive, and Presto on GCE VMs or GKE. Dataproc Serverless (for Spark batches and interactive notebooks) provisions Spark workers per job without a persistent cluster, charges per vCPU-second and GB-second, and is the preferred path for new Spark workloads. Dataproc supports open table formats natively (Iceberg, Delta, Hudi), integrates with GCS for storage, and publishes output to BigQuery. As of 2025, BigQuery Serverless Spark is GA, so Spark workloads can run directly inside BigQuery Studio without a separate Dataproc cluster.

Feature | AWS EMR | Azure HDInsight + Databricks | OCI Big Data + Data Flow | GCP Dataproc
Managed Hadoop | EMR on EC2 | HDInsight 5.1 (legacy path) | OCI Big Data Service | Dataproc Standard clusters
Managed Spark (serverless) | EMR Serverless | Azure Databricks Serverless | OCI Data Flow | Dataproc Serverless
Kubernetes-based execution | EMR on EKS | Databricks on AKS | None native | Dataproc on GKE
Strategic platform | EMR (all tiers) | Databricks (new workloads) | OCI Data Flow | Dataproc Serverless + BigQuery Spark
Spot / preemptible workers | EC2 Spot | Azure Spot VMs | OCI Preemptible VMs | Spot VMs
Supported frameworks | Hadoop, Spark, Hive, Presto, Flink, HBase, Hudi | Spark, Delta Lake (Databricks), Hive (HDInsight) | Hadoop, Spark, Hive, Kafka (Big Data); Spark (Data Flow) | Hadoop, Spark, Flink, Hive, Presto
Open table format | Iceberg, Hudi, Delta | Delta Lake (native in Databricks) | Iceberg (Object Storage + Data Flow) | Iceberg, Delta, Hudi
Integrated BI | QuickSight | Power BI (Databricks connector) | Oracle Analytics Cloud | Looker / BigQuery

Key differentiators:

  • Azure's strategic shift to Databricks as the primary Spark platform (with HDInsight in legacy maintenance mode) gives Azure Databricks Unity Catalog a uniquely strong position for Delta Lake lakehouse architectures.
  • GCP's BigQuery Serverless Spark collapses the boundary between the data warehouse and Spark: Spark workloads run inside BigQuery Studio against BigQuery or GCS data without switching tools or clusters.
  • EMR Serverless's per-job auto-provisioning model eliminates persistent cluster costs while retaining access to the full EMR framework ecosystem.
  • OCI Data Flow's OCPU-minute pricing and zero cluster management make it a cost-effective Spark option; OCI Big Data Service serves the full Hadoop ecosystem for teams requiring Hive, HBase, or Kafka.
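
The cost argument behind per-job provisioning can be sketched with simple arithmetic. The rates, job counts, and function names below are hypothetical placeholders, not published prices for any provider; the sketch only shows the shape of the comparison between an always-on cluster and per-job billing.

```python
def persistent_cluster_cost(hourly_rate: float, hours_in_period: float) -> float:
    """Cost of a cluster that stays up for the whole period, busy or idle."""
    return hourly_rate * hours_in_period


def per_job_cost(hourly_rate: float, jobs: int, hours_per_job: float) -> float:
    """Cost when compute exists only while a job is running."""
    return hourly_rate * jobs * hours_per_job


def break_even_utilization(cluster_rate: float, per_job_rate: float) -> float:
    """Fraction of the period a cluster must be busy before it becomes cheaper.

    Derived from: per_job_rate * busy_hours == cluster_rate * period_hours.
    """
    return cluster_rate / per_job_rate


if __name__ == "__main__":
    # Hypothetical inputs: a 30-day month, one 2-hour job per day.
    CLUSTER_RATE = 4.0   # assumed cost per hour for an always-on cluster
    PER_JOB_RATE = 6.0   # assumed cost per hour of equivalent per-job compute
    print("always-on:", persistent_cluster_cost(CLUSTER_RATE, 24 * 30))
    print("per-job:  ", per_job_cost(PER_JOB_RATE, jobs=30, hours_per_job=2.0))
    print("break-even utilization:", break_even_utilization(CLUSTER_RATE, PER_JOB_RATE))
```

Under these assumed numbers the per-job model wins by a wide margin even at a higher unit rate, because the cluster is idle for most of the month. A persistent cluster only pays off once its utilization exceeds the break-even fraction.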

8. Batch Processing

Batch processing services schedule and execute large-scale compute jobs against datasets, distinct from streaming (which is continuous) and interactive warehousing (which is query-on-demand).

AWS — AWS Batch + AWS Glue (ETL jobs) + Amazon EMR (transient clusters) AWS Batch is the dedicated batch compute orchestration service: it manages job queues, compute environments (EC2 or Fargate), and job dependencies. AWS Batch dynamically provisions the optimal instance type and count for each batch job, supporting containerized applications on ECS, EKS, or Fargate. Step Functions integrates with AWS Batch for dependency-chained workflows with retries and conditional logic. For data-centric batch ETL, AWS Glue serverless jobs and EMR transient clusters are the more common patterns. Amazon EventBridge Scheduler provides time-based job triggering, and EventBridge rules provide event-based triggering.
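
As an illustration of job dependencies in AWS Batch, the following minimal sketch submits a chain of jobs in which each job waits for the previous one. It assumes the request shape of the boto3 Batch client's `submit_job` call (`jobName`, `jobQueue`, `jobDefinition`, `dependsOn`). The queue name, job names, job definitions, and the `RecordingClient` stand-in are hypothetical and exist only so the sketch runs without AWS credentials.

```python
from typing import Any


def submit_chain(client: Any, job_queue: str, steps: list[tuple[str, str]]) -> list[str]:
    """Submit jobs in order; each job depends on the one submitted before it.

    `steps` is a list of (job_name, job_definition) pairs.
    `client` is any object with a boto3-style submit_job(**kwargs) method.
    """
    job_ids: list[str] = []
    for job_name, job_definition in steps:
        request: dict[str, Any] = {
            "jobName": job_name,
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
        }
        if job_ids:
            # The new job starts only after the previous job succeeds.
            request["dependsOn"] = [{"jobId": job_ids[-1]}]
        response = client.submit_job(**request)
        job_ids.append(response["jobId"])
    return job_ids


class RecordingClient:
    """Hypothetical offline stand-in that records requests instead of calling AWS."""

    def __init__(self) -> None:
        self.requests: list[dict[str, Any]] = []

    def submit_job(self, **kwargs: Any) -> dict[str, str]:
        self.requests.append(kwargs)
        return {"jobId": f"job-{len(self.requests)}"}


if __name__ == "__main__":
    demo = RecordingClient()
    ids = submit_chain(
        demo,
        "etl-queue",  # hypothetical queue name
        [("extract", "extract-def"), ("transform", "transform-def"), ("load", "load-def")],
    )
    print(ids)
    print(demo.requests[-1])
```

Against a real account, the same function would be called with `boto3.client("batch")` in place of the stand-in. More complex graphs with branching, retries, or conditional logic are usually expressed in Step Functions, as noted above.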

Azure — Azure Batch + Azure Data Factory Pipelines + Microsoft Fabric Azure Batch is the managed batch compute service: it manages VM pools, job scheduling, and task execution for HPC and parallel workloads. Azure Data Factory pipelines provide scheduled and event-triggered ETL batch workflows with support for parallel execution, dependency management, retry policies, and integration with Databricks, HDInsight, and SQL endpoints. Microsoft Fabric Data Pipelines (the Fabric-integrated version of ADF) is the recommended path for new Fabric deployments. Azure Logic Apps provides event-driven workflow automation for lighter-weight batch triggers.

OCI — OCI Data Flow (Spark batch jobs) + OCI Data Integration (batch ETL pipelines) OCI does not have a dedicated general-purpose batch compute service equivalent to AWS Batch or Azure Batch. Batch data processing on OCI is achieved through OCI Data Flow (Apache Spark batch jobs submitted on-demand), OCI Data Integration (scheduled ETL/ELT pipeline runs), OCI GoldenGate Data Transforms (batch ELT pipelines), and OCI Functions + OCI Scheduler for compute-light batch tasks. Oracle Database scheduled jobs (DBMS_SCHEDULER) are commonly used for database-tier batch processing.

GCP — Cloud Batch + Dataflow (batch mode) + Dataproc Serverless Cloud Batch (GA 2022) is GCP's managed batch compute service for containerized and script-based batch jobs: it manages VM provisioning, job queues, retries, and logging without cluster management. Cloud Dataflow in batch mode (Apache Beam) handles large-scale data processing pipelines: it auto-scales workers, handles backpressure, and shuts down after job completion. Dataproc Serverless batch runs Spark batch jobs without a persistent cluster. Cloud Composer (managed Apache Airflow) is the standard workflow orchestration layer for scheduling and monitoring complex multi-step batch pipelines across GCP services.
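
The DAG model that Airflow-based orchestrators use for multi-step batch pipelines can be illustrated without Airflow itself. The sketch below uses Python's standard-library `graphlib` to group tasks into waves, where every task in a wave has all of its upstream dependencies satisfied and could run in parallel. The task names and the `execution_waves` helper are hypothetical. This is not Cloud Composer or Airflow code, only the scheduling idea behind it.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks within one wave have no unmet dependencies.

    `dependencies` maps each task to the set of tasks it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unlock downstream tasks
    return waves


# Hypothetical pipeline: two extracts feed one transform, then load and refresh.
PIPELINE: dict[str, set[str]] = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}

if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, wave)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, but the dependency resolution follows the same principle: a task becomes eligible only when everything upstream of it has finished.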

Feature | AWS | Azure | OCI | GCP
Managed batch compute service | AWS Batch (ECS/EKS/Fargate) | Azure Batch (VM pools) | None native (use Data Flow / OCI Functions) | Cloud Batch (VMs / containers)
Batch ETL / data pipelines | Glue jobs + EMR transient | ADF Pipelines / Fabric Pipelines | OCI Data Integration + GoldenGate | Dataflow batch + Data Fusion
Workflow orchestration | Step Functions + EventBridge | ADF + Logic Apps | OCI Data Integration orchestration | Cloud Composer (Apache Airflow)
Serverless batch | Glue Serverless + EMR Serverless | Databricks Serverless + ADF Serverless IR | OCI Data Flow (per-job Spark) | Dataproc Serverless + Dataflow batch
HPC / parallel batch | AWS Batch + EC2 HPC instances | Azure Batch + HPC VM sizes | OCI HPC shapes + OCI Data Flow | Cloud Batch + N2 / HPC VMs
Job dependency management | Step Functions | ADF dependency / Logic Apps | OCI DI pipeline dependencies | Cloud Composer DAGs
Spot / preemptible batch | AWS Batch Spot Instances | Azure Batch Spot VMs | OCI Preemptible Instances | Cloud Batch Spot VMs

Key differentiators:

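  • Cloud Composer makes managed Apache Airflow the primary orchestration layer on GCP, giving it a rich ecosystem for complex multi-step batch pipelines built on the industry-standard DAG model. It is not the only managed Airflow offering among the four providers: AWS offers Amazon Managed Workflows for Apache Airflow (MWAA), and Azure offers managed Airflow through Azure Data Factory (Workflow Orchestration Manager).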
  • AWS Batch is the most mature and feature-complete dedicated batch compute service, with deep Step Functions integration for complex dependency graphs and native support for containerized workloads on Fargate and on GPU-backed EC2 instances.
  • OCI's lack of a native batch compute service (equivalent to AWS Batch or Azure Batch) means general-purpose parallel compute batch workloads must be composed from OCI Data Flow, OCI Functions, or VM-based solutions.
  • Azure Data Factory's batch pipeline model — with conditional execution, retry policies, parallel branches, and integration with Databricks and SQL endpoints — makes it the most data-engineer-friendly batch orchestration tool for Azure-native workloads.

Summary Comparison

Category | AWS | Azure | OCI | GCP
Data Warehouse | Amazon Redshift / Redshift Serverless | Azure Synapse Analytics / Microsoft Fabric | Autonomous Data Warehouse (ADW) | BigQuery
Streaming Ingest | Kinesis Data Streams | Azure Event Hubs | OCI Streaming | Cloud Pub/Sub
Managed Kafka | Amazon MSK / MSK Serverless | Event Hubs (Kafka-compatible) | OCI Streaming with Apache Kafka | Pub/Sub (Kafka connector)
Stream Processing | Managed Service for Apache Flink | Azure Stream Analytics / Fabric Real-Time | OCI GoldenGate (CDC + streaming) | Cloud Dataflow (Apache Beam)
ETL / Data Integration | AWS Glue | Azure Data Factory / Fabric Pipelines | OCI Data Integration / GoldenGate | Cloud Data Fusion / Dataform
Data Lake Storage | Amazon S3 | ADLS Gen2 / Fabric OneLake | OCI Object Storage | Google Cloud Storage
Data Lake Governance | AWS Lake Formation | Microsoft Purview | OCI Data Catalog | Dataplex
BI / Dashboarding | Amazon QuickSight | Microsoft Power BI | Oracle Analytics Cloud | Looker / Looker Studio
Data Catalog | AWS Glue Data Catalog + DataZone | Microsoft Purview | OCI Data Catalog | Dataplex Universal Catalog
Hadoop / Spark | Amazon EMR | Azure HDInsight + Azure Databricks | OCI Big Data + OCI Data Flow | Cloud Dataproc
Batch Compute | AWS Batch | Azure Batch | OCI Data Flow (no native batch) | Cloud Batch
Workflow Orchestration | Step Functions + EventBridge | ADF + Logic Apps | OCI DI Pipelines | Cloud Composer (Apache Airflow)

References