Reference

Cloud Services Cross-Reference: Analytics & Big Data

This document maps analytics and big data services across AWS, Azure, Oracle Cloud Infrastructure (OCI), and Google Cloud Platform (GCP). Coverage spans data warehousing, streaming, ETL/data integration, data lakes, business intelligence, data catalog and governance, distributed compute (Hadoop/Spark), and batch processing. All four providers offer competing managed services in each category, but their architectural models, pricing structures, and native ecosystem integrations differ substantially. GCP's BigQuery stands alone as a fully serverless, storage-compute-separated warehouse with no cluster management. Azure has shifted its strategic emphasis from Azure Synapse Analytics toward Microsoft Fabric, a SaaS-unified analytics platform. AWS continues to expand the open lakehouse pattern around Apache Iceberg across all analytics services. OCI differentiates on price and deep Oracle Database integration across the stack.

1. Data Warehousing

Data warehouse services provide managed, columnar-storage SQL query engines optimized for analytical workloads against large structured datasets.

AWS — Amazon Redshift Redshift offers two deployment modes: provisioned clusters (pay per node-hour) and Redshift Serverless (pay per RPU-hour at $0.36/RPU-hour, minimum 60-second billing, automatic capacity scaling). As of 2025, Redshift writes directly to Apache Iceberg tables, enabling open lakehouse patterns where Redshift, EMR, Athena, and Glue all share data in S3 without copying. Redshift Spectrum allows querying external S3 data from Redshift SQL. Redshift remains a capacity-oriented warehouse: provisioned clusters require explicit node sizing, and even in serverless mode billing follows the RPU capacity in use rather than individual queries.

Azure — Azure Synapse Analytics / Microsoft Fabric Azure Synapse Analytics is a PaaS unified analytics workspace combining SQL dedicated pools (provisioned data warehouse), SQL serverless pools (pay per TB queried), and Apache Spark pools. Microsoft Fabric, GA since late 2023, is the strategic successor: a SaaS platform that unifies data engineering, warehousing, real-time analytics, and BI under OneLake, a single data lake that stores tables in Delta Lake format (Parquet data files plus a transaction log). Synapse remains supported for existing workloads; new deployments should consider Fabric. Microsoft explicitly positions Fabric's Data Warehouse as "purpose-built for the demands of the 2020s." Synapse dedicated pools use a proprietary SQL Server-derived storage format; Fabric warehouses store data as Delta Lake.

OCI — Oracle Autonomous Data Warehouse (ADW) ADW is a fully autonomous cloud data warehouse that automates provisioning, patching, tuning, backup, and scaling without DBA intervention. In 2025 Oracle renamed the offering "Autonomous AI Lakehouse," reflecting native Apache Iceberg support and integration with object storage for open format data. ADW runs on Oracle Exadata infrastructure, providing high I/O throughput. Auto-indexing and auto-partitioning eliminate manual tuning. ADW integrates closely with Oracle Analytics Cloud for an end-to-end pipeline from warehouse to dashboard, and it is available in both serverless and dedicated deployment options.

GCP — BigQuery BigQuery is fundamentally different from the other three: it is architecturally serverless with complete storage-compute separation via Google's petabit-scale internal network. There are no clusters, no node provisioning decisions, and no capacity management. Compute scales automatically per query; storage scales independently. Pricing is on-demand ($5 per TB queried) or capacity-based (slots). BigQuery Serverless Spark (GA 2025) allows running Spark workloads directly inside BigQuery Studio without separate cluster provisioning. BigQuery supports open table formats (Iceberg, Delta, Hudi) via BigLake.

Feature	AWS Redshift	Azure Synapse / Fabric	OCI ADW	GCP BigQuery
Deployment model	Provisioned clusters + Serverless	PaaS (Synapse) / SaaS (Fabric)	Serverless + Dedicated (Exadata)	Fully serverless, no clusters
Serverless pricing	$0.36/RPU-hour (min 60s)	Per TB queried (serverless pools)	Pay per OCPU/storage	$5/TB queried (on-demand)
Storage format	Proprietary + Iceberg (S3)	Proprietary (Synapse) / Delta Lake (Fabric)	Proprietary + Iceberg (Object Storage)	Capacitor (native) + open formats via BigLake
Open table support	Apache Iceberg	Delta Lake (Fabric native)	Apache Iceberg	Iceberg, Delta, Hudi via BigLake
Auto-scaling	Yes (serverless RPU auto-scale)	Yes (serverless pools)	Yes (autonomous)	Always automatic
Cluster management required	Yes (provisioned) / No (serverless)	Yes (Synapse pools) / No (Fabric)	No (fully autonomous)	Never
Integrated ML	Redshift ML (SageMaker)	Synapse Analytics + Azure ML	AutoML, OML in-database	BigQuery ML (native SQL)
Unique differentiator	Deep Iceberg lakehouse ecosystem	Fabric OneLake unified SaaS	Oracle Exadata performance, autonomous ops	No-infrastructure serverless; per-query pricing

Key differentiators:

BigQuery's serverless model eliminates all infrastructure decisions; it is the only warehouse where you never touch a cluster or capacity setting.
ADW's autonomous operations (auto-indexing, auto-patching, auto-scaling) differentiate OCI for organizations running Oracle workloads who want zero DBA overhead.
Microsoft Fabric's OneLake and Delta Lake foundation unifies BI, engineering, and warehousing in one SaaS subscription, but adopting it means migrating away from existing Synapse investments.
Redshift's tight Iceberg integration across the entire AWS analytics ecosystem (EMR, Athena, Glue) provides the broadest multi-engine open lakehouse on a single cloud.
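
The two serverless pricing models in the table bill on different axes: Redshift Serverless charges for compute time (RPU-hours, with a 60-second minimum), while BigQuery on-demand charges for data scanned. The following is a minimal sketch of that arithmetic using only the rates quoted in this section. The function names and the example workload are hypothetical, and real bills depend on region, discounts, and free tiers.

```python
# Rates as quoted in this document; treat them as illustrative inputs.
REDSHIFT_RPU_HOUR_USD = 0.36
REDSHIFT_MIN_BILL_SECONDS = 60
BIGQUERY_ON_DEMAND_USD_PER_TB = 5.00


def redshift_serverless_cost(rpus: int, seconds: float) -> float:
    """Cost of running `rpus` RPUs for `seconds`, honoring the 60-second minimum."""
    billed_seconds = max(seconds, REDSHIFT_MIN_BILL_SECONDS)
    return rpus * (billed_seconds / 3600) * REDSHIFT_RPU_HOUR_USD


def bigquery_on_demand_cost(tb_scanned: float) -> float:
    """Cost of an on-demand query that scans `tb_scanned` terabytes."""
    return tb_scanned * BIGQUERY_ON_DEMAND_USD_PER_TB


if __name__ == "__main__":
    # A 10-second query on 32 RPUs is billed as 60 seconds.
    print(f"Redshift Serverless: ${redshift_serverless_cost(32, 10):.3f}")
    # A query scanning 0.2 TB is billed by bytes read, not by runtime.
    print(f"BigQuery on-demand:  ${bigquery_on_demand_cost(0.2):.2f}")
```

Under these assumptions the short Redshift query costs about $0.19 regardless of how little data it touches, while the BigQuery query costs $1.00 regardless of how long it runs, which is why scan-heavy and compute-heavy workloads can favor different models.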

2. Streaming / Real-Time Data

Streaming services ingest, transport, and process high-throughput event data in real time, supporting event-driven architectures, IoT pipelines, and real-time analytics.

AWS — Amazon Kinesis + Amazon MSK The Kinesis family covers four distinct services: Kinesis Data Streams (managed real-time data streaming, millisecond latency, up to 1-year retention), Amazon Data Firehose (formerly Kinesis Data Firehose; loads streaming data directly to S3, Redshift, OpenSearch, and third-party destinations without writing consumer code), Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics; stream processing with SQL or Java/Python), and Kinesis Video Streams (video ingest and playback). Amazon MSK (Managed Streaming for Apache Kafka) is the fully managed Kafka offering for teams that need Kafka API compatibility. MSK Serverless provides Kafka without cluster provisioning.

Azure — Azure Event Hubs + Azure Stream Analytics Azure Event Hubs is a fully managed big data streaming platform and event ingestion service with Apache Kafka protocol compatibility (existing Kafka producers and consumers connect through configuration changes alone, with no code changes). Event Hubs Capture writes raw streams directly to Azure Blob Storage or ADLS Gen2. Azure Stream Analytics is the managed real-time query engine: it reads from Event Hubs (or IoT Hub), applies SQL-like queries with windowing functions, and writes results to storage, databases, or Power BI. Microsoft Fabric Real-Time Intelligence (formerly Real-Time Analytics) provides an integrated streaming analytics workload within the Fabric SaaS platform.

OCI — OCI Streaming + OCI Streaming with Apache Kafka OCI Streaming is the native managed event streaming service compatible with Apache Kafka APIs. OCI offers a separate fully managed Kafka service — "Streaming with Apache Kafka" — which is 100% API-compatible with Apache Kafka, allowing existing Kafka applications to migrate without code changes. Oracle claims up to 31% lower cost than Amazon MSK and up to 73% lower cost than Confluent. OCI GoldenGate provides real-time change data capture (CDC) from databases (Oracle, MySQL, PostgreSQL, SQL Server) for continuous replication and event generation.

GCP — Pub/Sub + Dataflow Cloud Pub/Sub is Google's fully managed, serverless real-time messaging service: global, durable, with at-least-once delivery. It underpins Google's internal messaging at scale. Pub/Sub does not implement the Kafka protocol; it interoperates with Kafka through a Kafka connector. Dataflow is Google's managed service for running Apache Beam pipelines, which unify batch and streaming data processing so that the same pipeline code handles both. Pub/Sub feeds Dataflow, which processes and writes results to BigQuery or Cloud Storage. Datastream is a serverless CDC and replication service for reading changes from operational databases (Oracle, MySQL, PostgreSQL, AlloyDB) into BigQuery or Cloud Storage in real time.
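
The windowing functions that stream processors such as Stream Analytics and Dataflow apply can be illustrated with a tumbling (fixed, non-overlapping) window count. This is a plain-Python sketch of the concept, not either service's API; the event shape (timestamp in seconds, key) and the 60-second window are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable


def tumbling_window_counts(
    events: Iterable[tuple[float, str]], window_seconds: int
) -> dict[tuple[int, str], int]:
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    counts: dict[tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp maps to exactly one window: the one whose start is
        # the largest multiple of window_seconds not exceeding the timestamp.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0.5, "sensor-a"), (59.9, "sensor-a"), (60.0, "sensor-a"), (61.0, "sensor-b")]
    for (start, key), count in sorted(tumbling_window_counts(sample, 60).items()):
        print(f"window starting at {start}s, {key}: {count}")
```

Because the function only iterates over its input, the same code accepts a finite list (a batch) or a generator fed by a live source, which is a small-scale analogue of a unified batch and streaming model. Real engines add what this sketch omits: late-data handling, watermarks, and distributed state.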

Feature	AWS	Azure	OCI	GCP
Managed Kafka	Amazon MSK / MSK Serverless	Event Hubs (Kafka-compatible)	OCI Streaming with Apache Kafka	Pub/Sub (Kafka connector available)
Native streaming ingest	Kinesis Data Streams	Event Hubs	OCI Streaming	Pub/Sub
Stream processing engine	Managed Service for Apache Flink	Stream Analytics / Fabric Real-Time Intelligence	OCI GoldenGate (CDC/streaming)	Dataflow (Apache Beam)
Serverless stream ingest	Kinesis Data Streams (serverless scaling)	Event Hubs (Kafka-compatible, auto-inflate)	OCI Streaming (serverless)	Pub/Sub (always serverless)
CDC / database replication	AWS DMS + Kinesis	Event Hubs + Azure DMS	OCI GoldenGate	Datastream
Delivery to data warehouse	Data Firehose → Redshift / S3	Event Hubs Capture → ADLS / Synapse	OCI GoldenGate → ADW	Pub/Sub → Dataflow → BigQuery
Video streaming	Kinesis Video Streams	Azure Video Indexer	None native	None native

Key differentiators:

AWS has the most granular streaming service decomposition: separate managed services for ingest (Kinesis Data Streams), delivery (Firehose), stream SQL (Flink), and Kafka (MSK).
OCI GoldenGate is uniquely positioned for Oracle-to-Oracle CDC scenarios and supports ZeroETL Mirror for direct, low-latency replication into ADW.
GCP Dataflow's Apache Beam model treats batch as a bounded special case of streaming, so identical pipeline code runs both batch and streaming workloads without modification.
Azure Event Hubs' native Kafka API compatibility allows Kafka workloads to migrate to Azure without application changes, making it the lowest-friction Kafka migration path.
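
In practice, pointing an existing Kafka client at Event Hubs is a configuration change rather than a code change. The sketch below builds the client settings for the Event Hubs Kafka endpoint, which listens on port 9093 and authenticates with SASL PLAIN over TLS using the literal user name `$ConnectionString`. The property names follow the librdkafka style used by clients such as confluent-kafka; the helper name, namespace, and connection string are hypothetical placeholders, and the exact keys should be checked against the client library in use.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict[str, str]:
    """Build Kafka client settings that target an Event Hubs namespace."""
    return {
        # The Kafka endpoint of an Event Hubs namespace listens on port 9093.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The user name is this literal string; the secret is the connection string.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


if __name__ == "__main__":
    config = event_hubs_kafka_config("example-namespace", "<connection-string>")
    for key, value in config.items():
        print(f"{key} = {value}")
```

The resulting dictionary would be passed to the producer or consumer constructor in place of the original broker settings; topic names map to event hub names within the namespace.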

3. ETL / Data Integration

ETL and data integration services extract data from source systems, transform it, and load it into analytical targets. Modern services increasingly offer visual no-code/low-code pipeline design alongside code-based options.

AWS — AWS Glue AWS Glue is the primary serverless ETL service: it auto-generates ETL code (PySpark or Scala), provides a Data Catalog for schema discovery and metadata management, and includes Glue DataBrew for visual no-code data preparation. Glue crawlers auto-discover schema from S3, databases, and other sources. Glue Studio provides a visual drag-and-drop job authoring interface. AWS Data Pipeline is an older orchestration service for moving data between AWS services; new workloads should prefer Glue or Step Functions. AWS AppFlow handles SaaS-to-AWS data integration (Salesforce, SAP, etc.).

Azure — Azure Data Factory + Microsoft Fabric Data Engineering Azure Data Factory (ADF) is the cloud-scale ETL and data orchestration service with 90+ built-in connectors, a visual pipeline canvas, and support for mapping data flows (code-free Spark-based transformations). ADF integrates natively with Azure Synapse and Azure Databricks. Within Microsoft Fabric, the Data Engineering workload (notebooks, Spark jobs, data pipelines using a Fabric-native version of ADF pipelines) is the preferred path for new Fabric deployments. Azure Databricks — a first-party Azure service based on Databricks — is widely used for large-scale PySpark ETL.

OCI — OCI Data Integration + Oracle GoldenGate OCI Data Integration is Oracle's fully managed serverless ETL/ELT service: it offers visual no-code pipeline design, supports bulk data loading into ADW and Object Storage, and includes data quality and lineage capabilities. OCI GoldenGate's Data Transforms feature (added 2023) extends GoldenGate with batch ETL/ELT processing via drag-and-drop pipelines, enabling a single GoldenGate deployment to handle both real-time CDC and batch data integration. Oracle Data Integrator (ODI) is the on-premises ETL product, available as a cloud edition for lift-and-shift scenarios.

GCP — Cloud Data Fusion + Dataform + Dataprep Cloud Data Fusion is GCP's fully managed, code-free ETL service based on the open-source CDAP framework. It supports 150+ pre-built connectors, visual pipeline design, and generates Dataproc (Spark) or Dataflow jobs for execution. Dataprep by Trifacta (integrated into GCP as Google Cloud Dataprep) provides interactive visual data wrangling and generates Dataflow jobs; it is best suited for smaller to medium datasets and ad-hoc data preparation. Dataform is a SQL-based transformation tool for building and testing data models in BigQuery (similar to dbt), integrated into BigQuery Studio as of 2023.
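
SQL transformation tools in the Dataform and dbt style resolve the references between models into a dependency graph and run each model only after the models it reads from. A minimal sketch of that ordering step using the Python standard library follows; the model names are hypothetical, and this is not Dataform's own API.

```python
from graphlib import TopologicalSorter


def execution_order(models: dict[str, set[str]]) -> list[str]:
    """Return model names ordered so that every model follows its dependencies.

    `models` maps each model name to the set of models it reads from.
    A circular reference raises graphlib.CycleError.
    """
    return list(TopologicalSorter(models).static_order())


if __name__ == "__main__":
    models = {
        "raw_orders": set(),
        "orders_clean": {"raw_orders"},
        "daily_revenue": {"orders_clean"},
    }
    print(execution_order(models))
```

Detecting cycles before anything runs is the main benefit of building the graph up front: a model that indirectly depends on itself is reported as an error rather than producing a partial refresh.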

Feature	AWS Glue	Azure Data Factory / Fabric	OCI Data Integration / GoldenGate	GCP Data Fusion / Dataform
Visual pipeline design	Glue Studio	ADF Canvas / Fabric Pipelines	OCI DI Visual Designer / GoldenGate Data Transforms	Cloud Data Fusion Canvas
Serverless execution	Yes (Glue serverless Spark)	Yes (Mapping Data Flows on serverless Spark)	Yes (OCI DI serverless)	Yes (Dataflow backend)
Code-based ETL	PySpark / Scala	PySpark (Databricks)	PySpark (OCI Data Flow)	Python / SQL (Dataflow / Dataform)
Built-in connectors	100+	90+	Oracle-focused + common databases	150+ (Data Fusion)
SaaS integration	AWS AppFlow	ADF + Power Platform connectors	Oracle Integration Cloud (OIC)	Cloud Data Fusion connectors
Data wrangling / prep	Glue DataBrew	ADF Wrangling Data Flows / Power Query	OCI GoldenGate Data Transforms	Cloud Dataprep (Trifacta)
SQL transformation	dbt (third-party)	dbt (third-party) or Synapse pipelines	OCI GoldenGate ELT	Dataform (native, BigQuery-integrated)
Schema crawling	Glue Crawlers → Data Catalog	ADF with Purview integration	OCI Data Catalog integration	Dataplex auto-discovery

Key differentiators:

AWS Glue Crawlers + Data Catalog provide the most automated schema discovery and metadata management, tightly coupled to the overall AWS analytics ecosystem.
OCI's combination of OCI Data Integration (batch ETL) and OCI GoldenGate (real-time CDC + batch ELT in one service) is uniquely suited for Oracle Database source systems.
GCP's Dataform is native to BigQuery Studio, making SQL-based transformation and dbt-style data modeling a first-class citizen with no additional tooling.
Azure Data Factory's 90+ connectors and Fabric's pipeline integration make it the most enterprise-connector-rich option for heterogeneous SaaS and on-premises sources.
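
Schema discovery of the kind that crawlers and harvesting jobs perform comes down to sampling records and inferring a column-to-type mapping, widening a type when samples disagree. The sketch below shows that idea in plain Python. It is an illustration under simplified assumptions (flat records, Python type names, integers widening to floats and anything else falling back to strings), not the algorithm any of these services actually uses.

```python
def infer_schema(records: list[dict]) -> dict[str, str]:
    """Infer a column -> type-name mapping from sample records."""
    schema: dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            observed = type(value).__name__
            previous = schema.get(column)
            if previous is None or previous == observed:
                schema[column] = observed
            elif {previous, observed} == {"int", "float"}:
                # Mixed integers and floats widen to float.
                schema[column] = "float"
            else:
                # Any other disagreement falls back to the most general type.
                schema[column] = "str"
    return schema


if __name__ == "__main__":
    samples = [
        {"id": 1, "amount": 2},
        {"id": 2, "amount": 2.5, "note": "priority"},
        {"id": "3"},
    ]
    print(infer_schema(samples))
```

Columns that appear in only some records (such as `note` here) are still added to the schema, which mirrors how a catalog accumulates columns as new files are scanned.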

4. Data Lakes

Data lake services provide scalable object storage with metadata, access control, and governance features for storing raw and processed data in open formats.

AWS — Amazon S3 + AWS Lake Formation Amazon S3 is the underlying storage for all AWS data lake architectures. AWS Lake Formation is the governance and security layer on top of S3: it provides centralized data lake setup, column- and row-level fine-grained access control via Lake Formation permissions (extending IAM), tag-based access control, cross-account data sharing, and integration with AWS Glue Data Catalog for schema management. Lake Formation does not replace S3; it governs access to data stored in S3. Governed tables in Lake Formation support ACID transactions on S3.

Azure — Azure Data Lake Storage Gen2 (ADLS Gen2) + Microsoft Fabric OneLake ADLS Gen2 extends Azure Blob Storage with hierarchical namespace (directory/file semantics), POSIX ACLs, Azure RBAC, AES-256 encryption, and petabyte-scale throughput optimized for analytics engines (Spark, Hive, Presto). It is the standard storage layer for Azure analytics workloads (Synapse, Databricks, HDInsight). Microsoft Fabric OneLake is the next-generation data lake built on top of ADLS Gen2: a single, unified data lake per Fabric tenant, where all Fabric workloads (warehouses, lakehouses, notebooks) store data as Delta Lake Parquet automatically. OneLake eliminates data silos by providing one storage location accessible to all Fabric engines.

OCI — OCI Object Storage + Oracle Intelligent Data Lake (preview) OCI Object Storage is the S3-compatible object store underpinning OCI data lake architectures. It supports Apache Iceberg table format, AES-256 encryption, IAM-based access control, and lifecycle management. OCI Object Storage egress is free within OCI (10 TB/month external egress included), making it notably cheaper than AWS S3 or ADLS Gen2 for data-intensive workloads. Oracle Intelligent Data Lake (announced at Oracle CloudWorld 2024, limited availability 2025) is an emerging managed data lake platform integrating a unified catalog, Apache Spark and Flink processing, and Jupyter Notebook for data science in one experience on top of Object Storage.

GCP — Cloud Storage + BigLake + Dataplex Google Cloud Storage (GCS) is the object storage layer. BigLake is GCP's unified storage engine that spans structured data in BigQuery and open-format data in GCS, providing consistent fine-grained access control (column- and row-level), caching, and query acceleration regardless of whether data is in BigQuery native storage or external files (Iceberg, Delta, Parquet, ORC). Dataplex is the data mesh and data lake governance service: it auto-discovers data across GCS and BigQuery, applies unified policies, creates logical data zones (raw, curated, production), and enables cross-project data quality and lineage.

Feature	AWS S3 + Lake Formation	Azure ADLS Gen2 + OneLake	OCI Object Storage + IDL	GCP GCS + BigLake + Dataplex
Underlying storage	Amazon S3	Azure Blob Storage (ADLS Gen2)	OCI Object Storage	Google Cloud Storage
Hierarchical namespace	No (flat S3 prefix)	Yes (POSIX ACLs, directory model)	No (flat object model)	No (flat bucket model)
Fine-grained access control	Lake Formation (column/row/tag)	Azure RBAC + POSIX ACLs	OCI IAM Policies	BigLake (column/row via policy tags)
Open table format	Apache Iceberg (Lake Formation governed)	Delta Lake (Fabric OneLake)	Apache Iceberg	Iceberg, Delta, Hudi (BigLake)
Unified multi-engine access	Via Glue Catalog + Lake Formation	OneLake (all Fabric engines)	OCI Data Catalog + Object Storage	BigLake (BigQuery + GCS)
Data mesh / governance	Glue + Lake Formation + DataZone	Fabric OneLake + Purview	OCI Data Catalog + IDL (preview)	Dataplex
Egress costs	Standard S3 egress rates	Standard Azure egress rates	10 TB/month free external egress	Standard GCS egress rates
Managed data lake SaaS	Lake Formation (governance only)	Fabric OneLake (full SaaS)	IDL (preview)	Dataplex (governance)

Key differentiators:

Microsoft Fabric OneLake is the most opinionated data lake solution: a single unified SaaS lake per tenant, automatically shared across all Fabric workloads with zero copy, eliminating the integration overhead of separate storage accounts.
AWS Lake Formation provides the most mature and granular column- and row-level access control for S3-based data lakes, with tag-based policies and cross-account sharing.
OCI Object Storage's egress pricing advantage (effectively free within OCI) is significant for large-scale data movement workloads common in analytics pipelines.
GCP BigLake's ability to enforce the same column- and row-level policies on both BigQuery native tables and external GCS files (Iceberg/Parquet) is architecturally unique and simplifies governance across the storage boundary.
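
Column- and row-level access control, as offered by Lake Formation and BigLake, can be thought of as a filter applied between storage and the querying engine: rows the caller may not see are dropped, and columns the caller may not see are removed from the rows that remain. The following sketch models that behavior on in-memory records. The `Policy` class and field names are hypothetical and do not correspond to either service's policy format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    """Hypothetical policy: the columns a principal may read and a row predicate."""
    allowed_columns: set[str]
    row_filter: Callable[[dict], bool]


def apply_policy(rows: list[dict], policy: Policy) -> list[dict]:
    """Return only permitted rows, each restricted to the permitted columns."""
    return [
        {column: value for column, value in row.items() if column in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


if __name__ == "__main__":
    rows = [
        {"region": "EU", "email": "a@example.com", "amount": 10},
        {"region": "US", "email": "b@example.com", "amount": 20},
    ]
    eu_analyst = Policy({"region", "amount"}, lambda row: row["region"] == "EU")
    print(apply_policy(rows, eu_analyst))
```

Enforcing the policy in one place, rather than in each engine, is what allows the same rule to cover both native tables and external files.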

5. Business Intelligence

Business intelligence (BI) services provide managed dashboarding, reporting, and self-service analytics with governed semantic layers and embedded analytics capabilities.

AWS — Amazon QuickSight QuickSight is AWS's serverless, pay-per-session cloud BI service. It uses the SPICE engine (Super-fast, Parallel, In-memory Calculation Engine) to cache and accelerate queries, enabling sub-second dashboard load times for large datasets. QuickSight connects natively to Redshift, S3, Athena, RDS, and third-party sources. The Q feature provides natural language query against datasets. QuickSight Embedded enables BI embedding in custom applications. Pricing is per-author (fixed monthly) or per-reader-session ($0.30/session), making it cost-effective at scale.

Azure — Microsoft Power BI Power BI is Microsoft's industry-leading BI platform with the broadest enterprise adoption of the four providers. It is a SaaS service integrated into Microsoft 365 and natively connected to all Azure data sources, Dynamics 365, and hundreds of third-party connectors. Power BI Premium provides dedicated capacity for large organizations and enables paginated reports, deployment pipelines, and XMLA endpoint access for third-party tools. Within Microsoft Fabric, Power BI is a first-class workload: Direct Lake mode connects Power BI semantic models directly to OneLake Delta tables without data import, eliminating the traditional refresh cycle.

OCI — Oracle Analytics Cloud (OAC) Oracle Analytics Cloud is Oracle's managed BI and analytics platform on OCI. It provides self-service data discovery, dashboards, pixel-perfect reports, augmented analytics (AI-generated insights), and integration with ADW and Oracle Database. OAC supports Essbase for multidimensional OLAP analysis. The Analytics AI Assistant (GA 2025) enables natural language queries and AI-generated visualizations. OAC can be provisioned alongside ADW through OCI Resource Analytics for an integrated BI+data warehouse deployment.

GCP — Looker + Looker Studio Looker is Google's enterprise BI platform acquired in 2020, centered on LookML, a YAML-like semantic modeling language that defines business metrics, dimensions, and relationships centrally. LookML ensures consistent metric definitions across all dashboards and self-service reports. Looker integrates natively with BigQuery as its primary data source and connects to all major databases. Looker Studio (formerly Data Studio) is Google's free, lightweight reporting tool for building shareable dashboards from a wide range of sources; it lacks the governed semantic layer of Looker proper. Looker Studio Pro adds organizational sharing and SLA support.
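
The idea behind a governed semantic layer is that each metric is defined once and every query is generated from that definition, so two dashboards cannot disagree about what "revenue" means. The sketch below illustrates the principle in Python. It is not LookML syntax, and the measure, dimension, and table names are hypothetical.

```python
# Central definitions: each business term maps to exactly one SQL expression.
MEASURES = {
    "total_revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}
DIMENSIONS = {
    "order_month": "DATE_TRUNC(order_date, MONTH)",
}


def compile_query(table: str, dimensions: list[str], measures: list[str]) -> str:
    """Generate SQL from the shared definitions; unknown names raise KeyError."""
    select = [f"{DIMENSIONS[name]} AS {name}" for name in dimensions]
    select += [f"{MEASURES[name]} AS {name}" for name in measures]
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if dimensions:
        # Group by the positions of the dimension columns in the SELECT list.
        positions = ", ".join(str(index + 1) for index in range(len(dimensions)))
        sql += f" GROUP BY {positions}"
    return sql


if __name__ == "__main__":
    print(compile_query("orders", ["order_month"], ["total_revenue"]))
```

Because consumers request metrics by name rather than writing the aggregation themselves, changing a definition in one place changes it for every report that uses it.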

Feature	AWS QuickSight	Azure Power BI	OCI Analytics Cloud	GCP Looker / Looker Studio
In-memory caching	SPICE engine	Import mode (VertiPaq)	OAC in-memory caching	LookML-defined caching / BigQuery materialization
Semantic / metric layer	QuickSight datasets	Power BI datasets	OAC subject areas	LookML (Looker)
Natural language query	QuickSight Q	Power BI Q&A	AI Assistant (GA 2025)	Looker conversational analytics
Self-service / free tier	Reader session pricing	Free Desktop / Power BI Free	None (OAC is paid)	Looker Studio (free)
Embedded analytics	QuickSight Embedded	Power BI Embedded	OAC embedded	Looker Embedded
OLAP / multidimensional	Limited	Analysis Services	Oracle Essbase	None native
Native warehouse connection	Redshift, Athena, S3	Azure Synapse, ADX, Fabric OneLake	ADW, Oracle Database	BigQuery (primary), all via connectors
Pricing model	Per author/reader-session	Per user or capacity (Premium)	OCPU-hour (fixed capacity)	Looker: per user; Looker Studio: free

Key differentiators:

Power BI has the largest installed base and deepest Microsoft 365 integration; Fabric Direct Lake mode eliminates the data import latency that traditionally made large datasets slow in Power BI.
Looker's LookML semantic layer is architecturally unique among cloud-native BI tools: business logic is defined once in code, versioned in Git, and enforced consistently across all consumers.
OAC with Essbase provides OLAP multidimensional analysis capabilities that the other three providers lack natively, relevant for financial planning and enterprise performance management use cases.
QuickSight's per-session reader pricing ($0.30/session) is the most cost-effective model for large numbers of occasional report consumers.
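
Whether per-session pricing is cheaper than per-user licensing depends on how often each reader opens a dashboard. The sketch below computes the monthly cost at the $0.30 session rate quoted above and the break-even number of sessions against a per-user price. The per-user price is a hypothetical input, not a quoted rate for any product, and any monthly cap on reader charges is ignored.

```python
SESSION_PRICE_USD = 0.30  # reader session rate as quoted in this document


def monthly_reader_cost(readers: int, sessions_per_reader: int) -> float:
    """Total monthly cost when every reader is billed per session."""
    return readers * sessions_per_reader * SESSION_PRICE_USD


def break_even_sessions(per_user_monthly_usd: float) -> float:
    """Sessions per reader per month at which per-session cost equals a per-user license."""
    return per_user_monthly_usd / SESSION_PRICE_USD


if __name__ == "__main__":
    print(f"1,000 readers at 4 sessions each: ${monthly_reader_cost(1000, 4):,.2f}")
    # Hypothetical per-user license price, used only for comparison.
    print(f"Break-even against $12/user: {break_even_sessions(12.0):.0f} sessions")
```

Readers who open dashboards less often than the break-even figure are cheaper under session pricing; daily users are usually cheaper under a per-user or capacity license.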

6. Data Catalog / Governance

Data catalog and governance services provide metadata management, data discovery, lineage tracking, data quality monitoring, and access policy enforcement across data estates.

AWS — AWS Glue Data Catalog + Amazon DataZone + AWS Lake Formation AWS Glue Data Catalog is the central metadata repository for all AWS analytics services (Glue, Athena, Redshift Spectrum, EMR). Crawlers populate the catalog automatically from S3, databases, and streaming sources. Lake Formation uses the Glue Data Catalog as its underlying schema store and adds fine-grained access control, data governance policies, and cross-account sharing. Amazon DataZone (GA 2023) is the data mesh governance service: it provides a business catalog with a personalized search portal, data products, subscription-based access workflows, and lineage across AWS, on-premises, and SaaS sources.

Azure — Microsoft Purview Microsoft Purview (formerly Azure Purview) is the unified data governance and compliance platform covering the entire Microsoft data estate: Azure data services, Microsoft 365, Power BI, SQL Server on-premises, and multi-cloud sources. It provides automated data discovery and classification, end-to-end data lineage visualization, sensitivity labels, business glossary, and integration with Microsoft Defender for data security. Purview integrates directly into Fabric workspaces for unified governance of OneLake data.

OCI — OCI Data Catalog OCI Data Catalog is Oracle's managed metadata repository for OCI data assets (Object Storage, ADW, databases, OCI Big Data). It provides a searchable inventory of enterprise data assets, automated schema harvesting via "harvesting" jobs (analogous to crawlers), business glossary, data lineage, and tag-based discovery. OCI Data Catalog integrates with OCI Data Integration and OCI Data Flow for pipeline-level lineage. As of 2025, OCI Data Catalog capabilities are being integrated into the Oracle Intelligent Data Lake platform.

GCP — Dataplex + Analytics Hub Dataplex is GCP's intelligent data fabric and governance service: it discovers data across GCS buckets and BigQuery datasets, organizes data into Lakes, Zones, and Assets, applies unified IAM and VPC Service Controls policies, enables cross-project data quality rules, and tracks lineage. Dataplex Universal Catalog (previously Cloud Data Catalog) provides searchable metadata for BigQuery, GCS, Pub/Sub, and external sources. Analytics Hub is GCP's data exchange marketplace: organizations publish BigQuery datasets as listings, and subscribers access the data through read-only linked datasets without copying data, enabling large-scale governed data sharing and monetization.

Feature	AWS Glue Data Catalog + DataZone	Azure Purview	OCI Data Catalog	GCP Dataplex + Analytics Hub
Auto-discovery / crawling	Glue Crawlers (S3, databases, streaming)	Automated scanning (Azure + multi-cloud)	Harvesting jobs (OCI assets)	Dataplex auto-discovery
Lineage tracking	DataZone lineage	Purview end-to-end lineage	OCI DI / Data Flow lineage	Dataplex lineage (BigQuery, Dataflow, Spark)
Business glossary	DataZone glossary	Purview business glossary	OCI Data Catalog glossary	Dataplex business glossary
Data quality	AWS Glue Data Quality	Purview Data Health	OCI Data Catalog quality rules	Dataplex data quality tasks
Multi-cloud / hybrid	DataZone (AWS + on-prem + SaaS)	Purview (Azure + AWS + GCP + on-prem)	OCI assets only	Dataplex (GCP + BigQuery connector)
Data sharing / exchange	AWS Data Exchange	Purview + Azure Data Share	No native exchange	Analytics Hub (BigQuery data sharing)
Access governance	Lake Formation policies	Purview + Azure RBAC	OCI IAM + Data Catalog tags	BigLake + Dataplex policies
Sensitivity classification	Amazon Macie (S3)	Purview sensitivity labels (Microsoft 365 labels)	OCI Cloud Guard	Sensitive Data Protection (DLP)

Key differentiators:

Microsoft Purview has the broadest multi-cloud and hybrid coverage, extending governance to AWS, GCP, SAP, and on-premises sources from a single pane, and is uniquely integrated with Microsoft 365 compliance and sensitivity labels.
GCP Analytics Hub is a native data exchange that enables cross-organization BigQuery data sharing through zero-copy linked datasets, supporting both internal data products and commercial data monetization.
AWS DataZone's data product and subscription model is the most structured approach to data mesh governance, enforcing access workflows and subscriptions across organizational boundaries.
OCI Data Catalog's integration scope is narrower than the other three, primarily optimized for OCI-native assets; multi-cloud governance scenarios require third-party tooling.
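
Lineage tracking in any of these catalogs amounts to recording which assets each asset was derived from and then walking that graph on demand, for example to find every source that feeds a dashboard. A minimal sketch of the upstream walk follows; the asset names and the dictionary representation are hypothetical and do not reflect any catalog's storage format.

```python
from collections import deque


def upstream_sources(lineage: dict[str, list[str]], asset: str) -> set[str]:
    """Return every asset that `asset` depends on, directly or indirectly.

    `lineage` maps an asset to the assets it was derived from.
    """
    seen: set[str] = set()
    queue = deque(lineage.get(asset, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            # Continue the breadth-first walk through the parent's own inputs.
            queue.extend(lineage.get(parent, []))
    return seen


if __name__ == "__main__":
    lineage = {
        "revenue_dashboard": ["daily_revenue"],
        "daily_revenue": ["orders_clean", "fx_rates"],
        "orders_clean": ["raw_orders"],
    }
    print(sorted(upstream_sources(lineage, "revenue_dashboard")))
```

The same traversal run in the opposite direction (from a source to everything derived from it) answers the impact-analysis question of what breaks when a table changes.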

7. Hadoop / Spark (Distributed Compute)

Managed big data processing services provide Hadoop and Apache Spark clusters for large-scale distributed data processing, eliminating the need to manually provision and configure worker nodes.

AWS — Amazon EMR Amazon EMR (Elastic MapReduce) is the long-standing managed Hadoop and Spark platform supporting Hive, Presto, HBase, Flink, Hudi, and dozens of other open-source frameworks. Deployment options: EMR on EC2 (full cluster control), EMR on EKS (Spark jobs on EKS pods), and EMR Serverless (submit Spark or Hive jobs without managing clusters; auto-provisions worker capacity per job, charges per vCPU-second and GB-second). EMR Serverless now supports serverless storage (eliminating local disk provisioning), reducing costs by up to 20% and preventing disk-related job failures. Persistent clusters, transient job clusters, and spot-optimized instance fleets are all supported.

Azure — Azure HDInsight + Azure Databricks Azure HDInsight is Microsoft's managed Hadoop service (Hadoop, Spark, Kafka, HBase, Storm clusters). HDInsight 5.0 was retired March 31, 2025; HDInsight 5.1 is the current supported version, with strong guidance to migrate to Azure Databricks or Microsoft Fabric for new workloads. Azure Databricks (a first-party Azure service, jointly engineered with Databricks) is the strategic platform for large-scale PySpark, Delta Lake, and ML workloads on Azure. Databricks offers serverless compute, job clusters, SQL warehouses, and the Unity Catalog for data governance.

OCI — OCI Big Data Service + OCI Data Flow OCI Big Data Service is the managed Hadoop cluster service: it provisions clusters running Oracle Distribution including Apache Hadoop (ODAH) with Spark, Hive, Kafka, Hue, and other components, and it supports auto-scaling and mixed shape clusters (general compute + high-performance storage). OCI Data Flow is OCI's managed Apache Spark service: submit Spark applications (Python, Java, Scala, SQL) without managing cluster infrastructure. Data Flow provisions compute per job, auto-scales, and shuts down after job completion, with pricing per OCPU-minute. It integrates natively with Object Storage and ADW.

GCP — Cloud Dataproc Cloud Dataproc is Google's managed Hadoop and Spark service, supporting Hadoop, Spark, Flink, Hive, and Presto on GCE VMs or GKE. Dataproc Serverless (for Spark batches and interactive notebooks) provisions Spark workers per job without a persistent cluster, charges per vCPU-second and GB-second, and is the preferred path for new Spark workloads. Dataproc supports open table formats natively (Iceberg, Delta, Hudi), integrates with GCS for storage, and publishes output to BigQuery. As of 2025, BigQuery Serverless Spark is GA — Spark workloads can run directly inside BigQuery Studio without a separate Dataproc cluster.

Feature	AWS EMR	Azure HDInsight + Databricks	OCI Big Data + Data Flow	GCP Dataproc
Managed Hadoop	EMR on EC2	HDInsight 5.1 (legacy path)	OCI Big Data Service	Dataproc Standard clusters
Managed Spark (serverless)	EMR Serverless	Azure Databricks Serverless	OCI Data Flow	Dataproc Serverless
Kubernetes-based execution	EMR on EKS	Databricks on AKS	None native	Dataproc on GKE
Strategic platform	EMR (all tiers)	Databricks (new workloads)	OCI Data Flow	Dataproc Serverless + BigQuery Spark
Spot / preemptible workers	EC2 Spot	Azure Spot VMs	OCI Preemptible VMs	Spot VMs
Supported frameworks	Hadoop, Spark, Hive, Presto, Flink, HBase, Hudi	Spark, Delta Lake (Databricks), Hive (HDInsight)	Hadoop, Spark, Hive, Kafka (Big Data); Spark (Data Flow)	Hadoop, Spark, Flink, Hive, Presto
Open table format	Iceberg, Hudi, Delta	Delta Lake (native in Databricks)	Iceberg (Object Storage + Data Flow)	Iceberg, Delta, Hudi
Integrated BI	QuickSight	Power BI (Databricks connector)	Oracle Analytics Cloud	Looker / BigQuery

Key differentiators:

Azure has shifted its primary Spark platform to Databricks, leaving HDInsight in legacy maintenance mode; combined with Unity Catalog governance, this makes Azure Databricks a particularly strong fit for Delta Lake lakehouse architectures.
GCP's BigQuery Serverless Spark collapses the boundary between the data warehouse and Spark: Spark workloads run inside BigQuery Studio against BigQuery or GCS data without switching tools or clusters.
EMR Serverless's per-job auto-provisioning model eliminates persistent cluster costs while retaining access to the full EMR framework ecosystem.
OCI Data Flow's OCPU-minute pricing and zero cluster management make it a cost-effective Spark option; OCI Big Data Service serves the full Hadoop ecosystem for teams requiring Hive, HBase, or Kafka.
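
The cost argument behind per-job provisioning (EMR Serverless, OCI Data Flow, Dataproc Serverless) can be illustrated with simple arithmetic. The sketch below is only an illustration: the workload figures and the price per core-minute are placeholders rather than published rates, and it assumes the same unit price for both models, whereas serverless tiers are often priced higher per unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """A recurring Spark batch workload (all figures are hypothetical)."""
    jobs_per_day: int
    minutes_per_job: float
    cores_per_job: int


def per_job_daily_cost(w: Workload, rate_per_core_minute: float) -> float:
    """Cost when compute is billed only while a job is running."""
    return w.jobs_per_day * w.minutes_per_job * w.cores_per_job * rate_per_core_minute


def persistent_daily_cost(cluster_cores: int, rate_per_core_minute: float) -> float:
    """Cost of a cluster that stays up for all 1,440 minutes of the day."""
    return cluster_cores * 24 * 60 * rate_per_core_minute


def utilization(w: Workload, cluster_cores: int) -> float:
    """Fraction of an always-on cluster's capacity the workload actually uses."""
    busy_core_minutes = w.jobs_per_day * w.minutes_per_job * w.cores_per_job
    return busy_core_minutes / (cluster_cores * 24 * 60)


nightly = Workload(jobs_per_day=4, minutes_per_job=30, cores_per_job=16)
RATE = 0.001  # placeholder price per core-minute, not a published rate

serverless = per_job_daily_cost(nightly, RATE)
always_on = persistent_daily_cost(cluster_cores=16, rate_per_core_minute=RATE)
print(f"per-job: {serverless:.2f}  always-on: {always_on:.2f}")
print(f"cluster utilization: {utilization(nightly, 16):.1%}")
```

With these placeholder numbers the always-on cluster is busy about 8% of the day, which is the situation where per-job billing pays off. As utilization approaches 100%, the gap closes and a persistent cluster can become the cheaper option.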

8. Batch Processing

Batch processing services schedule and execute large-scale compute jobs against datasets, distinct from streaming (which is continuous) and interactive warehousing (which is query-on-demand).

AWS — AWS Batch + AWS Glue (ETL jobs) + Amazon EMR (transient clusters) AWS Batch is the dedicated batch compute orchestration service: it manages job queues, compute environments (EC2 or Fargate), and job dependencies. AWS Batch dynamically provisions the optimal instance type and count for each batch job, supporting containerized applications on ECS, EKS, or Fargate. Step Functions integrates with AWS Batch for dependency-chained workflows with retries and conditional logic. For data-centric batch ETL, AWS Glue serverless jobs and EMR transient clusters are the more common patterns. Amazon EventBridge Scheduler provides time-based and event-based job triggering.

Azure — Azure Batch + Azure Data Factory Pipelines + Microsoft Fabric Azure Batch is the managed batch compute service: manages VM pools, job scheduling, and task execution for HPC and parallel workloads. Azure Data Factory pipelines provide scheduled and event-triggered ETL batch workflows with support for parallel execution, dependency management, retry policies, and integration with Databricks, HDInsight, and SQL endpoints. Microsoft Fabric Data Pipelines (the Fabric-integrated version of ADF) is the recommended path for new Fabric deployments. Azure Logic Apps provides event-driven workflow automation for lighter-weight batch triggers.

OCI — OCI Data Flow (Spark batch jobs) + OCI Data Integration (batch ETL pipelines) OCI does not have a dedicated general-purpose batch compute service equivalent to AWS Batch or Azure Batch. Batch data processing on OCI is achieved through OCI Data Flow (Apache Spark batch jobs submitted on-demand), OCI Data Integration (scheduled ETL/ELT pipeline runs), OCI GoldenGate Data Transforms (batch ELT pipelines), and OCI Functions + OCI Scheduler for compute-light batch tasks. Oracle Database scheduled jobs (DBMS_SCHEDULER) are commonly used for database-tier batch processing.

GCP — Cloud Batch + Dataflow (batch mode) + Dataproc Serverless Cloud Batch (GA 2022) is GCP's managed batch compute service for containerized and script-based batch jobs: manages VM provisioning, job queues, retries, and logging without cluster management. Cloud Dataflow in batch mode (Apache Beam) handles large-scale data processing pipelines: it auto-scales workers, handles backpressure, and shuts down after job completion. Dataproc Serverless batch runs Spark batch jobs without a persistent cluster. Cloud Composer (managed Apache Airflow) is the standard workflow orchestration layer for scheduling and monitoring complex multi-step batch pipelines across GCP services.
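
All four providers' orchestration layers described above (Step Functions, ADF pipelines, OCI Data Integration, Cloud Composer DAGs) resolve job dependencies into an execution order. The following is a minimal provider-neutral sketch of that idea using Python's standard `graphlib` module. The job names and the `execution_waves` helper are hypothetical and do not correspond to any provider's API.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_and_clean": {"extract_orders", "extract_customers"},
    "aggregate_daily": {"join_and_clean"},
    "publish_report": {"aggregate_daily"},
}


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves; jobs in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unblock
    return waves


for i, wave in enumerate(execution_waves(pipeline), start=1):
    print(f"wave {i}: {', '.join(wave)}")
```

Here the two extract jobs land in the first wave and can run concurrently, while each later job waits for its upstream wave. A managed orchestrator adds scheduling, retries, and monitoring on top of this ordering.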

Feature	AWS	Azure	OCI	GCP
Managed batch compute service	AWS Batch (ECS/EKS/Fargate)	Azure Batch (VM pools)	None native (use Data Flow / OCI Functions)	Cloud Batch (VMs / containers)
Batch ETL / data pipelines	Glue jobs + EMR transient	ADF Pipelines / Fabric Pipelines	OCI Data Integration + GoldenGate	Dataflow batch + Data Fusion
Workflow orchestration	Step Functions + EventBridge	ADF + Logic Apps	OCI Data Integration orchestration	Cloud Composer (Apache Airflow)
Serverless batch	Glue Serverless + EMR Serverless	Databricks Serverless + ADF Serverless IR	OCI Data Flow (per-job Spark)	Dataproc Serverless + Dataflow batch
HPC / parallel batch	AWS Batch + EC2 HPC instances	Azure Batch + HPC VM sizes	OCI HPC shapes + OCI Data Flow	Cloud Batch + H3 / C2 VMs
Job dependency management	Step Functions	ADF dependency / Logic Apps	OCI DI pipeline dependencies	Cloud Composer DAGs
Spot / preemptible batch	AWS Batch Spot Instances	Azure Batch Spot VMs	OCI Preemptible Instances	Cloud Batch Spot VMs

Key differentiators:

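Cloud Composer (managed Apache Airflow) is GCP's standard orchestration layer for complex multi-step batch pipelines built on the industry-standard DAG model. Managed Airflow is not unique to GCP, however: AWS offers Amazon Managed Workflows for Apache Airflow (MWAA), and Azure offers Workflow Orchestration Manager in Azure Data Factory. The difference is one of emphasis, since GCP positions Airflow as its primary orchestrator, whereas AWS and Azure lead with Step Functions and ADF respectively.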
AWS Batch is the most mature and feature-complete dedicated batch compute service, with deep Step Functions integration for complex dependency graphs and native support for containerized jobs on Fargate and for HPC and GPU workloads on EC2 instances.
OCI's lack of a native batch compute service (equivalent to AWS Batch or Azure Batch) means general-purpose parallel compute batch workloads must be composed from OCI Data Flow, OCI Functions, or VM-based solutions.
Azure Data Factory's batch pipeline model — with conditional execution, retry policies, parallel branches, and integration with Databricks and SQL endpoints — makes it the most data-engineer-friendly batch orchestration tool for Azure-native workloads.
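
Retry policies of the kind mentioned above are usually configured declaratively in the orchestrator, but the underlying behavior is easy to sketch. The example below is a generic retry loop with exponential backoff, not any provider's implementation. The `run_with_retries` helper, its parameters, and the `flaky_copy` task are hypothetical, and real services may use fixed retry intervals instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last failure
            sleep(base_delay_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


# Demo: a hypothetical task that fails twice, then succeeds.
calls = {"n": 0}


def flaky_copy() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "copied"


delays: list[float] = []  # record delays instead of actually sleeping
result = run_with_retries(flaky_copy, max_attempts=4, sleep=delays.append)
print(result, delays)  # prints: copied [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the demo instantaneous and makes the backoff schedule easy to inspect. In production the default `time.sleep` would apply.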

Summary Comparison

Category	AWS	Azure	OCI	GCP
Data Warehouse	Amazon Redshift / Redshift Serverless	Azure Synapse Analytics / Microsoft Fabric	Autonomous Data Warehouse (ADW)	BigQuery
Streaming Ingest	Kinesis Data Streams	Azure Event Hubs	OCI Streaming	Cloud Pub/Sub
Managed Kafka	Amazon MSK / MSK Serverless	Event Hubs (Kafka-compatible)	OCI Streaming with Apache Kafka	Pub/Sub (Kafka connector)
Stream Processing	Managed Service for Apache Flink	Azure Stream Analytics / Fabric Real-Time	OCI GoldenGate (CDC + streaming)	Cloud Dataflow (Apache Beam)
ETL / Data Integration	AWS Glue	Azure Data Factory / Fabric Pipelines	OCI Data Integration / GoldenGate	Cloud Data Fusion / Dataform
Data Lake Storage	Amazon S3	ADLS Gen2 / Fabric OneLake	OCI Object Storage	Google Cloud Storage
Data Lake Governance	AWS Lake Formation	Microsoft Purview	OCI Data Catalog	Dataplex
BI / Dashboarding	Amazon QuickSight	Microsoft Power BI	Oracle Analytics Cloud	Looker / Looker Studio
Data Catalog	AWS Glue Data Catalog + DataZone	Microsoft Purview	OCI Data Catalog	Dataplex Universal Catalog
Hadoop / Spark	Amazon EMR	Azure HDInsight + Azure Databricks	OCI Big Data + OCI Data Flow	Cloud Dataproc
Batch Compute	AWS Batch	Azure Batch	OCI Data Flow (no native batch)	Cloud Batch
Workflow Orchestration	Step Functions + EventBridge	ADF + Logic Apps	OCI DI Pipelines	Cloud Composer (Apache Airflow)
