
Part 2: Comparing features and developer experience of data warehouses

If you haven't seen it yet, here is the previous post on pricing:

In this post, we will compare the features of these data warehouses. Based on my personal experience, I've added scores and product offerings for each section:

Features

| | Snowflake | AWS Redshift Serverless | Google BigQuery | Databricks | AWS Athena |
|---|---|---|---|---|---|
| Streaming | Streams, Dynamic Tables 5️⃣ | Materialized views 4️⃣ | Materialized views, Continuous queries 4️⃣ | Delta Live Tables 5️⃣ | - |
| HTAP | Hybrid Tables | [Glue to Aurora], Zero-ETL | BI Engine, BigTable, Spanner, Pub/Sub | - | - |
| Scheduling | Tasks 5️⃣ | - | Scheduled queries, Workflows 4️⃣ | Workflows 5️⃣ | - |
| Sharing Capabilities | Data (Secure Share), Code (Framework) 5️⃣ | AWS Clean Rooms 4️⃣ | Analytics Hub, VMs 4️⃣ | Delta Sharing, Compute 5️⃣ | - |
| Marketplace | 3K listings, most advanced 5️⃣ | 5K listings 4️⃣ | 1.5K listings 3️⃣ | 2.5K listings, most advanced 5️⃣ | 5K listings 4️⃣ |
| Cloud | AWS, GCP, Azure | AWS | GCP, AWS | AWS, GCP, Azure | AWS |
| Cross-Region in the Cloud | Auto Replication | - | Query Across Region | - | - |
| Bring your code | Containers, Pandas, Python / Javascript 5️⃣ | 1️⃣ | Containers, Spark via Dataproc 4️⃣ | Spark 5️⃣ | Spark, Data Source Connectors 3️⃣ |
| Catalog | Polaris (Apache Iceberg), AWS Glue, Snowflake 4️⃣ | AWS Glue 3️⃣ | AWS Glue, BigQuery, Google Data Catalog 4️⃣ | Delta Lake, Apache Iceberg 5️⃣ | AWS Glue 3️⃣ |
| Best for | 🦄 SQL-for-everything, analytics engineers | getting serious on AWS | already on GCP | Non-SQL / ML workloads, data engineers | starting on AWS |

Experience

| | Snowflake | AWS Redshift (S) | Google BigQuery | Databricks | AWS Athena |
|---|---|---|---|---|---|
| Maintenance | 4️⃣ | 4️⃣ | 5️⃣ | 3️⃣ | 3️⃣ |
| Onboarding | 5️⃣ | 3️⃣ | 4️⃣ | 3️⃣ | 4️⃣ |
| Ecosystem | 4️⃣ | 4️⃣ | 4️⃣ | 5️⃣ | 4️⃣ |
| Capability | 5️⃣ | 3️⃣ | 4️⃣ | 5️⃣ | 4️⃣ |

Best overall: Snowflake ❄️

To run queries on Snowflake, you need to create an account. Snowflake accounts live within a single region of your cloud provider, so you can't have two tables in different regions or cloud providers in one account. This design mainly isolates you from unexpected egress costs across cloud providers; instead, Snowflake recommends its auto-replication feature, which ensures the data is copied only once across regions.
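As a sketch of what cross-region replication looks like (account and database names here are made up for illustration):

```sql
-- On the primary account: allow a database to replicate to another account.
ALTER DATABASE analytics ENABLE REPLICATION TO ACCOUNTS myorg.account_eu;

-- On the secondary account: create the replica and refresh it on demand.
CREATE DATABASE analytics AS REPLICA OF myorg.account_us.analytics;
ALTER DATABASE analytics REFRESH;
```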

SQL-for-everything

Snowflake provides SQL support for its resources, which I love. You create accounts, provision warehouses, run Kubernetes (it's weird to write YAML in SQL though), and share data, all with SQL. Their syntax is consistent, rich, and well-documented. Their CREATE [whatever] statements provide a very convenient onboarding experience for newcomers: you don't need to dig into specific REST API documentation or deal with authentication. Moreover, features such as dynamic tables make it easier to use Snowflake with SQL-based pipelines.
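To give a flavor of the SQL-for-everything approach, here is a minimal sketch (object names are illustrative, not from any real setup):

```sql
-- Provision compute with plain SQL: no console clicks, no REST calls.
CREATE WAREHOUSE transform_wh
  WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- A dynamic table that incrementally refreshes from a source table.
CREATE DYNAMIC TABLE daily_orders
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
  AS SELECT order_date, COUNT(*) AS orders
     FROM raw_orders
     GROUP BY order_date;
```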

Connectivity

  • You can bring your Python / Javascript and map them as user-defined (scalar, table, aggregate) functions
  • Snowflake scales out Pandas, as an alternative to Databricks's PySpark.
  • You can connect to your cloud functions (Google Cloud Functions, AWS Lambda, etc.) via external functions.
  • You can run Docker images inside your account with Container Services.
  • Snowflake has Streams and Tasks, which let you create workflows based on event- and time-based triggers.
  • The Snowflake Native Application Framework provides a first-class data application experience for sharing data and code with your partners. It's superior to the AWS and GCP Marketplaces.
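The first and fifth bullets above can be sketched in SQL as follows (function, stream, and table names are hypothetical):

```sql
-- A scalar Python UDF, mapped directly in SQL.
CREATE OR REPLACE FUNCTION add_one(x INT)
  RETURNS INT
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  HANDLER = 'add_one'
AS $$
def add_one(x):
    return x + 1
$$;

-- A task that polls a stream and runs only when new rows have arrived.
CREATE TASK process_new_orders
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
  AS INSERT INTO orders_clean SELECT * FROM orders_stream;
```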

Hidden Gem: BigQuery

I believe BigQuery is the most underrated data warehouse, given that it's almost 15 years old. It was built on Google's modern infrastructure years ago and can run on other clouds such as AWS via Omni. If it were a standalone product outside of GCP, it would probably be even more successful. Snowflake's major cloud provider is AWS, and BigQuery is indeed a big competitor of Snowflake, but selling BigQuery outside of the GCP ecosystem is tough as it's highly coupled with GCP internals.

Highly coupled with GCP

  • BigQuery integrates well with most GCP products and your Google Workspace data, which makes it a natural choice in the Google ecosystem.
  • BigQuery doesn't have the database/schema/table hierarchy of most traditional and modern data warehouses. Your GCP project is your database, and datasets play the role of schemas (in BigQuery DDL, CREATE SCHEMA actually creates a dataset). Under a dataset you create tables, so the full qualifier of a table is project.dataset.table.
  • Google Marketplace provides ways to share data with partners, but BigQuery doesn't provide an application framework for building and deploying apps as advanced as Snowflake's and Databricks's.
  • It's not obvious, but if you use BigQuery Omni on AWS S3, you don't pay any egress cost, so it's truly cross-cloud, unlike any of the AWS offerings.

Connectivity

  • You can bring your Javascript and map it as user-defined (scalar, table, aggregate) functions.
  • BigQuery runs Spark procedures via Dataproc, GCP's managed Spark cluster offering.
  • You can connect to your Google Cloud Functions and run containers on Cloud Run via remote functions.
  • You can connect cross-cloud data on AWS and Azure via Omni which doesn't incur any egress cost as Google runs BigQuery natively in your cloud provider.
  • You can connect most of your GCP data in and out for building workflows, using BigTable and PubSub integrations.
  • You can connect your Google data including spreadsheets and Google Drive.
  • You can use materialized views and continuous queries to create streaming / incremental data pipelines.
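The first bullet above, a Javascript UDF, looks like this in BigQuery SQL (function name and logic are made up for illustration):

```sql
-- An inline JavaScript UDF, usable directly in the query that defines it.
CREATE TEMP FUNCTION multiply(x FLOAT64, y FLOAT64)
  RETURNS FLOAT64
  LANGUAGE js AS r"""
    return x * y;
  """;

SELECT multiply(3, 4) AS result;
```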

Most flexible: Databricks

Databricks's onboarding experience is among the worst I've had, primarily because I don't have much Spark experience. Their BYOC (Bring Your Own Cloud) approach is a good idea, but it's not as easy as it sounds. You need a good understanding of Spark and Databricks to use them effectively.

Databricks runs on Spark, one of the most popular data processing engines. It's a natural choice if you're already using Spark on your own cloud. With Databricks, all the compute happens in your own environment, so you have full transparency.

Cool again: Redshift Serverless

When Redshift came out in 2012, it revolutionised the data warehouse ecosystem: moving from Hadoop to Redshift was a huge boost. However, over the last 10 years, its competitor Snowflake has been eating Redshift's lunch. Even Redshift salespeople were selling Snowflake to their customers, probably because it's clearly a better product and AWS makes money from Snowflake's compute anyway.

Redshift Serverless feels like a comeback for Redshift. Given that companies already on AWS would need to sign a separate contract with Snowflake, Redshift Serverless is a natural choice today.

Easiest to start with: AWS Athena

AWS Athena uses AWS Glue as its data catalog; it's a serverless data warehouse over your data on AWS and Iceberg tables. It can create tables, but it's simpler than its competitors. That makes it a natural choice if you're on AWS and need something quick and easy to use.
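A minimal sketch of how simple it is: registering existing S3 data as a queryable table in the Glue catalog (the bucket, table, and columns here are made up):

```sql
-- Point Athena at Parquet files already sitting in S3;
-- the table definition lands in the AWS Glue catalog.
CREATE EXTERNAL TABLE access_logs (
  request_time timestamp,
  status_code  int,
  path         string
)
STORED AS PARQUET
LOCATION 's3://example-bucket/access-logs/';
```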


TLDR;

  • If you're on AWS, using Glue will give you the option to use Athena, Redshift Serverless, Snowflake, and BigQuery. Athena is the easiest to start with on AWS and you can switch over to Redshift Serverless over time.
  • If you're on GCP, BigQuery is probably the best option for you.
  • If you're cross-cloud:
    1. Snowflake is the best option if you have SQL-familiar people like analysts and analytics engineers.
    2. Databricks is the best option if you have data engineers and data scientists, focusing on ML.
    3. If you're expanding to AWS from GCP, BigQuery is still worth investing in.
Would you like to comment? Reach out to me @bu7emba!