As much as I rave about Iceberg, I will admit that when it comes to data processing, Delta Lake is very far ahead. If you don’t believe me, just go look at this article by the legendary Daniel Beach, where he shows what can be done with Delta in just 2 lines of code. On top of Delta, Polars is also arguably way ahead of DuckDB when it comes to lakehouse support. Why would I say that? A few things come to mind:
Polars has built-in support to write to cloud object stores such as S3 and GCS - no extensions, no extra packages required. DuckDB requires an extension for AWS; for GCS, you have to perform some hackery with the fsspec system (see this post).
Polars has built-in support to WRITE to Delta Lake. There is currently no support to write to Iceberg in Polars or DuckDB. On top of that, if you ever want to read Iceberg from Polars, you have to point it at the most recent metadata file, which changes every time the table is updated. Thus, the only method I’ve found reasonable is to pull in pyiceberg (yay, another package) and load the table from there before having Polars read it - a sketch of that workaround follows this list. For a fuller example, you can read this post.
Polars supports the MERGE command for Delta Lake, which means I can easily and quickly load incremental data into a dataset. DuckDB does not support the MERGE command and requires a little buffoonery to get it to effectively perform an “upsert” of some sort. The core problem with that upsert approach is that you have to OVERWRITE the entire table…not fun.
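To make that Iceberg read workaround concrete, here is a rough sketch of what it looks like in practice. The catalog name and table identifier are placeholders I made up for illustration, and pyiceberg is assumed to already be configured (via ~/.pyiceberg.yaml or environment variables):

```python
import polars as pl
from pyiceberg.catalog import load_catalog

# Let pyiceberg resolve the current metadata file so we don't have to
# chase it by hand (catalog and table names here are hypothetical).
catalog = load_catalog("default")
iceberg_table = catalog.load_table("bike_shop.order_header")

# Polars can scan the pyiceberg Table object directly.
lf = pl.scan_iceberg(iceberg_table)
print(lf.head(5).collect())
```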
With that said, I wanted to illustrate in this article how easy it is to get a Lakehouse going in the public cloud. You don’t need to spend absurd amounts of $$$ nor do you need a lot of services. We will complete this task with just 3 things:
A Python script to generate test data and another to illustrate a lakehouse workload converting the raw data to the Delta Lake format
A Google Cloud Storage bucket
A BigQuery project
Alright, Let’s Go Build a Lakehouse with Polars
First things first…we have to pick a retail segment to build our faux lakehouse for. For this article, I chose a bicycle shop. This bicycle shop gets daily orders and needs to process them from their raw format and load them into the lakehouse for downstream consumption. So how did I create the faux datasets for this bicycle shop?…Enter the “AI”…a.k.a. “chat gippity”.
Pro Tip - I have found lately that ChatGPT is great for generating fake datasets.
I can now outsource the building of my fake datasets to ChatGPT with some simple prompts. I gave “the AI” this prompt and it spit out some pretty reasonable Polars code for me:
“I want you in python to create 2 fake dataframes using polars. The first one is bicycle shop order header data. I want to generate 10k rows of this fake data. I then want you to create a detail dataset that corresponds to the order numbers in the header so I can join them later. I want each dataset to be written to CSV via polars and saved to a specific GCS folder that I can provide.”
With that prompt, ChatGPT on its first try spit out these 2 Python functions to generate order header and order detail data:
Order Header Function
Order Detail Function
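I won’t reproduce the generated code verbatim here, but a stripped-down sketch of what the order header side of it could look like is below. The column names, bucket path, and the use of gcsfs for the upload are my stand-ins for illustration, not necessarily what ChatGPT produced:

```python
import random
from datetime import date

import gcsfs
import polars as pl


def generate_order_header(n_orders: int, start_order_no: int, order_date: date) -> pl.DataFrame:
    """Build a fake order-header DataFrame (columns are illustrative)."""
    return pl.DataFrame(
        {
            "order_number": [start_order_no + i for i in range(n_orders)],
            "order_date": [order_date] * n_orders,
            "customer_id": [random.randint(1, 500) for _ in range(n_orders)],
            "store_id": [random.randint(1, 5) for _ in range(n_orders)],
        }
    )


def write_csv_to_gcs(df: pl.DataFrame, gcs_path: str) -> None:
    """Stream the CSV bytes straight into a GCS object via gcsfs."""
    fs = gcsfs.GCSFileSystem()  # uses application-default credentials
    with fs.open(gcs_path, "wb") as f:
        df.write_csv(f)


# Hypothetical usage: one raw file per order date.
header_df = generate_order_header(10_000, start_order_no=1, order_date=date(2024, 1, 1))
write_csv_to_gcs(header_df, "gs://my-bike-shop-raw/orders/order_header_2024-01-01.csv")
```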
I put those 2 functions into a .py file, wrapped them in a main function where I can easily toggle the starting order number and the order date (which affects the name of the CSV file), ran the script a few times, and generated several CSV files in GCS. This is what the bucket looked like after generating the files:
You Said Lakehouse Though…CSV Ain’t A Lakehouse Table Format
Alright, as I mentioned before, we first needed to generate our daily orders raw files. Now that we have them, how do we get them into a lakehouse open table format, and how do we build the script so it can do incremental updates?
As I mentioned earlier, Polars has built-in support for writing to Delta Lake. I wrote a function that checks whether the table already exists: if it doesn’t, the table gets created for the first time; if it does, a merge is performed:
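The full script is linked in the summary below; a minimal sketch of the pattern, with a made-up table path and merge key, looks like this (delta-rs is assumed to pick up GCS credentials from GOOGLE_APPLICATION_CREDENTIALS):

```python
import polars as pl
from deltalake import DeltaTable
from deltalake.exceptions import TableNotFoundError


def upsert_to_delta(df: pl.DataFrame, table_uri: str, merge_predicate: str) -> None:
    """Create the Delta table on first run, otherwise MERGE the new batch in."""
    try:
        DeltaTable(table_uri)  # raises if no Delta log exists at this path yet
        table_exists = True
    except TableNotFoundError:
        table_exists = False

    if not table_exists:
        # First load: write the table from scratch.
        df.write_delta(table_uri, mode="error")
        return

    # Incremental load: write_delta in "merge" mode returns a deltalake TableMerger.
    (
        df.write_delta(
            table_uri,
            mode="merge",
            delta_merge_options={
                "predicate": merge_predicate,  # e.g. "s.order_number = t.order_number"
                "source_alias": "s",
                "target_alias": "t",
            },
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )


# Hypothetical call for the order header table:
# upsert_to_delta(header_df, "gs://my-bike-shop-lake/delta/order_header",
#                 "s.order_number = t.order_number")
```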
And when I say Polars supports Delta’s MERGE command…I mean full support. Just look at the docs here from Delta Lake. Delta supports all the MERGE predicates, including rarities like “WHEN NOT MATCHED BY SOURCE”, which is a go-to of mine when I need to keep a target table in sync with a source table.
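For example, if you also wanted the target table to drop rows that no longer appear in the incoming batch, the same merge builder exposes that predicate. A hedged sketch with the same made-up paths and aliases as above:

```python
import polars as pl

# Incoming raw batch (assumes gcsfs is installed so Polars can read straight from GCS).
incoming = pl.read_csv("gs://my-bike-shop-raw/orders/order_header_2024-01-02.csv")

(
    incoming.write_delta(
        "gs://my-bike-shop-lake/delta/order_header",
        mode="merge",
        delta_merge_options={
            "predicate": "s.order_number = t.order_number",
            "source_alias": "s",
            "target_alias": "t",
        },
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    # Anything in the target that is no longer in the source gets removed,
    # keeping the two tables perfectly in sync.
    .when_not_matched_by_source_delete()
    .execute()
)
```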
The main data processing script feeds the order header dataframe and the order detail dataframe separately to that merge function for processing.
OK, But Do My Downstream Consumers Now Need to Know How To Write Polars Code?
This question comes up a lot in my experience. I’ll demo Polars, DuckDB, or Spark to a team, and they freak out, thinking, “My analysts don’t know how to write Python code, let alone this fancy dataframe package stuff. Do they now have to go learn this?”
Absolutely not; they can use SQL-friendly front ends like Google BigQuery and work with the data just fine. For this tutorial, we can simply set up external tables for both the header and detail Delta Lake tables as follows:
Side Note - Before you can create external tables in BigQuery pointing to GCS, you first have to set up what Google calls an “external connection”. This took me a minute because the drop-down option for connecting to GCS is not obvious: the one you need to choose is “Vertex AI remote models, remote functions and BigLake (Cloud Resource)“. Once you create the external connection, head over to IAM and make sure the service account tied to that connection can read from your GCS bucket.
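A rough sketch of the external table DDL is below; every identifier (project, dataset, connection, bucket path) is a placeholder, and I’m issuing it through the BigQuery Python client only to keep everything in one script - you could just as easily paste the SQL into the console:

```python
from google.cloud import bigquery

# All identifiers below (project, dataset, connection, bucket) are placeholders.
client = bigquery.Client(project="my-gcp-project")

ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my-gcp-project.bike_shop.order_header`
WITH CONNECTION `my-gcp-project.us.gcs-biglake-conn`
OPTIONS (
  format = 'DELTA_LAKE',
  uris = ['gs://my-bike-shop-lake/delta/order_header']
);
"""

client.query(ddl).result()  # repeat with the order_detail path for the second table
```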
And now that those tables are set up, our consumers can simply write some BigQuery SQL to analyze the data. Here’s an example query that joins the two tables to find the orders with the highest order quantities:
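As a rough illustration of what that kind of query could look like (the table and column names below are assumed from my placeholder schema, not the article’s exact data):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

# The SQL string is what an analyst would paste directly into the BigQuery console.
sql = """
SELECT
  h.order_number,
  h.order_date,
  SUM(d.quantity) AS total_quantity
FROM `my-gcp-project.bike_shop.order_header` AS h
JOIN `my-gcp-project.bike_shop.order_detail` AS d
  ON h.order_number = d.order_number
GROUP BY h.order_number, h.order_date
ORDER BY total_quantity DESC
LIMIT 10;
"""

for row in client.query(sql).result():
    print(row.order_number, row.order_date, row.total_quantity)
```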
And that’s it. Over time, you can easily modify the scripts to add more tables and data sources as needed, and as long as you are not pushing into the tens-of-gigabytes territory or higher on data size, you should be just fine with this setup. If your data does grow a lot, you still have options to stick with Polars via some clever partitioning and archiving.
Summary
This article walked us through how to build a lakehouse on GCS using Polars + Delta Lake for the back-end processing and Google BigQuery for the front end. Here’s a high-level summary of what we did:
Created some fake datasets with some ChatGPT prompting and had Polars write the CSVs to GCS storage. The script for that can be found here.
Wrote a script to process the raw CSVs and save them to GCS in the Delta Lake format. The script creates the tables if they don’t exist and then performs a MERGE to do incremental updates, so that as more data comes in, we can easily update our datasets. The script for that can be found here.
Created a couple of external tables in Google BigQuery that point to our Delta Lake tables in GCS and then performed an analytical query on them. That script is located here.
On a final note here, I was able to do all of this work on GCP for free. Google has a nice free tier for storage and BigQuery usage/testing. Honestly, a small business could do the same and probably get away with a cloud bill of almost $0 per month to drive their analytics…and dare I say, maybe even their operational workloads.
I’m hoping that Iceberg and DuckDB soon give us the same features that Polars and Delta do today - brain-dead-easy support for creating and merging lakehouse tables in the cloud. It would be great to have more choices.
Hope you enjoyed reading this and that it gave you some ideas on how easy it is to set up analytical workloads in the cloud with the modern lakehouse.
Thanks,
Matt