One of the nicest things about the cloud I’ve come to know very well is serverless compute. I can, in just a few seconds, spin up a Spark cluster, run some PySpark code on it to transform my data and voila…the dataset lands in a consumable format for the end consumer. The popularity of serverless compute is very evident these days: Databricks and Snowflake are making a killing on it, and Google BigQuery’s serverless architecture makes crunching data via SQL brain-dead easy.
But…with all these nice shiny toys comes a cost, and a rather large one. It is no secret that compute in the cloud (especially serverless compute) is expensive. Over the last few years we’ve seen many prominent figures announce they are leaving the cloud to save on compute costs; the one most notable to me was the creator of Ruby on Rails, David Heinemeier Hansson, who created a ton of buzz when he put out this note.
This article is not trying to sway you to stay in the cloud or move your compute back on-prem. Rather, this is just me providing a lens on how easy it is to use DuckDB to query GCS data locally. You could end up using this approach to process small-to-medium datasets locally (my current interpretation of a medium dataset is 5GB or less 😜), but I’ll leave that call up to you. So let’s dive in and show you how easy it is.
Bootstrapping DuckDB to GCS
The first thing we need to do is pull in some Python libraries that play nice with DuckDB to make it aware of the GCS file protocol. To do this, we will pip install the following packages:
gcsfs
fsspec
Once you have brought those in, we can register GCS with our default local GCP creds as easy as this:
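Something along these lines (a minimal sketch using DuckDB’s fsspec integration — the variable names and the in-line pip command are mine, just for illustration):

```python
# Rough sketch of the registration step.
# First: pip install duckdb gcsfs fsspec

import duckdb
from fsspec import filesystem

# Open a local DuckDB connection and register a GCS filesystem backed by gcsfs.
# gcsfs picks up your default local GCP credentials automatically.
con = duckdb.connect()
con.register_filesystem(filesystem("gcs"))
```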
Pretty straightforward, eh? 😁
You might be asking yourself - what does Matt mean by “default local creds”? To see how we do that, read this post and look for the part where I’m running “gcloud auth application-default login”.
Querying GCS Data
Now that we have made DuckDB aware of the GCS file store protocol, let’s go try to query some CSV files.
Pro Tip - I’ve mentioned this in other posts, but it is good practice to NEVER expose the name of your actual bucket. Otherwise, you might end up like this poor soul.
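With that caveat in mind, a query against the CSV files looks roughly like this (a sketch only — the bucket and folder names below are placeholders, so swap in your own):

```python
# Query CSV files sitting in GCS directly from the local DuckDB connection.
# 'my-demo-bucket' and 'test_data' are placeholders - don't publish your real bucket name!
result = con.sql("""
    SELECT *
    FROM read_csv_auto('gcs://my-demo-bucket/test_data/*.csv')
    LIMIT 10
""").df()

print(result)
```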
Pretty straightforward. A few things you need to be aware of, though:
When you read a CSV from GCS on your local workstation, (I think) your machine has to download the entire file. I doubt DuckDB can crack the file open in GCS in place and fetch only the columns you want. So be mindful of the egress latency and potential cloud charges.
If you are wondering how I have such lovely test data to query against, you can reference the same article I mentioned earlier in this post and look for the section where I run my “fd” command to generate fake data.
Now Let’s Write Some Data
Given that querying data from GCS with DuckDB was as brain-dead easy as I promised, let’s see if this ease of use goes both ways: can I process some data and push the results back up to GCS?
What I’ll do here is define a query that reads all my test data from a specific GCS folder (it’s 100k rows in total), does some aggregation, and writes the results out to Parquet in GCS:
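Roughly like so (again, a sketch — the bucket, folder, and column names are placeholders since my fake data’s exact schema isn’t the point here):

```python
# Read every CSV under the test-data prefix, aggregate, and write the result
# back up to GCS as a single Parquet file. Paths and column names are placeholders.
con.sql("""
    COPY (
        SELECT customer_id,
               COUNT(*)    AS order_count,
               SUM(amount) AS total_amount
        FROM read_csv_auto('gcs://my-demo-bucket/test_data/*.csv')
        GROUP BY customer_id
    )
    TO 'gcs://my-demo-bucket/results/aggregated.parquet' (FORMAT PARQUET)
""")
```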
WOW - is it really that easy? Did we really just read some data down locally, crunch it, and push it back up to GCS in less than a dozen lines of code?…we absolutely did 😁. But to double-check, let’s validate the Parquet file we just wrote with a little more code:
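Something like this (same placeholder paths as before):

```python
# Read the Parquet file back from GCS to confirm the write landed as expected.
check = con.sql("""
    SELECT *
    FROM read_parquet('gcs://my-demo-bucket/results/aggregated.parquet')
    LIMIT 10
""").df()

print(check)
```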
And there you go…
Conclusion
This article demonstrated how ridiculously easy it is to read data from GCS, process it locally, and push results back up to GCS with DuckDB. And believe it or not, this same ease of use also applies to AWS S3; heck, DuckDB even has an AWS extension for it that can be found here. In terms of whether or not you should take this approach, you really need to examine your workloads and consider the following:
Can I, and do I, want to maintain scripts on-prem?
Are the cost savings there? This is a tricky one and will require some balancing and testing, as cloud egress costs are sometimes hard to track down on a per-query basis
Can I process and write once and then publish that final dataset up to the cloud?
Is this going to be widely supportable by my organization?
Can we construct these data processing scripts in such a way that it will be very easy to pivot them from on-prem to cloud compute on an as-needed basis?
Those are just a handful of things that come to mind. I think the jury will be out for a while, though, as orgs continue to mature their cloud journeys and weigh the pros and cons.
Here’s a link to the code we wrote today.
Thanks for reading,
Matt