I’ve gotten some good practice lately creating and working with Iceberg tables, but there’s been a glaring problem. I’ve only been able to do so successfully with Spark. I’m not saying Spark is bad here, but for smaller workloads, there surely must be a simpler way to do this with less overhead, right?
In comes the PyIceberg Python package. It's built to read and write Iceberg tables and plays nicely with various data processing APIs such as DuckDB and Polars. This post will walk through generating some datasets in DuckDB and Polars, creating Iceberg tables from them, and then checking that we can actually read the data back for analytics.
When noodling through the PyIceberg documentation and other online examples, I hit several walls where the docs were apparently outdated or I was simply holding it wrong:
Some examples recommended putting your Iceberg configuration into a YAML file, but after a few tries I kept getting dumb errors such as “missing uri” even though the URI was clearly in the YAML file. Based on the most current docs on the PyIceberg website, this approach hit the sweet spot and actually worked for generating a local Iceberg catalog (which uses SQLite to manage the backend):
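Here's a minimal sketch of that catalog setup, adapted from the PyIceberg docs; the warehouse path and catalog name are placeholders from my local setup:

```python
import os

from pyiceberg.catalog.sql import SqlCatalog

# Local warehouse directory -- a placeholder path for this walkthrough
warehouse_path = "./warehouse"
os.makedirs(warehouse_path, exist_ok=True)

# SqlCatalog keeps the catalog metadata in a local SQLite database file
catalog = SqlCatalog(
    "local",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)
```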
Now that we have our Iceberg catalog up and running, let’s create what Iceberg calls a “namespace”; I think of a namespace as analogous to a schema in a normal database, i.e. a container of objects inside the database.
catalog.create_namespace("dummy_data")
Random side note - why are we now mixing the phrases “catalog” and “namespace” with database and schema?
Alright, now that we have our Iceberg catalog and a namespace to put our tables in, let’s generate a couple of data frames via DuckDB and Polars as follows:
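A rough sketch of that step; the column names and row counts here are made up purely for illustration:

```python
import duckdb
import polars as pl

# DuckDB: build a small dummy dataset and hand it back as a PyArrow table
duckdb_arrow = duckdb.sql("""
    SELECT i AS id,
           'user_' || i::VARCHAR AS name,
           random() AS score
    FROM range(1000) t(i)
""").arrow()

# Polars: build a second dummy dataset and convert it to PyArrow as well
polars_arrow = pl.DataFrame({
    "id": list(range(1000)),
    "category": ["a", "b", "c", "d"] * 250,
    "value": [i * 1.5 for i in range(1000)],
}).to_arrow()
```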
You will notice I’m converting each data frame to an Arrow table at the end. This is because PyIceberg needs the incoming data to be in Arrow format before it can create Iceberg tables from it. Let’s continue and create the Iceberg tables and load the data. I’ll also throw in a quick row count validation check to make sure all records were loaded:
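A sketch of the create-and-load step, assuming the catalog, namespace, and Arrow tables from above (the table names are placeholders):

```python
# Create the Iceberg tables from the Arrow schemas, then append the data
duckdb_tbl = catalog.create_table("dummy_data.duckdb_table", schema=duckdb_arrow.schema)
duckdb_tbl.append(duckdb_arrow)

polars_tbl = catalog.create_table("dummy_data.polars_table", schema=polars_arrow.schema)
polars_tbl.append(polars_arrow)

# Quick validation: scan each Iceberg table back and compare row counts
assert len(duckdb_tbl.scan().to_arrow()) == duckdb_arrow.num_rows
assert len(polars_tbl.scan().to_arrow()) == polars_arrow.num_rows
print("Row counts match!")
```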
Now to Read the Data Back
Ok, so at this point, we have successfully created an Iceberg catalog with a SQLite database backend, made a couple of data frames, and loaded them into the Iceberg catalog as tables. So, if we want to read it back, how does that look?
Let’s Try with DuckDB
Pro Tip 1 - The current extension in DuckDB for Iceberg stinks
You might ask, “Well, why does the current extension stink for DuckDB and Iceberg?” Simply put, from my testing, the iceberg_scan function cannot reliably traverse the Iceberg metadata folder structure and goes looking for files that PyIceberg does not generate, such as “version-hint.text”. Of course, that missing file does exist when you download the demo dataset in DuckDB’s documentation, which makes their demo work flawlessly. After some googling, it appears the “version-hint.text” file is an outdated part of the Iceberg spec and is no longer included, so it looks like DuckDB still has some work to do to get its extension up to mainstream caliber. Below is a screenshot demonstrating the error:
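For reference, this is roughly the kind of call that trips it up (a sketch; the table path is a placeholder for wherever PyIceberg wrote the table in your warehouse):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg;")
con.sql("LOAD iceberg;")

# iceberg_scan goes hunting for a version-hint.text file that PyIceberg never
# writes, so this errors out; the path below is a placeholder
con.sql("""
    SELECT count(*)
    FROM iceberg_scan('warehouse/dummy_data.db/duckdb_table')
""").show()
```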
Ok, so if the duckdb extension for iceberg doesn’t work, what are our options?
Pro Tip 2 - Leverage the table objects in PyIceberg to translate back to DuckDB or Polars
Using the PyIceberg table we previously created from the DuckDB data, we can reference it, create a table alias for it, and query it like so:
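A minimal sketch, assuming the `duckdb_tbl` object from the load step above; to_duckdb() registers the scan as a view on a DuckDB connection under whatever alias you give it:

```python
# Register the PyIceberg table scan as a DuckDB view aliased "duckdb_table"
con = duckdb_tbl.scan().to_duckdb(table_name="duckdb_table")

# Query it back with plain SQL
con.sql("SELECT count(*) AS row_count FROM duckdb_table").show()
```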
And that’s it. Pretty straightforward.
Now Let’s Try with Polars
Reading Iceberg tables back with Polars is even easier than with DuckDB, IMO. Below is the code to do so:
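A sketch assuming the `polars_tbl` object from earlier; pl.scan_iceberg accepts the PyIceberg table object directly and returns a lazy frame:

```python
import polars as pl

# Scan the Iceberg table lazily, then collect it into a Polars DataFrame
lazy_df = pl.scan_iceberg(polars_tbl)
print(lazy_df.collect().head())
```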
That’s literally it. Can’t get much easier than that.
Conclusion
This post demonstrated how to use the PyIceberg API to read and write Iceberg tables. My honest assessment is that this strategy can be fine for some smaller workloads and datasets, but when you compare PyIceberg to the Delta Lake API, Delta Lake is further ahead. Polars has a built-in writer for Delta Lake, and it supports the merge construct; currently, Polars and DuckDB do not support write operations to Iceberg tables, nor does PyIceberg support the merge command. I have no doubt these items are coming in the near future, though, which will be nice and finally give us a simple alternative to Spark when we want to create and manage Iceberg tables.
Here’s a link to the code set we walked through today: the code
Thanks for reading,
Matt