Apache Iceberg and Databricks Delta Lake are the two table formats dominating the open table format wars for the Lakehouse. Both boast that they are “open”: you own the data, and if you ever want to move from one cloud provider or storage medium to another, you can simply wave a magic wand 🪄, copy your files from one bucket to the other, and everything should “just work” in the new cloud, kind of like the famous Steve Jobs saying:
Well, I’d say let’s actually put this theory to the test and see if these formats are as portable as they say they are.
The Setup
For our setup, we will do the following:
Generate both a Delta Lake table and an Iceberg table locally (on-prem baby!)
Copy each table’s warehouse folder, which includes both the data files and the metadata files, up to GCS
Create a BigQuery external table on top of each one and see if BigQuery can actually read them
Creating the Tables
For this exercise, you will see shades of my prior post, where I built a lakehouse in GCS using just Polars and Delta Lake. That post can be found here: GCS Lakehouse
First, I’ll generate a DataFrame with some dummy data in Polars using this Python function:
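Something along these lines does the trick (the columns and row count here are arbitrary dummy choices):

```python
import numpy as np
import polars as pl


def generate_dummy_df(num_rows: int = 1_000) -> pl.DataFrame:
    """Build a small DataFrame of fake transactional data."""
    rng = np.random.default_rng(42)
    return pl.DataFrame(
        {
            "id": list(range(num_rows)),
            "category": rng.choice(["a", "b", "c"], size=num_rows).tolist(),
            "amount": rng.uniform(1.0, 500.0, size=num_rows).round(2).tolist(),
        }
    )


df = generate_dummy_df()
print(df.head())
```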
Next, I’ll save it locally to Delta Lake as follows:
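Polars makes this a one-liner via `write_delta` (the “delta_warehouse/my_table” path is just the local folder and table name I chose):

```python
# Writes the DataFrame out as a Delta Lake table under ./delta_warehouse/my_table
df.write_delta("delta_warehouse/my_table", mode="overwrite")
```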
Talk about an easy button. Delta Lake’s built-in, first-class support in Polars is a nice touch. Iceberg requires a little more elbow grease at the moment. Here’s what we have to do to save the same table locally via Iceberg:
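Roughly, the steps are: stand up a catalog (a throwaway SQLite-backed one works fine locally), create the table from the DataFrame’s Arrow schema, and append the data. The namespace and table name below (“default.my_table”) are just my placeholder choices:

```python
import os

from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = os.path.abspath("iceberg_warehouse")
os.makedirs(warehouse_path, exist_ok=True)

# Iceberg always wants a catalog, so we spin up a local SQLite-backed one
catalog = SqlCatalog(
    "local",
    uri=f"sqlite:///{warehouse_path}/catalog.db",
    warehouse=f"file://{warehouse_path}",
)
catalog.create_namespace("default")

# Polars -> Arrow, which pyiceberg understands
arrow_table = df.to_arrow()
iceberg_table = catalog.create_table("default.my_table", schema=arrow_table.schema)
iceberg_table.append(arrow_table)
```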
Alright, let’s take a peek at the file explorer to see what we are working with:
Deltalake Files
Iceberg Files
Next, I will copy both the “delta_warehouse” and “iceberg_warehouse” folders up to GCS. I did this part on the web console manually…did not feel like writing code for this one 😁:
Alright, so at this point, we have created our datasets locally for both Delta Lake and Iceberg, and we have copied the files up to GCS. Surely, since both of these platforms tout that their table formats are “open” and that I own the data, I should in fact have the freedom to move that data wherever I want, whenever I want.
Creating the BigQuery External Tables
First, I’ll create the BQ external table on the Delta Lake folder and attempt to query it as follows:
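The DDL looked roughly like this (here wrapped in the Python BigQuery client; the project, dataset, BigLake connection, and bucket names are all placeholders you’d swap for your own):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

delta_ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my_dataset.delta_table`
WITH CONNECTION `us.my_biglake_connection`
OPTIONS (
  format = 'DELTA_LAKE',
  uris = ['gs://my-bucket/delta_warehouse/my_table']
);
"""
client.query(delta_ddl).result()  # create the external table

# Sanity-check query
for row in client.query("SELECT * FROM `my_dataset.delta_table` LIMIT 5").result():
    print(dict(row))
```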
Alright! Worked on the first try and was pretty easy.
Next up, we will try it with Iceberg:
Looks like we have encountered our first problem. Reading into this more, it turns out that metastores like BigQuery, AWS Glue, etc. need the path to the most current metadata file for an Iceberg table. Luckily, the pyiceberg API makes retrieving the current metadata path easy:
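Assuming the same local catalog object from earlier (re-instantiate it the same way if not), it’s one attribute lookup:

```python
# Load the table back from the catalog and grab its current metadata file
iceberg_table = catalog.load_table("default.my_table")
print(iceberg_table.metadata_location)
# e.g. file:///.../iceberg_warehouse/default.db/my_table/metadata/00001-<uuid>.metadata.json
```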
Ok, so now that we have the path, let’s give it one more try:
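That local path tells us which metadata JSON is the current one; the DDL then points at the copy of that same file sitting in the bucket (again, the dataset, connection, bucket, and exact metadata file name below are placeholders):

```python
iceberg_ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my_dataset.iceberg_table`
WITH CONNECTION `us.my_biglake_connection`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-bucket/iceberg_warehouse/default.db/my_table/metadata/00001-xxxx.metadata.json']
);
"""
client.query(iceberg_ddl).result()
```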
Alright, that took, and BigQuery said the table was created. Let’s go run a test query on it:
Wow…we got an error saying “Invalid uri in load data options”…and herein lies the core problem:
Iceberg hardcodes absolute data file paths in its metadata files
This took me a minute to noodle through, but when you actually crack open the JSON metadata file that Iceberg produces for its tables, this is what we see, and it explains why BigQuery is having issues with the URI as it tries to traverse the Iceberg files:
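Here is a trimmed-down, illustrative version of that metadata file (paths shortened and IDs made up, but the shape is the same):

```json
{
  "format-version": 2,
  "table-uuid": "3f8b6a3e-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "location": "file:///Users/matt/dev/iceberg_warehouse/default.db/my_table",
  "snapshots": [
    {
      "snapshot-id": 123456789,
      "manifest-list": "file:///Users/matt/dev/iceberg_warehouse/default.db/my_table/metadata/snap-123456789-xxxx.avro"
    }
  ]
}
```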
Notice the “location” key near the top and the path it holds: it references my original local path. Iceberg, as of today, cannot work with relative paths; it needs absolute paths in order to function, which is a major problem IMO.
So how do we go about fixing this? Let’s try to recreate the Iceberg table, but in our catalog configuration, specify our GCS bucket as the warehouse instead of our local file system:
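Assuming pyiceberg’s GCS support (the gcsfs extra) is installed and your default Google credentials can write to the bucket, the only real change from before is the warehouse property (the bucket name is a placeholder, and depending on your setup you may also need to pass gcs.* properties to the catalog):

```python
from pyiceberg.catalog.sql import SqlCatalog

# Same throwaway SQLite catalog, but the warehouse now lives in the bucket,
# so every path Iceberg bakes into its metadata is a gs:// URI
gcs_catalog = SqlCatalog(
    "gcs",
    uri="sqlite:///gcs_catalog.db",
    warehouse="gs://my-bucket/iceberg_warehouse",
)
gcs_catalog.create_namespace("default")

arrow_table = df.to_arrow()
table = gcs_catalog.create_table("default.my_table", schema=arrow_table.schema)
table.append(arrow_table)

# Now a gs://... metadata.json path we can hand straight to BigQuery
print(table.metadata_location)
```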
And now, let’s head back over to BigQuery and create our external table again and attempt to query it again:
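Same shape of DDL as before; the only difference is that the metadata URI printed above now points at a metadata file that was written to GCS from the start (names still placeholders):

```python
iceberg_ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my_dataset.iceberg_table`
WITH CONNECTION `us.my_biglake_connection`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-bucket/iceberg_warehouse/default.db/my_table/metadata/00000-yyyy.metadata.json']
);
"""
client.query(iceberg_ddl).result()

for row in client.query("SELECT COUNT(*) AS n FROM `my_dataset.iceberg_table`").result():
    print(row["n"])
```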
And voila.
The Core Problem
The core problem with Iceberg, IMO, is that its architecture requires absolute paths to tie the metadata files to the actual data files. As we saw earlier, Delta Lake does not suffer from this issue: because its transaction log records data file paths relative to the table root, we were able to create the Delta table locally, copy it into GCS, and load it into BigQuery with zero issues.
I think what this really boils down to is the fact that, out of the gate, Iceberg is required to be hinged to a catalog. I’ve gone back and forth in my head many times on whether this is a good thing or a bad thing; I can find pros and cons on both sides of the coin. I have no doubt the architects of Iceberg are way smarter than I am and had very good, justified reasons for what they did, but placing absolute paths in the metadata makes portability a very hard thing to solve. If you ever wanted to migrate catalogs from on-prem or from one cloud provider to another, you would have to go through a significant number of steps to doctor all the metadata files and swap the current warehouse root for the new one. I think a simple fix would be for Iceberg to keep a single file, say “warehouse_stuff.json”, at the top level that provides the warehouse root folder; all subsequent metadata files could then reference that top-level file to construct the URIs used to query the data. That way, if you ever had to migrate from one object store to another, you’d only have to change one file.
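Purely as an illustration of that idea (nothing like this exists in Iceberg today, and the key name is made up), the warehouse_stuff.json file could be as simple as:

```json
{
  "warehouse-root": "gs://my-bucket/iceberg_warehouse"
}
```

All the other metadata files would then only ever store paths relative to that root (e.g. default.db/my_table/data/part-00000.parquet), so moving the warehouse would mean rewriting exactly one value.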
I did some more googling on this, and it appears that some companies already offer tooling to migrate Iceberg catalogs. This one from Dremio is one I came across:
Dremio Iceberg Catalog Migrator
Conclusion
Iceberg’s metadata files have a problem that needs to be fixed. I’m not sure if the roadmap has plans to move the metadata files from absolute paths to relative ones, but until that happens, good luck if you ever want to migrate your Iceberg tables from one storage medium to another.
And who knows, maybe there is some other pyiceberg function lurking out there that I have not seen that does this fix for us.
But this exercise reminds me of a quote I once heard about your data and the cloud (I can’t remember who said it, but it’s catchy):
“The Cloud is like Hotel California. Your data can check in anytime you’d like, but it can never leave” ~ No Idea Who Said It
Link to the code: Delta IcyStuff
Thanks for reading,
Matt
If anyone would like to join the ongoing discussion on this topic in the Apache Iceberg project: https://github.com/apache/iceberg/issues/1617