Apache Iceberg and Delta Lake have become the new standards for data engineering and lakehouses, whether you like to admit it or not. They bring the flexibility of data lakes along with the speed and efficiency that traditional data warehouses spent decades building and refining. With that said, one of the most hyped features of these table formats is “time travel”, i.e. being able to recall records as of a certain point in time. But for all this hype (and I might be in the minority here), I have found only a single valid use case for it. If you’d like to understand my thinking on this subject further, read on…otherwise go back to mindless scrolling on LinkedIn.
First, The Semantics
For this article, I’ll be using Iceberg as the guinea pig for the time travel examples. With that said, let’s first create a simple Iceberg table and demonstrate how one performs time travel on it. For this tutorial, I’ll use Spark. You can always check the code link at the bottom or look at my prior articles to understand the dozen or so Spark configs I have to put in to create the session. I’m going to skip that here and assume we already have our SparkSession (spark) up and running.
Let’s create a table with a couple records that represent stock symbols and prices:
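A minimal sketch of this step, assuming an Iceberg catalog registered as `local` and a hypothetical table name `local.db.stock_prices`. The column names, the `DECIMAL(10, 2)` type, and the AAPL row are my assumptions for illustration; only the MSFT starting price of $23.99 comes from this walkthrough.

```python
TABLE = "local.db.stock_prices"  # hypothetical catalog.schema.table name


def create_stock_table(spark, table=TABLE):
    """Create a small Iceberg table and load two starting rows.

    `spark` is assumed to be a SparkSession already configured for Iceberg.
    """
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            ticker_symbol STRING,
            price         DECIMAL(10, 2)
        ) USING iceberg
    """)
    # The MSFT price matches the article; the AAPL row is made up.
    spark.sql(f"""
        INSERT INTO {table} VALUES
            ('MSFT', 23.99),
            ('AAPL', 150.00)
    """)
```

Calling `create_stock_table(spark)` once leaves the table with a single snapshot, which is the starting point for everything that follows.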
Alright, next let’s create a dataframe with an updated price for Microsoft and then merge it into the Iceberg table so that we get a change in our record:
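One way this could look, assuming the same hypothetical `local.db.stock_prices` table with `ticker_symbol` and `price` columns, and assuming the Iceberg Spark SQL extensions are enabled (MERGE INTO needs them). The temp view name `price_updates` is just an illustrative choice.

```python
TABLE = "local.db.stock_prices"  # hypothetical table name


def merge_price_update(spark, rows, table=TABLE):
    """Upsert (ticker_symbol, price) rows into the Iceberg table.

    `rows` is a list of (str, decimal.Decimal) tuples.
    """
    updates = spark.createDataFrame(
        rows, schema="ticker_symbol STRING, price DECIMAL(10, 2)"
    )
    # Expose the dataframe to SQL so MERGE INTO can use it as the source.
    updates.createOrReplaceTempView("price_updates")
    spark.sql(f"""
        MERGE INTO {table} AS t
        USING price_updates AS u
        ON t.ticker_symbol = u.ticker_symbol
        WHEN MATCHED THEN UPDATE SET t.price = u.price
        WHEN NOT MATCHED THEN
            INSERT (ticker_symbol, price) VALUES (u.ticker_symbol, u.price)
    """)
```

For the MSFT change in this article, the call would be `merge_price_update(spark, [("MSFT", Decimal("21.45"))])`, with `Decimal` imported from Python's `decimal` module. Each merge commits a new snapshot, which is what time travel later reads from.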
And let’s take a peek:
Alright, so when I first created this table, the price for Microsoft (MSFT) was $23.99; then I made an update to the price and it’s now $21.45.
So How Does One Recall History, a.k.a. “Time Travel”?
To do that in Iceberg, you have a couple of options. You can either recall a table as of a specific “version” (a snapshot ID) or, in my opinion the more practical route, recall it as of a specific timestamp. Before we can actually query as of a specific timestamp though, we have to look at the internal Iceberg history table; otherwise, we will get an error if we put in a timestamp that predates the table’s first snapshot:
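A small sketch of that lookup, again assuming the hypothetical `local.db.stock_prices` table. Iceberg exposes the snapshot log as a `history` metadata table that you query by appending `.history` to the table name.

```python
TABLE = "local.db.stock_prices"  # hypothetical table name


def snapshot_history(spark, table=TABLE):
    """Return the table's snapshot log, oldest entry first."""
    return spark.sql(f"""
        SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
        FROM {table}.history
        ORDER BY made_current_at
    """)
```

`snapshot_history(spark).show(truncate=False)` prints one row per snapshot; after the create and the merge there should be two.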
Alright, so we see two entries in this history table; the column I’m interested in is “made_current_at”. I’ll be using that to construct my audit history of the stock prices for MSFT.
Here’s how we can look at the price of MSFT as of specific timestamps:
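A sketch of both flavors, assuming Spark 3.3 or later (where the `TIMESTAMP AS OF` and `VERSION AS OF` clauses are available) and the same hypothetical table. The function names are mine, and the values are interpolated straight into the SQL string, so treat this as something for trusted, hand-typed input only.

```python
TABLE = "local.db.stock_prices"  # hypothetical table name


def price_as_of(spark, ticker, as_of, table=TABLE):
    """Read one ticker's row as the table stood at timestamp `as_of`."""
    return spark.sql(f"""
        SELECT ticker_symbol, price
        FROM {table} TIMESTAMP AS OF '{as_of}'
        WHERE ticker_symbol = '{ticker}'
    """)


def price_at_snapshot(spark, ticker, snapshot_id, table=TABLE):
    """Read one ticker's row from a specific snapshot ID."""
    return spark.sql(f"""
        SELECT ticker_symbol, price
        FROM {table} VERSION AS OF {snapshot_id}
        WHERE ticker_symbol = '{ticker}'
    """)
```

In practice you would copy a `made_current_at` value (or a `snapshot_id`) out of the history table and pass it in, for example `price_as_of(spark, "MSFT", "<made_current_at value>").show()`, once per point in time you care about.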
Even though this is pretty straightforward, you have to concoct a query for each individual entry. I don’t know about you, but I’d rather be able to call the entire history of a record in a table in a single shot. I’m not sure if I’m just holding it wrong, but according to Iceberg’s docs, that is currently not possible. Instead, you have to provide a specific timestamp or version for each instance where you want to recall history.
In this case, any reasonable data engineer who wanted to provide a historical dataset to a business partner would instead just architect the table with ticker_symbol and a load_timestamp as the primary key. That way, the business partner doesn’t have to fiddle with this “TIMESTAMP AS OF” stuff and go hunting around the Iceberg internal history table.
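To make that alternative concrete, here is a rough sketch of an append-only design. The table name `local.db.stock_price_history` and the helper names are hypothetical, and keep in mind that Iceberg does not enforce primary keys, so the (ticker_symbol, load_timestamp) key is a convention the loading code has to respect.

```python
HISTORY_TABLE = "local.db.stock_price_history"  # hypothetical table name


def create_history_table(spark, table=HISTORY_TABLE):
    """Create a table that keeps every price ever loaded."""
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table} (
            ticker_symbol  STRING,
            price          DECIMAL(10, 2),
            load_timestamp TIMESTAMP
        ) USING iceberg
    """)


def record_price(spark, ticker, price, table=HISTORY_TABLE):
    """Append a new price row instead of overwriting the old one."""
    spark.sql(
        f"INSERT INTO {table} VALUES ('{ticker}', {price}, current_timestamp())"
    )


def full_price_history(spark, ticker, table=HISTORY_TABLE):
    """Return every recorded price for a ticker, oldest first."""
    return spark.sql(f"""
        SELECT ticker_symbol, price, load_timestamp
        FROM {table}
        WHERE ticker_symbol = '{ticker}'
        ORDER BY load_timestamp
    """)
```

With this layout the whole MSFT history is one ordinary query, `full_price_history(spark, "MSFT")`, and nobody downstream needs to know the table is Iceberg at all.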
But now, we get to the actual situation where I see Iceberg’s time travel coming in handy…
Accidentally Overwriting A Table
We’ve all been there…or at least I’m willing to admit it 😁. You thought you were in dev, you ran an update, and you hosed a production table. So how does one quickly restore it to an “as-was” state? This is where I see Iceberg’s time travel really delivering bang for the buck. And here’s how simple it is:
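A sketch of the restore using Iceberg's built-in Spark procedures `rollback_to_timestamp` and `rollback_to_snapshot`. This assumes the Iceberg SQL extensions are enabled, a catalog named `local`, and the hypothetical table `db.stock_prices`; which of the two procedures fits best depends on whether you have a timestamp or a snapshot ID in hand.

```python
CATALOG = "local"               # hypothetical catalog name
TABLE_IDENT = "db.stock_prices"  # hypothetical schema.table identifier


def rollback_to_timestamp(spark, as_of, catalog=CATALOG, table=TABLE_IDENT):
    """Make the snapshot that was current at `as_of` the current one again."""
    return spark.sql(
        f"CALL {catalog}.system.rollback_to_timestamp("
        f"'{table}', TIMESTAMP '{as_of}')"
    )


def rollback_to_snapshot(spark, snapshot_id, catalog=CATALOG, table=TABLE_IDENT):
    """Make a specific snapshot ID the current one again."""
    return spark.sql(
        f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})"
    )
```

Either call is a metadata-only operation: it repoints the table at an older snapshot rather than rewriting data, which is why the restore is so quick. The timestamp or snapshot ID to pass comes from the history table queried earlier.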
Well lookie right there; we were able to restore the table to “as-was” in a single shot. Let’s take a peek:
As you can see, the table’s price is reflecting what MSFT was at the beginning of this tutorial. I wish I had such simple syntax back in my SQL Server days. It would have saved me a lot of time, headaches, and bribing the DBAs to fix problems I created.
Conclusion
Is Iceberg “time travel” just marketing hype? Yes and no. I’ve provided examples above of how time travel works in Iceberg, as well as my opinion on when it really helps vs. when it just creates more confusion and chaos.
Thanks for reading,
Matt