If you have any curiosity of data engineering or data modeling and have done some decent research, you have probably ran across the term “Type 2 Slowly Changing Dimension” aka Type 2 SCD. But what exactly does that mean? And are there other types as well? One would think if there is a 2, then there would be a 1 and maybe a 3 or 4? And there are, but the most popular are type 1 and 2:
Type 1 SCD: Only the current state of a dimension is maintained in a table
Type 2 SCD: current state and historical state are maintained in the same table as separate records
I think the easiest way to explain a type 2 is with a real world example. On my first job, the minute I dug into the team’s SQL Server, I ran across a type 2 dimension table. The data I was working with was distribution center geographic and management information for big box retail. This type 2 SCD table provided both the distribution center’s immutable traits e.g. geographic location on planet earth, as well as the management structure (who ran the DC, who was the assistant GM, etc). Since humans tend to move about through their career, this was the piece that was considered the type 2 SCD. For reporting performance purposes, we had to maintain historical snapshots on who ran the building and what their effective start and end date were. In simplistic terms, it looked something like this:
As you can see in for DC_NBR 1234, we have 2 entries. The first one was managed by Joe Schmo and his tenure ended on 2/3/2010. The next GM (Jane Doe) is still managing the building. One nifty thing that you will notice at the end is a column called “IS_ACTIVE”. I think this should be a required column on any SCD type 2 table. Sure, one could argue “Hey I can just query where the effective end date = 9999-12-31 and be just fine”; yea, you can, but having that simple bool to provide the current state has come in handy in a lot of my work. Plus, most databases only take up a single bit for a true/false flag, so you are not really adding much to the storage, especially if the table is type 2. And on SQL Server for example, once you add 1 bool to a table, you technically can add an additional 7 bools to the same table and still take up the same space internally, since that would be 8 total bits = 1 byte. (flag columns galore 😁)
Doesn’t Iceberg Inherently do all this?
I’ve noticed in some of my discussions with people looking at Iceberg that they are getting confused with time travel and Type 2 SCD. They are not the same thing. Sure, Iceberg maintains a history of the data as you update it, but the way that data is organized and retrieved is not exactly the same as a classic Type 2 SCD. Just look at this prior article I wrote on Iceberg to see how one would go about retrieving a series of historical records (it ain’t pretty). You’ll notice below that if I want to see multiple different records in an Iceberg table at a different point in time, I’d have to write a single query for each one (yuk)...
With Type 2 SCD, the data is cleanly organized into a table that is queryable for all history so that you don’t have to fiddle with some internal history table to grab a timestamp of when a record was landed.
Additionally, even if one were to try and make the argument that you can in fact just use an iceberg history table to get point-in-time snapshots, the “made_current_at” internal iceberg column probably does not accurately portray from a business standpoint the actual effective begin/end dates for a particular segment of a business. Joe Schmo could have left managing DC 1234 on 2/3/2010, but the new record might not have dropped into the table for Jane Doe until 3 or 4 days later. Needless to say, if I need to maintain historical point-in-time reporting for a particular line of business, I’ll go with my Type 2 SCD.
More Pontificating
Type 2 SCD’s are a cornerstone for many core data models and things such as financial reporting to the street. Type 2 tables are what make audits somewhat bulletproof as long as you have proper change controls in place. And Iceberg’s time travel (although convenient to restore tables and/or a slice of records), IMO is a hot mess to work with when you actually want to apply business logic to it. It would be interesting one day if the Iceberg spec could update and allow us to query all history in a single shot and join it to an actual table instead of us having to write a separate query for each snapshot instance we want to retrieve. I tried at one point to join an iceberg’s internal history table with the actual data table based the internal column “made_current_at” and Iceberg blew up and said I was holding it wrong 😆.
Thanks for reading,
Matt