Throughout my career, I’ve worked with many analytical datasets that are replications from an operational data store (ODS), where I have to deal with multiple entries for the same event. Think of an order fulfillment system that has to keep track of order entries, the lifecycle of the order as it travels through the supply chain, and finally the order landing on the customer’s doorstep. In theory, when an operational system lands data in multiple tables, it will acquire a transaction lock and write the order entry plus the create timestamp to all needed tables from the onset. In reality, this is a mixed bag. Some developers think it’s better to lock just one table at a time rather than several, and operational events will sometimes create a slightly disjointed set of tables that require some elbow grease to make sense of.
As an example, let’s pretend we are a big box retailer that is fulfilling an online order for a customer. Here is how the series of events could look for the order:
Order is created
Order is packed and labeled
Order is shipped
Order is delivered
That right there is what we call the “happy path”. However, when fulfilling thousands of orders a day, you will inevitably run into what are known as exceptions. These can range from a label not being properly affixed to the package, to the package getting damaged in transit or in the warehouse, to the customer not being home; there are many upon many exceptions that can occur. And when these exceptions occur, it is up to the agent who identified them to add context about what happened so that managers can review it and determine whether these types of exceptions are a repeating pattern that requires more attention. So, let’s revisit our happy path above and throw some wrenches/exceptions into it:
Order is created
Order is packed and labeled
Exception occurs - package label is not readable
Order is re-labeled
Order is sent down conveyor
Exception occurs - package falls off conveyor
Order is picked up/resent down conveyor
Order is shipped
Order is delivered to the customer
In the example above, we had 2 exceptions occur on this order. The dear lord must have really had it in for the customer who wanted their hammer…or Ace Ventura had a hand in this one:
How does this data look in an Operational Data Store (ODS)?
In an ODS, you typically will have the following tables:
Order Header - contains information about when the order was placed, the order ID assigned, and who the order is being delivered to
Order Detail - contains the SKUs and quantities being ordered
Order Event - contains a series of event IDs attached to an order, with corresponding timestamps to track the lifecycle
Order Exception Details - contains information captured when an exception occurs, such as hand-typed descriptions of what went wrong
And many more tables…these are just a few of the usual suspects. So how are we going to build our test ODS for this exercise? DuckDB to the rescue!
Let’s Build our Database
Since I’m a glutton for punishment 😁, I’m going to do all the coding work in terminal. This is to showcase a few things:
With DuckDB, you can do a lot with just a terminal, since the CLI ships with batteries included
The DuckDB CLI combined with a “newish” terminal UI known as Harlequin is a mean combo, which I’ll get into as we progress.
Side note - I found out about Harlequin a few weeks ago and it is awesome
To load our data, we will run this command in the terminal. It will create a database called “order_repo.db” and load some dummy data into it. Let’s rock:
duckdb "order_repo.db" -c "$(cat order_data.sql)"
The SQL text getting executed (order_data.sql) can be found here: order data
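For a rough sense of what that script sets up before you open the link, here is the shape of the three tables (the exact DDL and dummy rows are in order_data.sql; the header table name and columns like except_desc are my best guesses based on what follows):

```sql
CREATE TABLE order_hdr       (order_id INTEGER, customer_name VARCHAR, order_ts TIMESTAMP);
CREATE TABLE order_event_dtl (order_id INTEGER, event_type_cd INTEGER, event_desc VARCHAR, event_ts TIMESTAMP);
CREATE TABLE order_event_ex  (order_id INTEGER, except_ts TIMESTAMP, except_desc VARCHAR);
-- ...plus INSERT statements that load the dummy order, event, and exception rows
```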
Once this command is run, you will see a new file called “order_repo.db” pop up in your file browser. Congrats, you have made a persisted DuckDB database and loaded some data into it 😁.
Now that we have our database up and running, let’s issue some queries against it in the terminal…ha, just kidding; instead, I’ll use this pretty sick editor that runs over the terminal called Harlequin. The link to this editor’s docs can be found here: Harlequin docs
So I’ll fire up Harlequin and bootstrap it to our newly created DuckDB database with the following command in the terminal:
harlequin "order_repo.db"
What you should see is a screen like this one, which shows us the 3 tables we created as well as the schemas for each:
That is pretty dang slick…and to think, this thing just lives in the terminal…so it’s very, very fast to load and doesn’t come with all the baggage of an IDE like Visual Studio Code.
Now the Fun Begins
Alright, we have our ODS database, we have our “IDE”; so what analytical question do we want to answer?
Question: Show all orders that had exceptions occur, along with the corresponding exception descriptions.
This seems like a pretty straightforward question; we have our order header table, we have our order event table, and we have what looks to be an order event exception table (order_event_ex). We just need to join it all together to make sense of it, so let’s start by finding an order that had some exceptions. To do this, I’ll scan the order event table. When browsing that table, it looks like exceptions carry an event_type of 99, as seen in the screenshot below:
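If you want to type along in Harlequin, that scan is essentially just a select over the event table (the exact column names come from order_data.sql; event_desc is my shorthand for the event description column):

```sql
-- Browse the order event table; exception rows carry event_type_cd = 99
SELECT order_id, event_type_cd, event_desc, event_ts
FROM order_event_dtl
ORDER BY order_id, event_ts;
```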
Alright, I think I’m going to drill into this further and focus on order_id 3, since that one has a couple of different exceptions. Let’s take a look at that specific order and craft our query against it; once we get it right, we can expand it later. Here’s an updated screenshot of just order 3:
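That is the same scan with a filter added:

```sql
SELECT order_id, event_type_cd, event_desc, event_ts
FROM order_event_dtl
WHERE order_id = 3
ORDER BY event_ts;
```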
Ok, so now that we have filtered for this one order, I can see that an exception occurred right after the order shipped, and again after the conveyor was loaded. I’m also starting to see a problem; the exception code is not unique for each occurrence; each carries an event_type of 99. Let’s kick it over to the order_event_ex table to see if we can make more sense of this stuff.
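A quick peek at that table (again, except_desc is my placeholder name for the free-form text column):

```sql
SELECT order_id, except_ts, except_desc
FROM order_event_ex
WHERE order_id = 3
ORDER BY except_ts;
```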
…and yikes! Ok, so our order_event_ex table does contain the human-readable text of what went wrong, but how the heck am I supposed to join that back to the event table and ensure the first exception lines up with “order lost - creating new one” and the second exception lines up with “package fell off conveyor”?
I could try joining on both the order_id and event_ts = except_ts…but I’m noticing that the event_ts for exceptions appears to land a few minutes before the actual logging of the except_ts…why is this happening?
This is a classic example of why you, as a data engineer, need to have a good understanding of the operations
To progress well in your career as a data engineer, IMO you will want to make sure you have a good understanding of how the operations work and make friends with your business partners. It’s one thing to query data, it’s another to understand why you query data a certain way. In our example above, let’s think about what’s happening operationally.
Let’s pretend Ted is a worker on the floor and sees that a package fell off the conveyor. Ted will quickly stop the conveyor line, which will log an exception of event_type 99 in the order_event_dtl table; but once that happens, Ted needs to walk over to the package, pick it up, make sure it’s not damaged, walk it back over to the conveyor start line, and resend the package down. Then, after that, Ted needs to walk to a terminal and log the exception context into a free-form text field that describes what actually happened, which was “package fell off conveyor”. Several minutes will have passed between the initial exception getting flagged and Ted typing in what went wrong. This is why we are seeing a lag on the timestamps.
One could curse the darkness and say “this is such a bad data modeling design…blah blah blah”, but we have work to do here as the analyst. We were not in charge of selecting the ODS system; we are here to work with the data and make some sense out of it, even if it’s not our ideal nirvana rainbow rocket-powered unicorn system.
Thus, back to our original problem: How are we going to show order events with all exception information captured in a single query when the timestamps don’t line up?
WINDOW FUNCTIONS to the rescue!
Let’s figure this out and make our boss happy
To work through this, I will draft the query and then modify it in chunks so you can follow along and see how to break this thing down. First, let’s keep it simple by focusing on order 3 and event_type_cd 99 (exceptions) and joining to the exception detail table. This looks as follows:
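The first pass is just a plain join on order_id (column names are the ones from our earlier browsing, with except_desc standing in for the free-form text column):

```sql
SELECT e.order_id, e.event_ts, x.except_ts, x.except_desc
FROM order_event_dtl e
JOIN order_event_ex  x ON x.order_id = e.order_id
WHERE e.order_id = 3
  AND e.event_type_cd = 99;
```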
Alright, at least we have our data, but as we can see, it is more or less cross-applying both exceptions to both timestamps from our event detail table. So, how can we clean this up with a window function?
We know that the exception entry occurs within a few minutes of the actual incident occurring; thus, we can establish a business rule in our query: rank the candidate matches and choose the one with the smallest gap between the event timestamp and the exception timestamp.
What does this all look like with SQL? Here’s an updated screenshot:
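In query form, this first attempt at the rule looks roughly like the following, using DuckDB’s QUALIFY clause to filter on the window function directly:

```sql
SELECT e.order_id, e.event_ts, x.except_ts, x.except_desc,
       date_diff('second', e.event_ts, x.except_ts) AS seconds_between
FROM order_event_dtl e
JOIN order_event_ex  x ON x.order_id = e.order_id
WHERE e.order_id = 3
  AND e.event_type_cd = 99
QUALIFY rank() OVER (
          PARTITION BY e.order_id, e.event_ts
          ORDER BY date_diff('second', e.event_ts, x.except_ts)
        ) = 1;
```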
Well…not quite right. Now that we have implemented our fancy window function via the QUALIFY clause, both exceptions are gravitating towards the first exception entry, which would make sense, because the rank function is choosing the value with the lowest datetime difference in seconds…even if it’s negative…wait a second. We know from our business rules that the actual entry of the exception occurs a few minutes after the exception is first logged. Thus, let’s update our query’s WHERE clause to solidify that business rule:
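The fix is one extra predicate: the exception write-up can only come at or after the event that triggered it.

```sql
SELECT e.order_id, e.event_ts, x.except_ts, x.except_desc
FROM order_event_dtl e
JOIN order_event_ex  x ON x.order_id = e.order_id
WHERE e.order_id = 3
  AND e.event_type_cd = 99
  AND x.except_ts >= e.event_ts   -- business rule: the write-up happens after the event
QUALIFY rank() OVER (
          PARTITION BY e.order_id, e.event_ts
          ORDER BY date_diff('second', e.event_ts, x.except_ts)
        ) = 1;
```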
And bingo! Now we have correctly built this query to handle joining 2 tables with disjointed timestamps to grab exceptions, and we can now scale it. Let’s remove the filter for the order_id and see what else shows up:
Beautiful! Now that we have commented out our order_id criteria in the WHERE clause, we can also see order_id 1, which had an exception show up. So let’s wrap this exercise up so we can get out to lunch on time and come back to our desks with a slew of additional questions in our inboxes, since we have just proven to our boss that we know a thing or 2 about navigating complex datasets:
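For reference, this final version is the same as the previous query with the order filter commented out (and an ORDER BY tacked on so the output reads cleanly):

```sql
SELECT e.order_id, e.event_ts, x.except_ts, x.except_desc
FROM order_event_dtl e
JOIN order_event_ex  x ON x.order_id = e.order_id
WHERE 1 = 1
  -- AND e.order_id = 3
  AND e.event_type_cd = 99
  AND x.except_ts >= e.event_ts
QUALIFY rank() OVER (
          PARTITION BY e.order_id, e.event_ts
          ORDER BY date_diff('second', e.event_ts, x.except_ts)
        ) = 1
ORDER BY e.order_id, e.event_ts;
```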
And as you can see in the final query, we are showing all orders that experienced an exception as well as the corresponding exception text. Bravo👏.
Conclusion
I just walked you through what I spent many of my days as an analyst having to do when working with data. In many situations, you don’t get a clear-cut way to join the tables and will have to get creative. But understanding the business process has served me well over the years and allowed me to excel at providing decision makers with accurate data that follows the way operations work. That, IMO, becomes an incredible skillset that most organizations would pay good $$$$ for. I hope you got some good insights from this exercise and can apply similar concepts going forward. Our final query is at the link below.
Other Thoughts
The SQL QUALIFY clause makes applying window function filters very easy and clean. If your RDBMS does not support the QUALIFY clause, flood that vendor’s suggestion inbox to get it added 😁. The first time I saw QUALIFY was when I worked with Teradata. Now, I see it on nearly every platform, with the pesky exception of SQL Server, which, for the life of me, I cannot figure out why Microsoft has not implemented.
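For anyone stuck on an engine without QUALIFY, the equivalent pattern is a CTE (or derived table) plus a WHERE on the rank, since window functions can’t appear directly in a WHERE clause:

```sql
WITH ranked AS (
    SELECT e.order_id, e.event_ts, x.except_ts, x.except_desc,
           rank() OVER (
               PARTITION BY e.order_id, e.event_ts
               ORDER BY date_diff('second', e.event_ts, x.except_ts)
           ) AS rnk
    FROM order_event_dtl e
    JOIN order_event_ex  x ON x.order_id = e.order_id
    WHERE e.event_type_cd = 99
      AND x.except_ts >= e.event_ts
)
SELECT order_id, event_ts, except_ts, except_desc
FROM ranked
WHERE rnk = 1;
-- date_diff is DuckDB syntax; swap in your engine's equivalent, e.g. DATEDIFF(second, ...) on SQL Server
```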
I know this post was lengthy, but I think it’s very valuable to show the reader how to start with a bare-bones query and continuously improve it to reach the goal. Just throwing out the final query and saying “here you go” without context does not help train the mind on how to break these complex problems down into their natural building blocks. People can get lost very quickly by trying to craft a “final” query from scratch vs. stepping through it and progressing towards the goal.
The Harlequin UI over the terminal is AMAZING. It is probably going to become my daily driver for DuckDB work. I used to do a lot of DuckDB work in Python, but if I’m doing pure SQL, then I’ll probably move towards this setup of just some SQL files along with Harlequin to do the work.
Thanks for reading,
Matt
Great article! I recently discovered ASOF joins for this kind of analysis. Under the hood, ASOF joins might very well expand to the same thing as your final query.
https://duckdb.org/docs/guides/sql_features/asof_join.html
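For the curious, an ASOF take on the same pairing might look roughly like this, matching each free-form exception entry to the nearest preceding exception event (a sketch under the post’s assumed column names, not a drop-in replacement for the final query):

```sql
-- Each exception write-up is matched to the latest exception event
-- at or before its timestamp.
SELECT e.order_id, e.event_ts, x.except_ts, x.except_desc
FROM order_event_ex x
ASOF JOIN (
    SELECT order_id, event_ts
    FROM order_event_dtl
    WHERE event_type_cd = 99
) e
  ON x.order_id = e.order_id
 AND x.except_ts >= e.event_ts;
```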