DuckDB is an amazing, compact, and easily portable analytics database that has come a long way over the last few years. One thing I love about the platform is that it is self-contained, which means you can download the CLI and just run it with ZERO concern about other library dependencies. This makes the install brain-dead easy and lets you hit the ground running.
But besides the portability aspect, something that is starting to set DuckDB apart from its competitors is how much investment the developers have put into its SQL syntax. Every month or so, DuckDB enhances the syntax further with very nice, user-friendly features that help you cut through the complexity of your queries and work through your analytics faster. In this post, I will go through just a handful of DuckDB's syntactic sugar that I feel sets it apart from the competition.
Union All By Name
I was not aware of this nice feature until about a month ago when Erik Darling brought it to my attention, but I immediately recognized its value. Over my career, I have encountered many situations where I need to merge data from a legacy system and the modern/“current” system. What usually makes these tasks challenging is that the modern/current system has more columns than the legacy system, thanks to improvements and changes in operations over the years. So how do you merge all that data?
Before DuckDB, this meant a long-winded UNION ALL where you had to explicitly write out all the columns from the modern system, write out the columns from the legacy system, and put in NULL fillers for the columns that did not exist back then, e.g.:
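Something along these lines (a sketch using the same column names as the sample tables we're about to build):

```sql
-- The old-school way: spell out every column from the modern table, then
-- repeat the list for the legacy table and plug the gaps with NULLs
SELECT order_id, customer_id, order_date, sales_channel, order_total, order_cost
FROM order_hdr
UNION ALL
SELECT order_id, customer_id, order_date, NULL AS sales_channel, order_total, NULL AS order_cost
FROM order_hdr_legacy;
```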
But now with DuckDB’s “UNION ALL BY NAME”, we can have DuckDB do all the heavy lifting. To illustrate this in action, let’s create a couple tables:
order_hdr - this is the modern order table
order_hdr_legacy - this is our pretend legacy system, where it won’t contain all the same columns as the modern system
Here’s order_hdr
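A minimal stand-in will do (the columns and sample rows are just illustrative):

```sql
-- The "modern" order table
CREATE OR REPLACE TABLE order_hdr (
    order_id      INTEGER,
    customer_id   INTEGER,
    order_date    DATE,
    sales_channel VARCHAR,        -- newer column the legacy system never had
    order_total   DECIMAL(10, 2),
    order_cost    DECIMAL(10, 2)  -- another column the legacy system lacks
);

INSERT INTO order_hdr VALUES
    (1001, 1, DATE '2024-01-15', 'web',    250.00, 150.00),
    (1002, 2, DATE '2024-02-03', 'mobile', 125.50,  80.00),
    (1003, 1, DATE '2024-02-20', 'web',     99.99,  60.00);
```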
And here’s order_hdr_legacy
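Same idea, minus the newer columns:

```sql
-- The pretend legacy order table
CREATE OR REPLACE TABLE order_hdr_legacy (
    order_id    INTEGER,
    customer_id INTEGER,
    order_date  DATE,
    order_total DECIMAL(10, 2)
);

INSERT INTO order_hdr_legacy VALUES
    (1, 1, DATE '2019-06-21', 310.00),
    (2, 3, DATE '2020-11-02',  89.99);
```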
Combining the two tables is as simple as this:
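Using the sketch tables above:

```sql
-- DuckDB matches the columns by name and fills anything missing with NULLs
SELECT *
FROM order_hdr
UNION ALL BY NAME
SELECT *
FROM order_hdr_legacy;
```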
And there ya go. It doesn’t get much easier than that.
Lateral Calculations
Over the course of my career, when I've had to build analytical datasets that derive a calculated value and then do more calculations off of that value, I've had to either create the base calculation in a subquery and reference it from the outer query…or repeat the same base calculation over and over again. As an example, if we wanted to calculate gross profit and gross margin on our order table, in a legacy SQL system we would have to do this:
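Something like this (using the order_total and order_cost columns from the sample table as stand-ins for revenue and cost):

```sql
-- Gross profit has to be written out twice: once on its own, and again
-- inside the gross margin expression
SELECT
    order_id,
    order_total,
    order_cost,
    order_total - order_cost                 AS gross_profit,
    (order_total - order_cost) / order_total AS gross_margin
FROM order_hdr;
```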
Notice how I'm having to perform the gross profit calculation twice. This can lead to user error, tedious updates when the calculation changes, etc… Sure, you could put the calculation in a subquery or CTE first and then deal with multiple query sections. But DuckDB allows what is called a lateral calculation, where you only have to perform the calculation once and can then reference it throughout the same SELECT statement, as follows:
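With the same sample order table, the math only gets written once:

```sql
-- gross_profit is defined once and then referenced by gross_margin
-- in the very same SELECT list
SELECT
    order_id,
    order_total,
    order_cost,
    order_total - order_cost   AS gross_profit,
    gross_profit / order_total AS gross_margin
FROM order_hdr;
```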
Pretty nice, eh?
Dynamic Pivots
If you have ever spent any amount of time in analytics, you will inevitably run across a scenario where you need to build a pivot table on a massive dataset. At that point, Excel just doesn't cut it (sorry, Josue Bogran), so you go to SQL. Dynamic pivots are nothing new here, but DuckDB has made them incredibly easy. Prior to DuckDB's implementation, when I wanted to do dynamic pivots in SQL Server, for instance, I had to use a combination of sp_executesql, the STUFF function, and FOR XML PATH to build the column list. It looked something like this:
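Something along these lines, sketched against the order_hdr sample instead of a real production table:

```sql
-- Build the pivot column list with STUFF + FOR XML PATH, then run the
-- PIVOT through sp_executesql
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

SELECT @cols = STUFF((
        SELECT DISTINCT ',' + QUOTENAME(sales_channel)
        FROM order_hdr
        FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '');

SET @sql = N'
    SELECT *
    FROM (SELECT customer_id, sales_channel, order_total FROM order_hdr) AS src
    PIVOT (SUM(order_total) FOR sales_channel IN (' + @cols + N')) AS p;';

EXEC sp_executesql @sql;
```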
Please note: I have not used SQL Server in about 4 years now, so the above code is just from memory and probably contains some syntax issues, but you get the gist.
As you can see, the code above is tedious to write and very messy. Another option we have at our disposal would be to hardcode the columns we want to create a dynamic pivot out of. One can do that like this:
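Sticking with SQL Server-flavored PIVOT syntax for illustration, and guessing at the channel values (which is exactly the problem: the list goes stale the moment a new value shows up):

```sql
-- Pivot columns are hardcoded up front
SELECT *
FROM (SELECT customer_id, sales_channel, order_total FROM order_hdr) AS src
PIVOT (SUM(order_total) FOR sales_channel IN ([web], [mobile], [store])) AS p;
```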
However, this is 2024…surely there is a much easier way to do this? Well…with DuckDB there is…lo and behold:
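Pointing DuckDB's PIVOT at the same sample order table:

```sql
-- DuckDB discovers the distinct sales_channel values on its own
PIVOT order_hdr
ON sales_channel
USING SUM(order_total)
GROUP BY customer_id;
```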
I’ll take that any day of the week over the old school methods. Finally, a database engine that “got it right” IMO.
Conclusion
This article showed just a handful of DuckDB's syntactic sugar. There is definitely more out there that you can leverage beyond the three items I showcased today. I hope more database vendors take notice of what DuckDB is doing and work to make their SQL syntax more, dare I say, “user friendly”.
Even as of this writing, Microsoft SQL Server still does not support the QUALIFY clause (a Teradata-born extension that Snowflake, BigQuery, DuckDB, and others have adopted), which would make filtering on window functions so much cleaner on their platform. QUALIFY has been around for well over a decade now…what gives, MS?
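For the curious, here's a quick sketch of what QUALIFY buys you, run against the sample order table from earlier:

```sql
-- Latest order per customer: no wrapper subquery needed to filter
-- on the window function
SELECT *
FROM order_hdr
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
```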
Here’s the scratchpad code of today’s article: duckin great code fun
Thanks for reading,
Matt
I'd like to challenge the notion that Excel doesn't cut...
All jokes aside, great article. I personally have spent 0 time on DuckDB, but everything I've read about it says it works very well.
Lateral calculations are fantastic! Made many mistakes in my career because I had to rewrite logic.