Every now and then, I force myself to pay homage to the one that started it all, the granddaddy of them all, the big kahuna… yes, Dennis Ritchie and Ken Thompson’s brainchild: C.
C code pretty much runs everything today. Many of your household appliances run on C, many video games run on C++ (a derivative of C), DuckDB runs on C++, and most RDBMSs run on C and/or C++. Most of our fancy programming languages today are either implemented in C or C++ or rely on the C library somewhere on the way down to the 1’s and 0’s. One notable exception is Zig, which is gaining some traction as a new low-level programming language.
Why do I do this to myself and put unnecessary stress on my mind?
I code in C from time to time to help me appreciate how far we have come with nice data processing frameworks such as Spark, Polars, and DuckDB.
With that being said, let’s dive into what this script is.
The Setup
I’m going to use Polars to first create a dummy dataset of 1M rows of data that gets written out to a CSV. The schema for this dataset is as follows:
order_id - int
order_line_id - int
order_date - date
quantity - int
price - float
To generate this dummy dataset with Polars, I wrote this script:
It’s pretty straightforward. It chooses a random number of order lines, from 1 to 5, to add to each order_id. The script runs on my base-model M2 Pro in about 8 seconds; not too shabby for generating 1M rows of dummy data on the fly.
How Will We Be Processing the Data?
For the C script that we are about to walk through, we will read the 1M-row CSV file, group by order_id, sum the quantity and price, and find the max order date for each order_id. If we were to do this with Python Polars, it would be a slam dunk:
And Now, the C Code
When writing stuff in C, you basically have to build almost everything yourself. C doesn’t have an easy way to expand the tilde to the user’s home directory; you have to roll that yourself. And C doesn’t have a popular dataframe library handy to crunch the data, so we have to write all that logic as well. But let’s think about what we are trying to do here:
sum order quantities; that’s easy; just accumulate via +=
sum total price; same method as order quantity
find max order date; 2 variables in play: one holding the current max order date, and one candidate that gets compared to it and replaces it if it’s greater
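As an aside on the tilde expansion mentioned above, here is a minimal sketch of how it could be rolled by hand. The helper name `expand_tilde` is hypothetical, and it assumes the home directory is available in the `HOME` environment variable; it only handles a leading `~` or `~/`, not forms like `~otheruser`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original script): expands a leading
   "~" or "~/" using $HOME. Writes the result into out (capacity out_len).
   Returns 0 on success, -1 if HOME is unset or the result does not fit. */
int expand_tilde(const char *path, char *out, size_t out_len)
{
    int n;

    if (path[0] == '~' && (path[1] == '/' || path[1] == '\0')) {
        const char *home = getenv("HOME");
        if (home == NULL) {
            return -1; /* nothing to expand the tilde with */
        }
        /* Glue the home directory onto everything after the '~' */
        n = snprintf(out, out_len, "%s%s", home, path + 1);
    } else {
        /* No leading tilde: pass the path through unchanged */
        n = snprintf(out, out_len, "%s", path);
    }

    /* snprintf reports the length it wanted; reject truncated results */
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

Calling `expand_tilde("~/data/orders.csv", buf, sizeof buf)` fills `buf` with the home directory followed by `/data/orders.csv`; the file name here is just an illustration.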
How do we maintain all this cleanly though on a per-order_id basis? We will use a struct:
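The original listing is not reproduced here, so the following is only a sketch of what such a struct could look like. The type name `OrderAgg`, the field names other than `max_order_date`, and the two helper functions are assumptions for illustration; the 11-byte date buffer is the one detail taken from the article.

```c
/* Per-order running totals. Names are illustrative assumptions,
   except max_order_date and its length of 11. */
typedef struct {
    int    order_id;
    long   total_quantity;
    double total_price;
    char   max_order_date[11]; /* "MM/DD/YYYY" is 10 chars, plus '\0' */
} OrderAgg;

/* Start a fresh aggregate when a new order_id shows up */
void order_agg_init(OrderAgg *agg, int order_id)
{
    agg->order_id = order_id;
    agg->total_quantity = 0;
    agg->total_price = 0.0;
    agg->max_order_date[0] = '\0'; /* empty string: no date seen yet */
}

/* Fold one order line into the running sums with plain += */
void order_agg_add_line(OrderAgg *agg, int quantity, double price)
{
    agg->total_quantity += quantity;
    agg->total_price += price;
}
```

Keeping all of the running values for an order in one struct means each incoming CSV row only has to touch a single variable.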
You might ask yourself: why is the length of max_order_date 11, when a date in MM/DD/YYYY format takes up only 10 characters? That eleventh character is for the pesky null terminator that C requires to mark where a string ends; without it, anything that reads or writes the date would run past the end of the buffer and bomb. It’s a nuance of C that we never have to worry about in higher-level languages like Python. If you’d like to better understand the basics of C, you can read this tutorial that I wrote.
Anyhow, when I wrote this code, I first tried to queue up all the rows in RAM, but quickly ran into memory pressure trying to store such a vast array. Thus, I built in a buffer that flushes the data to disk after every 100 rows, which got the total script to read the 1M rows and write out the aggregate to another text file in less than a second. Also, the methods I’m using in C are not the absolute most efficient for writing text to files. For a crazy deep dive on that subject, you can read this article that I wrote over half a year ago, in which I was able to write 1B sequential ints to a text file in C in under 3 seconds.
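To make the flush-every-100-rows idea concrete, here is a rough sketch under a few assumptions: the buffer holds finished per-order aggregates, the output is one comma-separated line per order, and the names `AggBuffer`, `agg_buffer_push`, and `agg_buffer_flush` are hypothetical. It is not the article’s actual code.

```c
#include <stdio.h>

#define FLUSH_EVERY 100 /* flush threshold mentioned in the article */

/* Illustrative aggregate row; repeated here so the block compiles alone */
typedef struct {
    int    order_id;
    long   total_quantity;
    double total_price;
    char   max_order_date[11];
} OrderAgg;

/* Fixed-size staging area plus the file it drains into */
typedef struct {
    OrderAgg rows[FLUSH_EVERY];
    size_t   count;
    FILE    *out;
} AggBuffer;

/* Write every buffered row to the file and empty the buffer.
   Returns the number of rows written. */
size_t agg_buffer_flush(AggBuffer *buf)
{
    size_t written = buf->count;
    for (size_t i = 0; i < buf->count; i++) {
        const OrderAgg *r = &buf->rows[i];
        fprintf(buf->out, "%d,%ld,%.2f,%s\n", r->order_id,
                r->total_quantity, r->total_price, r->max_order_date);
    }
    buf->count = 0;
    return written;
}

/* Queue one finished aggregate, draining the buffer first if it is full */
void agg_buffer_push(AggBuffer *buf, const OrderAgg *row)
{
    if (buf->count == FLUSH_EVERY) {
        agg_buffer_flush(buf);
    }
    buf->rows[buf->count++] = *row;
}
```

With this shape, memory use stays capped at 100 rows no matter how large the input is; one last call to `agg_buffer_flush` after the read loop writes whatever is still sitting in the buffer.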
The bulk of the C code magic can be seen in the screenshot below (Full code link is at the bottom of this article):
As you can see, we have a lot going on here. Lines 76 and 77 are doing the += trick to accumulate the quantity and price sums. Also, for the max order date, I wrote this additional function that gets called:
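The author’s function is not reproduced here. As one possible approach, the sketch below assumes the dates are MM/DD/YYYY strings, as stated earlier. Because the year comes last in that format, a plain `strcmp` would sort "12/01/2022" after "03/15/2023", so this version first converts each date to a sortable number. The names `date_key` and `update_max_date` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Turn "MM/DD/YYYY" into a sortable integer YYYYMMDD.
   Returns -1 when the text does not parse (including an empty string). */
static int date_key(const char *date)
{
    int month, day, year;
    if (sscanf(date, "%d/%d/%d", &month, &day, &year) != 3) {
        return -1;
    }
    return year * 10000 + month * 100 + day;
}

/* Overwrite current_max (an 11-byte buffer) when candidate is a later date */
void update_max_date(char current_max[11], const char *candidate)
{
    if (date_key(candidate) > date_key(current_max)) {
        strncpy(current_max, candidate, 10);
        current_max[10] = '\0'; /* always keep the null terminator in place */
    }
}
```

Starting each order with an empty `current_max` works because an empty string parses to -1, so the first valid date always wins.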
And that’s pretty much all there is to it.
Summary
This article showed how to process data in C. We also provided a snippet of how to do the same thing with Python Polars, which took less than 1/4 of the lines that the C code requires. Keep in mind, though, that under the hood, Python Polars calls Rust code, which is compiled with LLVM, which is itself written in C++ 😆.
So, as you go forward cracking open DuckDB, Polars, or PySpark during your day job to write a few lines to transform data, just remember where this all started and what it’s all built on. It will help you better appreciate how far we have come since the 1970s.
Thanks for reading,
Matt