Side Note - For this image above, I tried to get the bing image creator to show the crab and duck doing the fusion dance from Dragonball Z and this is the best it could come up with…
As both Databricks and the Apache Iceberg crew work to harden their Rust-based tooling, both organizations have settled on Apache DataFusion as their backend data processing engine.
I couldn’t help but wonder why. Why DataFusion? Why not the duck…who has already proven itself a valid contender for SQL-based data processing?
This article provides an introduction comparing DuckDB and DataFusion on their ease of use, and also pontificates on what led Databricks and Iceberg to go with DataFusion.
Rust and the Hysteria
There is no denying that Rust is an extremely fast and powerful programming language. In a personal benchmark I ran a while back, Rust was able to write 1B rows of integers in under 3 seconds, with minimal time spent optimizing the code. In that benchmark, I compared it to both C and Go. C barely edged out Rust for the win, but only after a significant amount of involvement with some crazy algorithms for converting integers to text fast enough for very fast writes.
Rust gives the developer a more rigid programming structure in terms of how variables can be passed from function to function, along with the whole concept of “ownership”, which in itself is a deep rabbit hole of a discussion that this article will not get into.
With all that out of the way, there has been a growing trend lately of Python APIs becoming simply wrappers around a Rust backend. This makes sense: if you need Python to work through some massive data and perform massive numbers of iterations, you want it to hand those commands off to a compiled language, since Python is notoriously slow when it comes to looping over a large dataset.
I think this strikes a good balance, though. Let Rust do the heavy lifting behind the scenes, but spare your average Joe of a developer from having to learn how to program in Rust. Just give them the simplicity of Python so they can put their foot on the gas and get ideas from whiteboards to working code much faster.
Where Were We? Oh Yeah, Why DataFusion and Not DuckDB
Both DataFusion and DuckDB offer a SQL-based, breakneck-speed data processing engine for churning through Arrow tables quickly. On the surface, the two look very similar. So, let’s take a look at some simple examples of both first and figure out what’s going on:
Example 1: The Answer to all things in the Universe
Very similar styles here for both. No surprise. Now let’s see how they handle reading data.
Example 2: Reading Data
Alright, so a slight deviation. DataFusion requires an extra line of code to first register the CSV files as a table…no big deal.
Example 3: Exporting Data
Exporting data has essentially identical syntax, which is great. Who knew that these days exporting data to Parquet could be done in just one line of code 😁.
Example 4: Generating Test Data
Both DuckDB and DataFusion make generating test data on the fly very easy. There’s only a slight change in how the series of rows is generated.
Example 5: Reading Iceberg Tables
I couldn’t leave out Iceberg for this introduction. I’ve been on an Iceberg kick lately, as you’ve seen in several of my other posts. Let’s see how well DuckDB and DataFusion can handle reading an Iceberg table:
Side Note: I first generated the Iceberg test table here.
First let’s establish a catalog connection via pyiceberg:
Now let’s read a previously generated Iceberg table to both DuckDB and Datafusion:
Alright, both approaches are pretty easy. PyIceberg has a native “to_duckdb” function built in. With DataFusion, you have to wrap the Arrow data stream in a pyarrow dataset to read it in, but it’s still pretty easy IMO.
Conclusion
At the end of the day, both DuckDB and DataFusion seem fine to use for data processing. From an analytical perspective, I’d contend that DuckDB is still well ahead of DataFusion in terms of rich SQL features. I wrote an article several months ago on some of the nice SQL syntax sugar DuckDB has, and those features are not included in DataFusion.
If you are looking to develop a Rust-based API wrapper for Python and need a SQL-based data processing engine, then DataFusion seems like the logical choice, since it’s natively available as a Rust crate. Sure, DuckDB has Rust bindings, but they’re not native, so you might hit some limitations on things you can do when developing your Rust apps.
As for why DuckDB chose C++ as its programming language, I’ve heard from a few folks that it simply came down to the longevity and proven track record C++ has to offer. Rust has not been around nearly as long and has had its fair share of drama. Additionally, DuckDB wasn’t the only heavyweight that chose C++ for its backend: Databricks’s Photon engine is also written in C++.
Well, there you have it, folks. DataFusion and DuckDB both offer similar syntax and ease of use when it comes to using a SQL-based query engine in Python, or as Dr. Evil once said:
Thanks for reading,
Matt
Datafusion will always be relegated to the background because of its underwhelming Python wrapper. Such is life.
We are always looking for help to improve datafusion-python. I don't think you have to take it as a given -- if you are willing to help, we would most appreciate it: https://github.com/apache/datafusion-python
Perhaps you could write up what makes it "underwhelming" to you as a ticket?