Apache Arrow is the de facto in-memory columnar data format, and its close cousin, Parquet, is the de facto file format for columnar data storage. What makes Arrow unique and sets it apart from older row-oriented formats is that it is highly optimized to crunch through millions, if not billions, of rows very quickly, especially for operations like aggregation.
If we look at the landscape of data processing engines today, many of them harness Apache Arrow under the hood, such as Polars, DataFusion, DuckDB (via its Arrow integration), and pandas (via its Arrow-backed dtypes).
Additionally, many other libraries use Apache Arrow to build and work with datasets. Most recently, I worked on a PR for the pyiceberg project to enable upsert capabilities on an Iceberg table using only Python (no Spark). Once I got knee-deep into the project, I started to better understand the Arrow API (the Python version is known as pyarrow), and I began to see why it is such a powerful in-memory data format.
For the rest of this article, we will explore some of the basics of pyarrow and how you can use it to create and manipulate in-memory tables.
The Basics: Create a Table
To create a table using pyarrow, we can simply do the following:
This is pretty straightforward. One might ask next, "How does one filter for specific rows on this table?"
Filtering
To do that, we create what pyarrow calls a "mask". A mask is simply an array of booleans, one per row, where True keeps the row and False hides it. Since there are four rows in this table, we will create an array with four booleans. I'll have it show only rows 1 and 4:
Don’t let the simplistic nature of a mask fool you though. This is where you can start to do some more advanced stuff with pyarrow tables, such as filtering for specific criteria. For our next example, let’s only show rows where the price is greater than or equal to 21:
Pretty cool. But is there a more direct way to do this?
PyArrow Compute API
There is. We'll introduce another part of the pyarrow API: its compute layer. The example below uses pyarrow.compute to filter for rows with a price greater than or equal to 21. One of the nice things about the compute layer is that it has many built-in functions to get the job done:
Can I Effectively Query This Stuff?
PyArrow doesn't directly support a SQL dialect (check out DataFusion if you want that), but it does support many dataframe-style operations such as grouping and summing. For the final example of this intro article, we will sum the price column grouped by order type. Below is how you do that:
And if you want to dive deeper, check out all the functions pyarrow’s compute API supports.
Summary
This article was a brief intro to pyarrow. Arrow is a very powerful in-memory data format, and given how many libraries out there leverage it today, it will be around for a while before something supplants it.
Thanks,
Matt