CSV files are a very common way to exchange and share data. They have their pluses and minuses, but from a cloud strategy perspective, parquet files have become the gold standard: they compress better than CSV, and they carry a formal schema with them, whereas engines reading CSVs have to sample hundreds or thousands of rows and guess each column's data type. All of today's popular data processing engines, such as Spark, Polars, and DuckDB, can read and write parquet files.
Given those facts, a common pattern in the data engineering world today is converting CSVs, usually generated by on-prem systems, to parquet and then pushing the parquet files up to a cloud storage bucket for downstream consumption. This has become a very easy task that we can script in just a few lines of bash.
Let’s Get our Test Data Ready
As you’ve seen in several of my other posts, I’ve previously written a data generator program in Go, which can be found in this repo. Having a dummy data generator has become very handy when I need to do performance testing as well as create demos for others. So, let’s start by using this repo to generate 10 CSV files, each containing 100k rows of dummy data:
fd create -f 10 -t csv -o ~/test_dummy_data/fd -p data -r 100000
Alright, let’s take a peek at one of the files:
Converting the Files to Parquet
Now that we have our test data ready, let me show you how brain-dead easy it is to convert the CSVs to parquet. For this, I’ll use a simple bash script and the DuckDB CLI:
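A minimal sketch of that script might look like the following. The directory paths match the generator command from earlier; the output directory name is an assumption for this sketch, so adjust it to taste:

```shell
#!/usr/bin/env bash
# Convert every CSV in the source directory to parquet using the DuckDB CLI.
# SRC_DIR matches the generator's output path above; OUT_DIR is an assumption.
SRC_DIR=~/test_dummy_data/fd
OUT_DIR=~/test_dummy_data/parquet
mkdir -p "$OUT_DIR"

for f in "$SRC_DIR"/*.csv; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    base=$(basename "$f" .csv)
    # read_csv_auto infers the schema; COPY ... TO writes it back out as parquet
    duckdb -c "COPY (SELECT * FROM read_csv_auto('$f')) TO '$OUT_DIR/$base.parquet' (FORMAT 'parquet');"
done
```

Because DuckDB infers the schema once at conversion time, every downstream reader of the parquet files gets proper column types for free.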
This bash script simply loops over the file set we generated earlier and uses the DuckDB engine to read each CSV and convert it to parquet.
Now the Push to Cloud Storage
Now that we have our parquet files ready, how do we upload them to cloud storage? For this, I’ll use the gcloud CLI for GCS; it’s just one line of code:
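In sketch form it could look like this. The destination path is a placeholder, and the bucket name is read from an environment variable rather than hard-coded (see the pro tip below):

```shell
# Push all converted parquet files to the bucket.
# GCS_BUCKET is assumed to be set in the environment; the fallback value
# here is a placeholder shown for illustration only.
GCS_BUCKET="${GCS_BUCKET:-my-demo-bucket}"
gcloud storage cp ~/test_dummy_data/parquet/*.parquet "gs://${GCS_BUCKET}/dummy_data/"
```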
Pro Tip - Don’t ever expose your bucket name in your scripts; keep it in an environment variable.
And there you go. Let’s take a peek at GCS:
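One quick way to check is to list the destination prefix, again pulling the bucket name from an environment variable as the pro tip suggests (the prefix here is a placeholder):

```shell
# List the uploaded parquet files; GCS_BUCKET holds the bucket name.
GCS_BUCKET="${GCS_BUCKET:-my-demo-bucket}"   # placeholder; set your real bucket via env
gcloud storage ls "gs://${GCS_BUCKET}/dummy_data/"
```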
Summary
Converting CSVs to parquet has become a very easy process these days. What I demoed above is just one of many methods you can use; you can also use Spark, Polars, or numerous other libraries to convert the data over.
Thanks for Reading,
Matt