If you’ve read some of my older posts on LinkedIn, you’ll know I do a lot of work generating test data in a variety of programming languages. It has become an essential part of my job when I need to do performance testing with new vendor products. Some of these products are for Spark-driven workloads, others for SQL workloads. Either way, I often need to run a pure volume load test before putting in any data that resembles something more realistic. Stress testing a new product is essential to make sure it performs and scales well, and for a volume test all I need is a large dataset of random gibberish. Sounds simple, right?
In this first part, I’ll show you some simple tricks I use in both SQL-driven and Spark-driven environments to generate test data on the order of several GB and hundreds of millions of rows. This test data will consist only of integers, GUIDs, floats, and dates. Generating more meaningful test data with realistic names, geographic info, and other fields will come in a later part of this series.
Generating Test Data Using DuckDB + SQL
For this first part, I will reach into my DuckDB bag of tricks and show you how to easily generate several hundred million rows spread across multiple parquet files using DuckDB and Python, so strap in. First, we need to create the baseline query and sample some data to make sure it looks good. Below is how we do that:
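Here’s a minimal sketch of that baseline query. The column names and value ranges are just placeholders I picked for illustration, so swap in whatever shape you need:

```python
import duckdb

# Baseline query: one row per value of i, with random integers, GUIDs,
# floats, and dates. Column names and ranges are placeholders.
BASE_QUERY = """
    SELECT
        i                                             AS id,
        gen_random_uuid()                             AS guid,
        CAST(floor(random() * 1000000) AS INTEGER)    AS some_int,
        random() * 10000                              AS some_float,
        DATE '2020-01-01'
            + CAST(floor(random() * 1825) AS INTEGER) AS some_date
    FROM range({row_count}) AS t(i)
"""

con = duckdb.connect()  # an in-memory database is fine for this

# Sample a handful of rows to eyeball the output before scaling up.
print(con.sql(BASE_QUERY.format(row_count=10)).df())
```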
Alright, now that we have this set up, let’s have it export 50M rows to a parquet file using DuckDB’s COPY command:
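A sketch of that export step, reusing the BASE_QUERY and connection from above and writing to a hypothetical test_data_0.parquet path:

```python
ROW_COUNT = 50_000_000

# COPY a SELECT straight to a parquet file -- DuckDB streams the result
# to disk rather than materializing all 50M rows in Python.
con.execute(f"""
    COPY ({BASE_QUERY.format(row_count=ROW_COUNT)})
    TO 'test_data_0.parquet' (FORMAT PARQUET)
""")
```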
This ran in about 5 seconds on my base-model M2 Pro Mac and wrote out a 1.24GB parquet file. Now that we have this working, let’s take it a step further and parallelize it to write out 10 files, giving us 500M rows and over 10GB of data. For this, I’ll leverage Python’s ProcessPoolExecutor:
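Here’s a sketch of the parallel version as a standalone .py file. The file names, row counts, and worker count are illustrative, not prescriptive; each worker opens its own DuckDB connection and writes one file:

```python
# generate_test_data.py
from concurrent.futures import ProcessPoolExecutor

import duckdb

ROWS_PER_FILE = 50_000_000
NUM_FILES = 10

BASE_QUERY = """
    SELECT
        i                                             AS id,
        gen_random_uuid()                             AS guid,
        CAST(floor(random() * 1000000) AS INTEGER)    AS some_int,
        random() * 10000                              AS some_float,
        DATE '2020-01-01'
            + CAST(floor(random() * 1825) AS INTEGER) AS some_date
    FROM range({row_count}) AS t(i)
"""


def write_file(file_index: int) -> str:
    """Write one parquet file of random rows from its own DuckDB connection."""
    path = f"test_data_{file_index}.parquet"
    con = duckdb.connect()  # each subprocess gets an independent in-memory DB
    con.execute(
        f"COPY ({BASE_QUERY.format(row_count=ROWS_PER_FILE)}) "
        f"TO '{path}' (FORMAT PARQUET)"
    )
    return path


if __name__ == "__main__":
    # One worker per file; the __main__ guard matters because the
    # subprocesses re-import this module on startup.
    with ProcessPoolExecutor(max_workers=NUM_FILES) as executor:
        for written in executor.map(write_file, range(NUM_FILES)):
            print(f"wrote {written}")
```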
And there you go: 12GB of data written in about 12 seconds. Note that whenever we use a ProcessPoolExecutor, the code needs to live in a .py file. I’ve found over my career that using a ProcessPoolExecutor directly inside a notebook (.ipynb file) is buggy and doesn’t work reliably; it has something to do with the notebook kernel not being able to coordinate with the subprocesses, so the run exits early. I’ve never bothered going down the full rabbit hole on that one, and just using a .py file works well.
Anyway, that’s a lot of content for our initial go at it. I’ll save showing how to do this in Spark for part 2; I can already see this article turning into a three-part series. The code link for today’s work is below.
Thanks,
Matt