In the previous two articles, I demonstrated how to build a large-volume dataset using either DuckDB or Spark. Those datasets were focused purely on generating a massive number of rows (500M) to stress test a system on volume loads; they were not meant to resemble anything you would see in the real world. No worries, though: in this post we are finally going to tackle that problem. We will build a dataset that has the following attributes:
Personal Information: First and Last Names, Birth Date, Net Worth, and Email Address
Geographic Information: City, State, Zip Code, Latitude, and Longitude
Employment Information: Hire Date, Occupation, Salary
System Information: Record Create Timestamp, Unique Transaction ID
If you look across the Python packages available for this type of work, you will find two popular options: Faker and Mimesis.
I've found, though, that Mimesis performs significantly faster than Faker, so we will use Mimesis to generate our dataset. We will also need an engine to take the generated data and write it out as a Parquet file. For that, we have a good number of options, such as Pandas, Spark, and Polars. For this exercise, I'll use Polars, since I've come to know that package really well and like its performance.
Building The Data
Creating a fake dataset with Mimesis and writing it out to Parquet is pretty simple. Below is the code that does that:
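Here is a minimal sketch of that approach using Mimesis providers and Polars; the function name generate_fake_file, the specific fields, and the value ranges are illustrative assumptions rather than the exact original code.

```python
from datetime import datetime
import uuid

import polars as pl
from mimesis import Address, Datetime, Numeric, Person
from mimesis.locales import Locale

# Providers instantiated once at module level.
person = Person(Locale.EN)
address = Address(Locale.EN)
dt = Datetime(Locale.EN)
numeric = Numeric()

# Schema defined once, outside the function, so parallel workers reuse it.
schema = {
    "first_name": pl.Utf8,
    "last_name": pl.Utf8,
    "birth_date": pl.Date,
    "net_worth": pl.Int64,
    "email": pl.Utf8,
    "city": pl.Utf8,
    "state": pl.Utf8,
    "zip_code": pl.Utf8,
    "latitude": pl.Utf8,   # kept as strings; Float32 caused issues for me
    "longitude": pl.Utf8,
    "hire_date": pl.Date,
    "occupation": pl.Utf8,
    "salary": pl.Int64,
    "record_created_at": pl.Datetime,
    "transaction_id": pl.Utf8,
}


def generate_fake_file(file_path: str, num_rows: int = 1_000_000) -> None:
    """Generate num_rows of fake records and write them to a Parquet file."""
    rows = [
        {
            "first_name": person.first_name(),
            "last_name": person.last_name(),
            "birth_date": dt.date(start=1940, end=2005),
            "net_worth": numeric.integer_number(start=0, end=5_000_000),
            "email": person.email(),
            "city": address.city(),
            "state": address.state(),
            "zip_code": address.zip_code(),
            "latitude": str(address.latitude()),
            "longitude": str(address.longitude()),
            "hire_date": dt.date(start=2000, end=2024),
            "occupation": person.occupation(),
            "salary": numeric.integer_number(start=30_000, end=250_000),
            "record_created_at": datetime.now(),
            "transaction_id": str(uuid.uuid4()),
        }
        for _ in range(num_rows)
    ]
    pl.DataFrame(rows, schema=schema).write_parquet(file_path)
```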
You will notice in the above code block a few things:
I've predefined the Polars schema above in its own area. I just find this style of coding cleaner. You could put it inside the function if you want, but since we are going to parallel process the code later, I like instantiating the variable once.
The generator for a random integer (the integer_number function) accepts a lower and upper bound, which is nice.
I had to convert the latitude and longitude to strings. For some reason, Polars kept failing when I tried to set the schema for those columns as Float32s. I'm not sure why, but casting them as strings gets them into the files. We can always cast back to floats or decimals later in whatever program consumes the data, as shown in the small sketch after this list.
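For that downstream cast, something along these lines would work in Polars (the file name below is an assumption, not from the original code):

```python
import polars as pl

# Hypothetical downstream step: read one of the generated files and cast the
# string latitude/longitude columns back to floats for consumption.
df = pl.read_parquet("fake_data_0.parquet")  # assumed file naming pattern
df = df.with_columns(
    pl.col("latitude").cast(pl.Float64),
    pl.col("longitude").cast(pl.Float64),
)
```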
How Does the Data Look?
When we run the function above, we could examine the output with Polars, but I usually reach for DuckDB instead, since writing SQL for me is like riding a bike and I don't have to think. Using DuckDB to query the Parquet files, we can pull a sample rowset of 5 like this:
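A query along these lines does the job; the file glob assumes the naming pattern used in the sketches above:

```python
import duckdb

# Query all generated parquet files at once and pull a 5-row sample.
duckdb.sql("SELECT * FROM 'fake_data_*.parquet' LIMIT 5").show()
```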
Now Let’s Scale This
Now that we have our data generator function set up, let's use Python's ProcessPoolExecutor, as I did in the first article, to run it in parallel and generate 50 files totaling 50M rows. The code for that is as follows:
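Here is a minimal sketch of that driver, reusing the hypothetical generate_fake_file function from the earlier sketch; the file naming and the 1M-rows-per-file split are assumptions consistent with the 50 files and 50M rows described here:

```python
from concurrent.futures import ProcessPoolExecutor

NUM_FILES = 50
ROWS_PER_FILE = 1_000_000  # 50 files x 1M rows = 50M rows total


def main() -> None:
    # Cap in-flight workers at 8 to keep memory usage on the laptop in check.
    with ProcessPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(generate_fake_file, f"fake_data_{i}.parquet", ROWS_PER_FILE)
            for i in range(NUM_FILES)
        ]
        # Block until every file is written and surface any worker exceptions.
        for future in futures:
            future.result()


if __name__ == "__main__":
    main()
```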
I set my max workers to 8 so that at any given time, I don’t have more than 8 files in flight getting written. This helps control the memory usage on my laptop. If I had a larger rig to run this on, I might increase the max workers depending on the number of cores available on the machine.
Results
This script generated 50M rows across 50 Parquet files in a little under 3 minutes. That's not bad, considering it uses random data generators, which can be computationally heavy. And given that a data generator is usually a one-and-done exercise on a project, I don't think I need to dig further into streaming to the Parquet files or writing rows in chunks to squeeze out more performance.
This concludes our 3-part series on generating test data. And here’s a link to the full code: code generator
Thanks,
Matt