This will be a fairly short article, since the script I'm about to share is pretty straightforward. In a previous article, I discussed and demonstrated how to load files from on-prem up to Google Cloud Storage (GCS). That is a foundational step in moving your data into a cloud platform such as Google BigQuery (GBQ) or "SnowBricks" (Snowflake/Databricks). So what's the next logical step? Loading the files from GCS into GBQ. Luckily, Google's google-cloud-bigquery API makes this incredibly simple. It can infer schemas on the fly, create tables if they don't exist, and load all the files you previously uploaded asynchronously into GBQ, which saves a lot of time versus loading one file at a time.
Important remark on parallel loading: if you are going to parallel-load your data into a GBQ table, create the table ahead of time rather than letting the API create it during the async load process; otherwise, you can hit a race condition where every thread tries to create the same table at the same time, producing a confusing series of errors.
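For reference, here's a minimal sketch of pre-creating the destination table with an explicit schema before kicking off parallel loads. The project, dataset, table, and schema below are hypothetical placeholders; swap in your own.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# Hypothetical schema; define whatever your CSVs actually contain.
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("name", "STRING"),
]

table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if it already exists
```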
Now let's take a look at the actual script (below). Here's what it does:
Looping over a GCS folder, provided as a parameter, to collect every file ending in “.csv”
Passing each file, as it's identified, into Python's ThreadPoolExecutor as an async load job, which loads that file into the corresponding GBQ table (see the sketch after this list)
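Here's a minimal sketch of that pattern, assuming the google-cloud-storage and google-cloud-bigquery client libraries are installed; the project, bucket, prefix, and table names are hypothetical placeholders for your own.

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery, storage

PROJECT = "my-project"    # hypothetical
BUCKET = "my-bucket"      # hypothetical; keep this out of public repos
PREFIX = "exports/"       # hypothetical GCS "folder"
TABLE_ID = f"{PROJECT}.my_dataset.my_table"  # pre-create this table first!

bq_client = bigquery.Client(project=PROJECT)
gcs_client = storage.Client(project=PROJECT)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes a header row
    autodetect=True,       # let BigQuery infer the schema
)

def load_file(blob_name: str) -> None:
    """Kick off a load job for one CSV and wait for it to finish."""
    uri = f"gs://{BUCKET}/{blob_name}"
    job = bq_client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    job.result()  # blocks until done; raises on failure
    print(f"Loaded {uri}")

# Find every .csv under the prefix, then load them in parallel.
csv_blobs = [
    b.name
    for b in gcs_client.list_blobs(BUCKET, prefix=PREFIX)
    if b.name.endswith(".csv")
]

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(load_file, name) for name in csv_blobs]
    for f in futures:
        f.result()  # surface any exceptions from the worker threads
```

ThreadPoolExecutor works well here because each thread spends almost all of its time waiting on BigQuery, so this is I/O-bound work and Python's GIL isn't a bottleneck.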
As you can see above, this script is pretty easy and to-the-point. If you don't run this process async and you're loading hundreds of files, it can take a while. That's why I reach for ThreadPoolExecutor when loading GCS files into GBQ; it saves a lot of time.
A few other reminders:
Don't expose your bucket names directly in your scripts unless the code lives in a private repo.
Don't use a JSON service account key to authenticate to Google's cloud services. Instead, as discussed in the prior article, use application-default credentials so you don't risk exposing a service account key to the world for anyone to use.
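Once application-default credentials are configured on the machine (e.g. by running gcloud auth application-default login), the client libraries pick them up automatically and no key file is involved:

```python
from google.cloud import bigquery

# No key file needed: the client falls back to the application-default
# credentials configured on this machine ("my-project" is a placeholder).
client = bigquery.Client(project="my-project")
```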
How does the loaded data look?
We'll take a quick peek in GBQ to make sure the data loaded correctly and makes sense at first glance. Here's what it looks like from a simple query:
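A sanity check like this, run through the same Python client, does the trick; the table name is a hypothetical placeholder.

```python
# Pull back a handful of rows to eyeball the loaded data.
query = "SELECT * FROM `my-project.my_dataset.my_table` LIMIT 10"
for row in client.query(query).result():
    print(dict(row))
```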
And Voila!
Like I said, this post is brief and to-the-point. Below is a link to the full script. Hope you enjoyed!
Thanks,
Matt