My first foray into the cloud was Google Cloud (GCP) back in 2016/17ish. Before that, it was on-prem SQL Servers and Teradata (AKA The Glory Days). Back then, Google BigQuery (GBQ) was just getting its legs. There was no concept of scripting/stored procedures, dynamic SQL, or materialized views in BigQuery…heck, our IT department had to roll its own "stored procedure-esque" application just to get massive business adoption of the new platform. Fast forward to 2024, and all of those missing things are now a reality, which makes it a lot easier for teams to move data processing from their on-prem/under-the-desk SQL Servers (guilty) up to GBQ; just remember though:
The cloud is like Hotel California; your data can check in any time you’d like, but it can never leave
I say that tongue-in-cheek; cloud companies want you to get all your data up to their platform because once it's up there, business teams will love it; they can write really, really bad queries that still perform, the storage and scale are nearly limitless, and migrating from one cloud to another is no fun.
With all that said, let’s talk about the data load patterns that I used when on Google Cloud. Luckily, our IT department did a great job loading the majority of the core data up, but when we had to load our own data, we followed this typical pattern:
Run a script that takes the files on-prem and loads them to GCS
Run another script that imports the data from GCS into Google BigQuery (GBQ)
Validate the loads to GBQ
When hunting around online, you will probably find some examples that take local files and load them directly into GBQ. So one might ask, "Why bother first putting the data in GCS and then loading it into Google BigQuery?" There are many reasons for that:
You can use GCS as a backup layer for all the raw data
This will make moving the data to another cloud easier down the road if need be
You might not need to load all the data from the raw files into GBQ, which will save on some storage costs
If you realize down the road that you need another column added in GBQ, or you discover that a certain column has not been loading correctly, you can update your loading scripts and simply replay the files already in GCS vs. having to upload all the history all over again.
Your on-prem local system might only hold a rolling 30/60/90 days of data. You can use GCS as a very cheap archiving layer for all your history
With that out of the way, how do we actually load these files to GCS? It's actually pretty simple if we leverage the google-cloud-storage Python package. I will assume you have your Google Cloud project set up, a bucket created, and billing established on your project at this point, all of which is required to continue this exercise. A few pro tips to note before we move forward:
Pro Tip 1: Use the gcloud SDK and set up your default credentials locally. I strongly recommend you never use a service account like good ole chat-gippity will tell you to
How do we set up our local default credentials for Google Cloud?
Using the gcloud CLI, run this command in your terminal/command prompt:
gcloud auth application-default login
This will launch a web browser and prompt you with a Google page to choose the Google account to continue under. Once you approve the access, it will bury a credential token in a hidden file somewhere on your computer that the google-cloud-storage Python package knows to check when we run our scripts, as a way to authenticate that you are who you say you are.
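If you want a quick sanity check that those credentials landed where the client library expects, a tiny snippet like the one below will fail immediately if auth isn't wired up. The project ID is a placeholder, and listing buckets is just a cheap call to prove access:

```python
# adc_check.py: a minimal sanity check, assuming you just ran
# `gcloud auth application-default login` (the project ID is a placeholder).
from google.cloud import storage

# The client picks up Application Default Credentials automatically;
# no key files or explicit credential objects required.
client = storage.Client(project="my-gcp-project")

# Listing buckets is a cheap way to confirm both auth and project access.
for bucket in client.list_buckets():
    print(bucket.name)
```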
We need some test data to load, right?
If you have followed some of my previous posts on LinkedIn, you will notice that I have written a lot of programs to generate test data. My favorite one is still the one I concocted in Go a few months ago. The README for that program can be found here:
To generate 5 CSV files with 100k total rows of test data using that program, I ran this command in the terminal:
fd create -f 5 -t csv -o ~/test_dummy_data/fd -p data -r 100000
Now the script
Now that we have our credentials set and test data generated, we will create our script to upload our files. As always, I like to make this stuff go fast, so we will run the uploads concurrently with Python's ThreadPoolExecutor to maximize our speed, since we will be network I/O bound. Below is what the code looks like:
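The sketch below assumes the bucket name lives in a GCS_BUCKET_NAME environment variable (more on that in Pro Tip 2) and that the CSVs sit in the ~/test_dummy_data/fd folder from the test-data step; the "raw/" object prefix and the worker count are just illustrative defaults:

```python
# upload_to_gcs.py: a minimal sketch of the upload step. The GCS_BUCKET_NAME
# environment variable, the "raw/" prefix, and the worker count are my own
# choices; the source folder is the ~/test_dummy_data/fd directory from earlier.
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from google.cloud import storage

BUCKET_NAME = os.environ["GCS_BUCKET_NAME"]  # keep the bucket name out of the code (Pro Tip 2)
SOURCE_DIR = Path.home() / "test_dummy_data" / "fd"


def upload_file(client: storage.Client, path: Path) -> str:
    """Upload a single local file to GCS, keyed by its file name."""
    blob = client.bucket(BUCKET_NAME).blob(f"raw/{path.name}")
    blob.upload_from_filename(str(path))
    return path.name


def main() -> None:
    client = storage.Client()  # picks up the Application Default Credentials from earlier
    files = sorted(SOURCE_DIR.glob("*.csv"))

    # Uploads are network I/O bound, so a thread pool lets them overlap.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(upload_file, client, f): f for f in files}
        for future in as_completed(futures):
            print(f"uploaded {future.result()}")


if __name__ == "__main__":
    main()
```

If you were pushing around much bigger files, you might tune max_workers or look at the transfer_manager helpers that newer versions of google-cloud-storage ship, but for a handful of small CSVs a plain thread pool is plenty.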
Pro Tip 2: Keep sensitive info, such as your GCS bucket name or Google project name, stashed locally in an environment variable. You don't want to end up like this poor soul: AWS S3 Charge
Now granted, the link I provided above is for AWS S3 and not GCS, but in general, it's good practice to not make your bucket names public for all to see. On a lighter note, AWS eventually owned up to the snafu and didn’t charge the user.
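If you want to be strict about Pro Tip 2, have the script refuse to run when the variable isn't set; a hypothetical helper like this (GCS_BUCKET_NAME is a name I made up, set it in your shell profile before running) keeps the bucket name out of the code entirely:

```python
# settings.py: a hypothetical helper for Pro Tip 2. GCS_BUCKET_NAME is a
# variable name I made up; export it in your shell/profile before running.
import os


def required_env(name: str) -> str:
    """Read an environment variable or fail loudly instead of falling back to a hard-coded value."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running")
    return value


BUCKET_NAME = required_env("GCS_BUCKET_NAME")
```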
I ran the script and it took about 5 seconds to upload 14.5 MB of files from my local workstation to GCS - yeah, my AT&T "broadband" isn't that stellar on upload speeds. Here's what it looks like on the GCS end:
Pretty straightforward…and now we are ready to move into part 2 of this series, on how to load the data into GBQ (Google BigQuery). That article will be out in the next few days or so. Here's a link to the code demonstrated today.
Thanks for reading,
Matt