Throughout my career, I’ve seen many evolutions of how we, as data engineers, “ship” pipelines to production. Some of the common methods include:
CI/CD pipelines in GitLab/GitHub
Emailing the script to the DBA to run (No one will ever admit this though 😆)
Saving bash jobs to a shared drive and using cron to run them (or Windows Task Scheduler)
Airflow + S3/GCS, etc.
Deploying SSIS packages directly on SQL Servers through SSISDB or MSDB
And while these methods have worked well for decades, the push into more open source platforms such as Spark, and into microservices, has created a new problem with dependencies and compatibility. Additionally, shared infrastructure has become common, i.e., a giant server that virtualizes chunks of CPU, RAM, and sometimes storage for each process. We need a way to isolate our applications and software runtimes from one another.
Because of this, containers were introduced to solve the dependency and shared-resource problems. A container is essentially a full-fledged application environment that packages an OS layer, a scheduler, RAM and CPU management, networking, and your application code, all “contained” in isolation. But you can interact with it as if it were just another service or server you are used to connecting to.
Some people I’ve chatted with, however, have a gross misunderstanding of what a container is. Just look at the utter confusion on Josue’s face:
Side Note - Josue, those are jars, not containers, as Roy pointed out
Enough Chatter, Let’s Go Build A Simple Container That Runs A Python Script
For the rest of this article, I will walk you through how to build a simple container that runs a Python script. We will keep it easy and to the point. Here’s what we will be targeting:
A small Python image container with a shared drive mapped back to our local workstation, where we can load/edit/test scripts
The script will create a parquet file with DuckDB and write it out
Step 1: Download Docker Desktop
If you don’t have Docker Desktop installed, go get it at this URL: Docker
Step 2: Create a new directory on your home directory called “de_container_easy”
This will be the working directory where we create our Docker files and add our scripts. Within that folder, make a subfolder called scripts. When that is done, we should have a de_container_easy folder with an empty scripts folder inside it.
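If you prefer the terminal, a couple of commands will do it (macOS/Linux shown here; adjust for Windows):

# create the working directory and the scripts subfolder in one shot
mkdir -p ~/de_container_easy/scripts
cd ~/de_container_easy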
Step 3: Create a file in our working directory called “docker-compose.yml”
This is the file that tells Docker how to build and run our container. The contents of this file should not be that hard to follow:
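Something along these lines will do the trick; the image tag, paths, and the keep-alive command at the end are choices you can tweak (keeping the container alive is what lets us log into it later):

services:
  python_duckdb_runner:
    image: python:3.12-slim          # small Debian-based Python image, roughly 150 MB
    container_name: python_duckdb_runner
    working_dir: /app
    volumes:
      - ./scripts:/app/scripts       # mount our local scripts folder into the container
    command: sh -c "pip install duckdb && python ./scripts/duckdb_script.py && tail -f /dev/null"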
What matters in here is just a few things:
The image: we are pulling the smallest OS image that can run Python from the Docker container registry (about 152 MB total)
The volumes: this is where we map a local folder into our Docker container; it mounts the scripts folder from our workstation into the container, which lets us edit the scripts locally and then run them inside the container
The command: this tells the container, on startup, to install DuckDB and then run our script
Step 4: The Python Script
Create a script file inside your scripts folder called “duckdb_script.py”. There is nothing really special here. I’m simply using DuckDB to create a small table and then export it to a parquet file.
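Something along these lines works; the table contents here are just an example, and the key bit is the COPY ... TO statement that writes the parquet file next to the script:

import os
import duckdb

# write the parquet file next to this script so it lands in the mounted scripts folder
out_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "items.parquet")

con = duckdb.connect()  # in-memory database is plenty for this
con.execute("""
    CREATE TABLE items AS
    SELECT * FROM (VALUES
        (1, 'widget',  9.99),
        (2, 'gadget', 19.99),
        (3, 'gizmo',   4.50)
    ) AS t(item_id, item_name, price)
""")
con.execute(f"COPY items TO '{out_path}' (FORMAT PARQUET)")
print(f"Wrote {out_path}")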
Step 5: Run It
To run the container, start Docker Desktop, and then simply run the following command in a terminal:
docker compose up -d
That “-d” flag tells Docker to run the container in the background so that our terminal stays free for us to keep issuing commands. If you don’t include the flag, your terminal will stay attached to the container’s output, and you would need to open another terminal window to do other things. If all worked successfully, we should see a newly created file called “items.parquet” show up in our scripts folder.
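If you want to double-check, the container’s logs and a quick directory listing will tell you (this assumes the service is named python_duckdb_runner, as in the compose sketch above):

# tail the service logs and confirm the parquet file landed in the scripts folder
docker compose logs python_duckdb_runner
ls scripts/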
How Do I Make Changes And Test Though?
Making changes to your script and re-running it is very easy. From our local workstation, we can log into the container with the following command:
docker exec -it python_duckdb_runner "bash"
Now that we are logged into the container, let’s edit our Python script locally, add a print statement, and save it:
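For example, something as simple as this at the end of the script (the message is obviously just an example):

# hypothetical edit: prove that a change made on the workstation shows up inside the container
print("items.parquet written - hello from inside the container!")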
And now we execute it:
python ./scripts/duckdb_script.py
Bingo!
Great, We Got This Working…But How Does One “Ship” It?
Shipping a container to a production environment is not as hard as it sounds. First, you will want to publish your container image to a repository called a “container registry”. This is similar to publishing your code to GitHub. All the major cloud providers offer one, as does Docker itself. From there, the team that needs to run the container can simply point to the version you published and, with their own docker compose file, run it in a container runtime environment. Again, all the major cloud providers have container execution services; you can also roll your own on-prem very easily with Docker.
AWS Docs on Containers: AWS Containers
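As a rough sketch, publishing an image to Docker Hub looks something like this (the image and repository names here are made up):

# log in to the registry (Docker Hub in this example)
docker login
# tag the locally built image with your registry namespace and a version
docker tag my_pipeline:latest mydockerhubuser/my_pipeline:1.0.0
# push it up so other teams can pull it from their own compose files
docker push mydockerhubuser/my_pipeline:1.0.0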
Summary
This article walked you through the following:
How to create a simple Docker container that runs a Python script
How to edit and test your Python script locally while running it inside the Docker container
In a future post, we will get more involved and show you how to run a Spark script in a container; this requires a little more elbow grease since Spark needs the Java runtime.
Thanks for reading,
Matt