FYI - aside from one tiny sketch near the end, this post has no code and only takes a few minutes to read. This is all about educating you on what CICD is. If you’d like to learn, then proceed; otherwise, go back to mindless scrolling on LinkedIn 🤣.
You can be an excellent data engineer and make a decent career for yourself just by building pipelines, e.g. Spark scripts, dbt workflows, etc. But what if you wanted to raise your game to a level SIGNIFICANTLY higher than what you do today? Well strap in folks, this is where CICD for the data engineer comes into play.
…But what exactly is CICD? It’s a software engineering acronym that stands for Continuous Integration / Continuous Delivery. Huh? In short, it’s a standardized practice and set of tools for writing your code, submitting it for review, and, once it is fully approved, kicking off a process that merges your code to the main branch and deploys it to your target application, thus “delivering” it to production. Sounds pretty cool, right? But how does that work, and why should you care?
Well, let’s take a look at what happens once you (the data engineer) have successfully built and (hopefully) tested your new whizbang ETL script. If you are a one-person wrecking crew, you might upload the script manually to your Spark cluster and schedule it. Or if you are in ELT land, you might just right-click the stored procedure in SSMS, click “Modify”, paste your new code over the old, smash the F5 key on your keyboard, and feel that instant shot of adrenaline before lunch break. That’s fine and all for smaller orgs/applications, but what if you were part of a much larger ecosystem where a bug in your code, or something unforeseen happening downstream after you deploy, would have significant impacts? Well, this is where CICD thrives.
With CICD, we take the manual uploading/alteration of an ETL script out of human hands and instead have a machine change the code for you as well as track versions of it. Below is a high-level, step-by-step view of how CICD works (from a data engineer’s perspective). And don’t worry, we will get into the finer details in a later post:
Data Engineer codes the stored procedure/Spark script
Data Engineer runs unit testing on said script
Data Engineer submits a pull/merge request on GitHub/GitLab
Request is reviewed by other engineers
Once approved and the merge button is clicked, a pipeline kicks off
This “pipeline” is also known as a CICD script
The CICD script is a series of jobs (typically written in YAML) that each run in an isolated environment via Docker containers
The jobs should run a test step, and if that passes, deploy the changes to production and merge them to the target branch (see the sketch just after this list)
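To make those last few steps concrete, here is a minimal sketch of what such a CICD script could look like. This is a hypothetical example using GitLab CI syntax (a `.gitlab-ci.yml` file); the job names, Docker image, and deploy script are placeholders for illustration, not anyone’s real config:

```yaml
# .gitlab-ci.yml - minimal, hypothetical CICD pipeline sketch.
# Each job runs inside its own isolated Docker container,
# and the deploy stage only runs if the test stage passes.

stages:
  - test
  - deploy

unit-tests:
  stage: test
  image: python:3.11            # container the job runs in
  script:
    - pip install -r requirements.txt
    - pytest tests/             # re-run the unit tests from step 2

deploy-to-prod:
  stage: deploy
  image: python:3.11
  script:
    - python deploy_etl.py      # placeholder: pushes the ETL script to the cluster
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'   # only deploy on the main branch, post-merge
```

The takeaway isn’t the exact syntax, it’s that no human touches production directly: the machine only deploys after review, merge, and a passing test stage.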
That’s a lot in a nutshell. In a future post, we will walk through a fuller, real-world example CICD pipeline so that you can wrap your head around what exactly was just discussed above.
Thanks for reading,
Matt