Feature flags, in a nutshell, are a standardized way of deploying new “features” (more or less releases/add-ons) to an ETL pipeline without changing the code base. That last phrase, “without changing the code base,” is IMO a loaded statement, and let me explain: when you do some simple Google or chat gippity searches on feature flags, they will show you how to add features to an ETL pipeline, but one thing you will notice is that those pipelines are already CODED to check for and handle these features. So it’s not like you can turn on a feature flag and the pipeline ‘automagically’ just inherits it and updates the dataset. You have to first pre-build the ETL with some type of conditional logic that checks whether said feature is turned on.
With that said, let’s ground ourselves on this idea of feature flags. TBH, this is not a new idea. The marketing term “feature flag” is relatively new, but you can easily look back at legacy stuff such as MSSQL and see that this idea has been around for decades. SQL Server would implement them via trace flags or database and server level configs. Do you remember running this stuff back when you wanted to take over a production SQL Server from your DBA 😈?
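If you never had the pleasure, it looked roughly like this:

```sql
-- Classic sp_configure two-step: expose the advanced options,
-- then flip the "feature" on
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;
```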
Setting a value of “1” (a.k.a. “True”) for show advanced options and xp_cmdshell is effectively toggling a feature flag.
OK, So How Would I Use Feature Flags in Spark?
You, as the artist, can deploy a feature flag however you want. 😁
That said, personally I’ve found the most streamlined way of doing so is with a YAML file that contains your feature(s). This keeps the flag on/off process clean and simple.
In real-world scenarios I’ve handled in the past, I like to standardize my tables with a column called “last_upd_ts” that shows the last time a record was touched, either via an insert or an update. Not every table in the DWs I managed had this column. So if we want to turn this feature on, we will do the following:
Add it in our YAML file
Update the Spark code to check for said feature
If the feature is on, check if the table has the column added already. If it doesn’t, we need to run an ALTER TABLE statement
Update the MERGE statement at the end to accommodate this feature
…Remember how at the beginning of this article, I said that the phrase “without changing the codebase” is a loaded statement? That’s exactly why. We need to pre-wire the pipeline for any and all features we want to take advantage of, at least in this tutorial (more thoughts and pontification at the conclusion on how to make this stuff better).
Alright, Let’s Get Crackin’
We will use Iceberg and local PySpark for this tutorial. First, we will spin up a Spark session:
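(Roughly like this; the Iceberg runtime version and the warehouse path are placeholders for whatever your local setup uses.)

```python
from pyspark.sql import SparkSession

# Local Spark + Iceberg session. The runtime jar version and the "local"
# catalog/warehouse settings are placeholders -- match them to your install.
spark = (
    SparkSession.builder.appName("feature-flag-demo")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "warehouse")
    .getOrCreate()
)
```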
Next, we will create our YAML file with our feature. That looks like this:
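(The file name and nesting below are just my convention; structure yours however you like.)

```yaml
# features.yml -- one entry per feature you want to be able to toggle
features:
  last_upd_ts:
    enabled: true
```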
Pretty straightforward, right? Next, let’s read our YAML file and grab our feature flag:
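(A few lines of PyYAML; the variable name is just my choice.)

```python
import yaml  # pip install pyyaml

with open("features.yml") as f:
    feature_config = yaml.safe_load(f)

# A single boolean the rest of the pipeline keys off of
last_upd_ts_enabled = feature_config["features"]["last_upd_ts"]["enabled"]
```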
Alright, we have the foundational stuff set up. For the rest of this tutorial, I’ll be using a stocks table (similar to a prior article you might have read…I’ve gotten lazy and don’t want to reinvent the wheel for every tutorial 😆). When I first create the table, I won’t include a “last_upd_ts” column. This is how the initial creation looks (assume the table already exists and this CREATE is removed from the recurring pipeline):
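(The columns here are just illustrative stand-ins; the only thing that matters is that last_upd_ts is missing.)

```python
# One-time table creation -- note there is no last_upd_ts column yet
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.stocks (
        ticker      STRING,
        trade_date  DATE,
        close_price DOUBLE
    )
    USING iceberg
""")
```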
Let’s take a peek:
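(A quick DESCRIBE does the job.)

```python
spark.sql("DESCRIBE TABLE local.db.stocks").show(truncate=False)
```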
As you can see, there is no column called “last_upd_ts”.
Now, let’s create a dataframe representing new incoming data we will merge with the table:
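(These rows are made up; the point is just that the source data matches the original schema, with no last_upd_ts on its side.)

```python
from datetime import date

# Hypothetical incoming rows, registered as a temp view so the MERGE can use it
incoming = spark.createDataFrame(
    [
        ("AAPL", date(2024, 1, 2), 185.64),
        ("MSFT", date(2024, 1, 2), 376.04),
    ],
    ["ticker", "trade_date", "close_price"],
)
incoming.createOrReplaceTempView("incoming_stocks")
```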
Alright, so we have our foundational table, we have our feature flag bool ready to go, and we have our new incoming data. Now the fun begins:
If our feature is turned on, check whether the target table already contains the column for the feature; if not, alter the table:
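(A sketch using the flag and table names from above.)

```python
if last_upd_ts_enabled:
    existing_cols = [field.name for field in spark.table("local.db.stocks").schema.fields]
    if "last_upd_ts" not in existing_cols:
        # Iceberg handles schema evolution, so this is a metadata-only change
        spark.sql("ALTER TABLE local.db.stocks ADD COLUMN last_upd_ts TIMESTAMP")
```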
Assemble the MERGE statement that accounts for this feature flag:
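(One way to stitch it together, reusing the names from above; the WHEN NOT MATCHED BY SOURCE clause at the end is the historical backfill and needs a fairly recent Spark/Iceberg combo.)

```python
# Pieces shared by both versions of the MERGE
update_set = "t.close_price = s.close_price"
insert_cols = "ticker, trade_date, close_price"
insert_vals = "s.ticker, s.trade_date, s.close_price"
backfill_clause = ""

if last_upd_ts_enabled:
    # Stamp the audit column on updates and inserts
    update_set += ", t.last_upd_ts = current_timestamp()"
    insert_cols += ", last_upd_ts"
    insert_vals += ", current_timestamp()"
    # Optional historical backfill: stamp rows the incoming data never touches
    backfill_clause = (
        "WHEN NOT MATCHED BY SOURCE AND t.last_upd_ts IS NULL "
        "THEN UPDATE SET t.last_upd_ts = current_timestamp()"
    )

merge_sql = f"""
MERGE INTO local.db.stocks t
USING incoming_stocks s
  ON t.ticker = s.ticker AND t.trade_date = s.trade_date
WHEN MATCHED THEN
  UPDATE SET {update_set}
WHEN NOT MATCHED THEN
  INSERT ({insert_cols}) VALUES ({insert_vals})
{backfill_clause}
"""
```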
That last part of the MERGE statement, the historical backfill, is optional. If I’m introducing a new column, I’d like to fill the historical NULLs in that column, since those rows would not get touched by a standard merge statement.
And how does this look when we run all of it?
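Roughly like this; run the merge and peek at the table again, and with the flag on, old and new rows alike should come back with a populated last_upd_ts:

```python
spark.sql(merge_sql)
spark.table("local.db.stocks").show(truncate=False)
```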
Bingo!
Conclusion and Other Thoughts
This article demonstrated how feature flags work at a very basic level in a Spark ETL pipeline. It’s pretty straightforward, and I hope this demystifies some of the marketing hype you might have heard.
But going further, if I wanted to implement feature flags at a large scale for an enterprise, I’d make this stuff an API service that is abstracted away from the pipeline. This is where I could see the REAL value of feature flags, and where users really do not have to “change their codebase” whatsoever, other than maybe adding 1 extra line. For our example of adding a last_upd_ts, you could have that be an entirely separate piece of code, where all your pipeline does is something along the lines of (pseudo):
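```python
# Pure pseudo-code: "call_some_api" and its arguments are hypothetical.
# The service behind it would own the flag lookup, the ALTER TABLE,
# and rewriting the MERGE on the pipeline's behalf.
call_some_api(
    feature="last_upd_ts",
    table="local.db.stocks",
)
```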
But you’d have to put a lot of elbow grease into this because this API would effectively have to perform all the steps we mentioned above. And what if your MERGE statement had some complex predicates in it? Then this “call_some_api” thing would have to account for all of that and be even more complex.
Anyways, I hope this article helped make sense of the madness of feature flags at a high level.
Thanks for reading,
Matt