Warning - There is no code in this post, which is unusual for my blog. This is instead a 5-minute rant of me getting on my soapbox about clouds and platforms and the "dreaded" lock-in.
Additional Note - Yeah, I realize ChatGPT misspelled "databases" in the image above. That was iteration #3; the first two were even worse; I'll take this as a W.
Every day, I see post after post after post of vendors or platforms promising that by using their "open" platform, you will be able to freely move about and leave whenever you want.
Those statements are disingenuous and far from the truth. When your organization makes the decision to invest in a platform or a cloud, they will usually gravitate toward the least complex setup possible, which is more or less the native services on that platform. It's rare that you would see a company decide to use AWS as their cloud platform of choice and then subsequently say "Oh, but hey, to be multi-cloud, we will use Google BigQuery to analyze all the data that's in our AWS cloud". That would be a very poor decision for numerous reasons, such as:
latency to egress the data out of AWS over to GCP
egress costs
the general financial overhead of maintaining two cloud footprints to enable a capability you could have delivered in one
One could make the counter-argument of "But actually, we will use Google BigQuery Omni instead". Omni is basically a stripped-down version of BigQuery packaged in a VM.
So you are now telling me you want to slap another cloud provider's stripped-down service into your cloud and run it there? What happens when things break and you need to start filing tickets? Do you file them with Amazon? Do you file them with Google? Why not both, just for fun!
Pro Tip - Keep It Simple, Stupid
That pro tip is one of my consistent guiding lights for data engineering. The fewer kinks in the chain, the fewer points of failure. The less complexity in your architecture, the less need for highly skilled individuals to tune and troubleshoot problems as they arise.
Bottom line - When you invest in a cloud, try to stick with their native services for your data to keep the architecture (and costs) as simple as possible.
Now onto the Platform Argument
You have a lot of vendors out there touting that their platform is open, that you can leave whenever you'd like, and that it's easy to migrate. I'm calling BS on this one as well.
"The cloud is like Hotel California. Your data can check in anytime you'd like, but it can never leave" ~ Brent Ozar
Let's look at Snowflake, for example. To use their platform with good performance, you have to load your data into Snowflake. So how does one do that? They use Snowpark or some other medium that has robust connections to the Snowflake ecosystem.
What happens then if you want to leave Snowflake and go to Google BigQuery? Well, you'd have to export all that data out of Snowflake, write a bunch of custom scripts, load the data into Google BigQuery, run tests to true-up and ensure the data migration was successful, and on top of that, you will probably be running parallel ETL loads to both systems for quite a while as you migrate your user base over.
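To give a feel for what even the "easy" true-up step involves, here is a minimal sketch of one way to compare exports from the two systems. It assumes both warehouses can dump a table to CSV; the file names are hypothetical, and real migrations would also need type normalization, null handling, and per-column checks.

```python
import csv
import hashlib

def table_fingerprint(csv_path: str) -> tuple[int, str]:
    """Row count plus an order-independent hash of a CSV export."""
    row_count = 0
    digest = 0
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            row_count += 1
            # XOR the per-row hashes so row order doesn't matter
            h = hashlib.sha256("|".join(row).encode()).hexdigest()
            digest ^= int(h, 16)
    return row_count, format(digest, "064x")

# Hypothetical exports from the source and target warehouses:
# src = table_fingerprint("snowflake_orders_export.csv")
# dst = table_fingerprint("bigquery_orders_export.csv")
# assert src == dst, "true-up failed: exports do not match"
```

Multiply this by every table, every edge case in type coercion, and the parallel loads running while users migrate, and "easy to leave" starts to look less easy.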
Well, what about Databricks? They use Spark, so it can't be that hard to leave, right? ...Right???
Ok, let's crack open that can of worms. Over the last few years, what has Databricks been steering their audience toward for table management? Ding Ding Ding - Unity Catalog.
So you want to leave Unity and go to AWS Glue?
Again, you will need to write a bunch of migration scripts
you will need to have ETL loads for both systems in parallel as you migrate users
Does AWS Glue support workspaces, MLflow, BI dashboards?
Kind of... AWS has comparable services, but you will have to go migrate all of that as well
Does AWS have the same type of security permission grants that Databricks has on their workspaces? Kind of; SageMaker Lakehouse was announced at re:Invent 2024, and is essentially AWS's answer to Databricks.
Does AWS Glue have very good support for the Delta Lake format? Not really; there are many known bugs with AWS crawlers doing schema inference on Delta Lake, and AWS Athena doesn't support writing to Delta Lake. There are numerous limitations highlighted here. AWS has chosen Iceberg as the lakehouse format it robustly supports.
Bottom line though - There will still be a ton of work; migration ain't easy. And it's definitely not as easy as these vendors and platform providers want you to think.
Ok Mr. Smartypants. How Would You Do It Then?
I hate to break it to you, but there is no silver bullet. There is only a list of best practices that I've come to know over my career and use as my guidance to minimize the friction as much as possible.
Stick to native services in the cloud as much as you can
Stick with Parquet for your open file format, and only transition to a lakehouse format when performance considerations warrant it. If a cloud offers robust support for a lakehouse format such as Delta or Iceberg, then use that
When I say robust, I mean almost all of the cloud provider's services work well with the format to read, write, and analyze at scale. I also mean that the cloud provider has continued to invest in said lakehouse format - you see monthly updates from the cloud provider highlighting improvements to it.
For ETL, avoid proprietary vendor tools when possible. If you are a SQL junkie, go with dbt to transform your data; at least you can re-point it to other data warehouses as needed while still maintaining the scripts centrally. If you are not a SQL junkie, go with vanilla Spark. Don't get suckered into a cloud provider's proprietary Spark flavor with its own sugar syntax, such as AWS Glue or BigQuery Spark. Use a VM or a Spark offering that doesn't have the vendor's hooks all in it, and run vanilla Spark instead. This will make migrations less painful in the long run.
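The "re-point" idea is worth making concrete: keep the transformation in plain, portable SQL so that changing warehouses only changes the connection, not the logic. A toy sketch below uses Python's stdlib sqlite3 as a stand-in warehouse; the table and column names are made up for illustration. With dbt, the same SELECT would live in a model file and the target warehouse is swapped in your profile config.

```python
import sqlite3

# The transformation lives in plain, portable SQL. Swapping warehouses
# (Snowflake -> BigQuery, say) should only change the connection below,
# never this text.
TRANSFORM_SQL = """
SELECT customer_id, SUM(amount) AS total_spend
FROM raw_orders
GROUP BY customer_id
ORDER BY customer_id
"""

def run_transform(conn) -> list:
    # Any DB-API 2.0 style connection works here; sqlite3 is just a stand-in.
    return conn.execute(TRANSFORM_SQL).fetchall()

# Demo against an in-memory stand-in warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])
print(run_transform(conn))  # [(1, 15.0), (2, 7.5)]
```

The moment that SQL picks up vendor-specific syntax, or the Spark job picks up Glue-only constructs, the re-pointing story falls apart - which is exactly the lock-in this post is ranting about.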
Conclusion
You can't avoid cloud or platform lock-in when you choose a place to run your workloads. All you can do is try to keep the reference architecture and tech stack as simple as possible, all in an effort to make migrations less painful in the long run.
Thanks for reading,
Matt