Warning - There is no code in this post, which is unusual for my blog. This is instead a 5 minute rant of me getting on my soap box about clouds and platforms and the "dreaded" lock-in.
Additional Note - Yes, I realize ChatGPT misspelled "databases" in the image above. That was iteration #3; the first two were even worse; I'll take it as a W.
Every day, I see post after post after post of vendors or platforms promising that by using their "open" platform, you will be able to freely move about and leave whenever you want.
Those statements are disingenuous and far from the truth. When your organization decides to invest in a platform or a cloud, it will usually gravitate toward the least complex setup possible, which is more or less the native services on that platform. It's rare to see a company pick AWS as its cloud platform of choice and then subsequently say, "Oh, but hey, to be multi-cloud, we will use Google BigQuery to analyze all the data that's in our AWS cloud." That would be a very poor decision for numerous reasons, such as:
latency to egress the data out of AWS over to GCP
egress costs
just financials in general for having 2 cloud footprints to enable a capability that you could have done in 1 cloud
One could make the counter-argument of "But actually, we will use Google BigQuery Omni instead." Omni is basically a stripped-down version of BigQuery packaged in a VM.
So you are now telling me you want to take another cloud provider's stripped-down service and run it in your cloud? What happens when things break and you need to start filing tickets? Do you file them with Amazon? Do you file them with Google? Why not both, just for fun!
Pro Tip - Keep It Simple, Stupid
That pro tip is one of my consistent guiding lights for data engineering. The fewer kinks in the chain, the fewer points of failure. The less complexity in your architecture, the less you need highly skilled individuals to tune and troubleshoot problems as they arise.
Bottom line - When you invest in a cloud, try to stick with their native services for your data to keep the architecture (and costs) as simple as possible.
Now onto the Platform Argument
You have a lot of vendors out there touting that their platform is open, that you can leave whenever you'd like, and that it's easy to migrate. I'm calling BS on this one as well.
"The cloud is like Hotel California. Your data can check in anytime you'd like, but it can never leave" ~ Brent Ozar
Let's look at Snowflake, for example. To use their platform with good performance, you have to load your data into Snowflake. So how does one do that? You use Snowpark or some other medium that has robust connections to the Snowflake ecosystem.
What happens then if you want to leave Snowflake and go to Google BigQuery? Well, you'd have to export all that data out of Snowflake, write a bunch of custom scripts, load the data into Google BigQuery, run tests to true-up and ensure the data migration was successful, and on top of that, you will probably be running parallel ETL loads to both systems for quite a while as you migrate your user base over.
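To make the "true-up" step concrete, here's a minimal sketch of how you might fingerprint an export from each side and compare them. This is stdlib-only and purely illustrative: the file names are hypothetical stand-ins for real exports (e.g. from `COPY INTO` on the Snowflake side and `bq extract` on the BigQuery side), and a production true-up would check a lot more than row counts and checksums.

```python
import csv
import hashlib
import os
import tempfile

def table_fingerprint(csv_path):
    """Return (row_count, order-independent checksum) for a CSV export."""
    row_count = 0
    digest = 0
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            row_count += 1
            h = hashlib.sha256("|".join(row).encode()).hexdigest()
            digest ^= int(h, 16)  # XOR makes the checksum independent of row order
    return row_count, format(digest, "x")

# Stand-in exports; in a real migration these would be the actual dump files
rows = [["order_id", "amount"], ["1", "9.99"], ["2", "25.00"]]
tmp = tempfile.mkdtemp()
for name in ("snowflake_orders.csv", "bigquery_orders.csv"):
    with open(os.path.join(tmp, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)

src = table_fingerprint(os.path.join(tmp, "snowflake_orders.csv"))
dst = table_fingerprint(os.path.join(tmp, "bigquery_orders.csv"))
assert src == dst, "migration true-up failed: exports diverge"
```

The point is less the hashing trick and more the fact that *you* end up owning this validation code, one script per table, for the duration of the migration.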
Well, what about Databricks? They use Spark, so it can't be that hard to leave, right?…Right???
Ok, let's crack open that can of worms. Over the last few years, what has Databricks been steering its audience toward for table management? Ding ding ding - Unity Catalog.
So you want to leave Unity and go to AWS Glue?
again, you will need to write a bunch of migration scripts
you will need to have ETL loads for both systems in parallel as you migrate users
Does AWS Glue support workspaces, MLflow, and BI dashboards?
Kind of…AWS has comparable services, but you will have to migrate all of that as well
Does AWS have the same type of security permission grants that Databricks has on its workspaces? Kind of; SageMaker Lakehouse was announced at re:Invent 2024 and is essentially AWS's answer to Databricks.
Does AWS Glue have very good support for the Delta Lake format? Not really; there are many known bugs with AWS Glue crawlers doing schema inference on Delta Lake, and AWS Athena doesn't support writing to Delta Lake. There are numerous limitations highlighted here. AWS has chosen Iceberg as the lakehouse format it supports most robustly.
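As a taste of what "a bunch of migration scripts" means in practice, here's a toy sketch of one piece of a Unity-to-Glue move: reshaping a table description into the `TableInput` structure that boto3's `glue.create_table` expects. The input dict is a simplified, hypothetical stand-in for what you'd actually pull from the Databricks APIs, and a real script would also handle partitions, SerDe info, grants, and error cases.

```python
def unity_to_glue_table_input(table):
    """Map a simplified table description (hypothetical shape, not the
    real Unity Catalog REST payload) onto the TableInput dict that
    boto3's glue.create_table call expects."""
    return {
        "Name": table["name"],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"table_type": table.get("format", "delta")},
        "StorageDescriptor": {
            "Location": table["storage_location"],
            "Columns": [
                {"Name": c["name"], "Type": c["type"]}
                for c in table["columns"]
            ],
        },
    }

example = {
    "name": "orders",
    "format": "delta",
    "storage_location": "s3://my-bucket/orders",
    "columns": [{"name": "order_id", "type": "bigint"},
                {"name": "amount", "type": "double"}],
}
table_input = unity_to_glue_table_input(example)
```

Multiply that by every table, every permission grant, and every workspace asset, and the "easy exit" story starts to wobble.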
Bottom line though - There will still be a ton of work; migration ain't easy. And it's definitely not as easy as these vendors and platform providers want you to think.
Ok Mr. Smartypants. How Would You Do It Then?
I hate to break it to you, but there is no silver bullet. There is only a list of best practices that I've come to know over my career and use as my guidance to minimize the friction as much as possible.
Stick to native services in the cloud as much as you can
Stick with Parquet as your open file format, and only transition to a lakehouse format when performance considerations warrant it. If a cloud offers robust support for a lakehouse format such as Delta or Iceberg, then use that
By robust, I mean almost all of the cloud provider's services work well with the format to read, write, and analyze at scale. I also mean the cloud provider has continued to invest in said lakehouse format: you see monthly updates from the cloud provider highlighting improvements to it.
For ETL, avoid proprietary vendor tools when possible. If you are a SQL junkie, go with dbt to transform your data; at least you can re-point it to other data warehouses as needed while still maintaining the scripts centrally. If you are not a SQL junkie, go with vanilla Spark. Don't get suckered into a cloud provider's proprietary Spark flavor with its own syntactic sugar, such as AWS Glue or BigQuery Spark. Use a VM or a Spark offering that doesn't have the vendor's hooks all in it, and run vanilla Spark instead. This will make migrations less painful in the long run.
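The "maintain the scripts centrally, re-point as needed" idea can be sketched in a few lines. Here sqlite3 stands in for a warehouse driver purely so the example runs anywhere; the model SQL and table are made up, and in real life dbt does this re-pointing for you via its profiles, with a different target per warehouse. The transformation itself stays plain SQL, so swapping the connection underneath is the whole migration story for this layer.

```python
import sqlite3

# One transformation kept centrally as plain SQL (dbt-model style)
DAILY_REVENUE_SQL = """
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
"""

def run_model(conn):
    """Run the central model on any DB-API connection; sqlite3 here
    is a stand-in for a Snowflake or BigQuery driver."""
    cur = conn.cursor()
    cur.execute(DAILY_REVENUE_SQL)
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2024-01-01", 10.0), ("2024-01-01", 5.0),
                  ("2024-01-02", 7.5)])
result = run_model(conn)
```

The less warehouse-specific syntax in those SQL files, the less rewriting when the connection string changes.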
Conclusion
You can't avoid cloud or platform lock-in when you choose a place to run your workloads. All you can do is try to keep the reference architecture and tech stack as simple as possible, all in an effort to make migrations less painful in the long run.
Thanks for reading,
Matt