I’ve been on a Spark config kick lately. A week or so ago, I published a post showing how to configure Spark for the AWS Glue Iceberg REST endpoint. Overall, the config is not too complex, but it does require a ton of inputs. Compare that to the duck and the contrast is stark.
However, after working through that article, I was somewhat disappointed that the AWS Glue Iceberg REST endpoint doesn’t support CTAS (CREATE OR REPLACE TABLE … AS SELECT) statements. It’s not entirely a dealbreaker for me, but I wanted to find a solution that was as close as possible to the AWS Glue Spark environment without having to go full Docker bananas.
So We Will Call This AWS Glue Iceberg Part 2
I decided to take another crack at the Spark config for AWS Glue Iceberg, but this time using the Iceberg AWS Glue packages instead of the REST endpoint. Below is what that config ended up looking like:
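Roughly speaking, it’s a PySpark session built along these lines. Treat this as a sketch rather than the exact block: the package versions, the catalog name, the warehouse bucket, and the JVM workaround flags at the end are all placeholders to swap for your own setup.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("glue-iceberg")
    # Iceberg Spark runtime + the AWS bundle (Glue/S3 clients), plus hadoop-aws
    # for generic s3a:// access. Versions here are assumptions -- match yours.
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
        "org.apache.iceberg:iceberg-aws-bundle:1.6.1,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "glue" backed by the AWS Glue Data Catalog
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://your-bucket/warehouse/")  # placeholder
    # The network and Java option workarounds for the local crash -- illustrative only,
    # the exact flags you need may differ on your machine/JDK.
    .config("spark.driver.host", "127.0.0.1")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.extraJavaOptions", "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED")
    .getOrCreate()
)
```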
You might ask…what the heck are those network and extra Java option settings at the end? I was getting a weird crash with local PySpark on my M2 Pro when using the packages required for this concoction, and after some help from Claude, that’s what ended up working.
Does this support CTAS Though?
Yeaaaaassss it does! So now I have an Iceberg/AWS/Glue config (that’s a mouthful) that supports the following Iceberg operations:
CREATE TABLE
CREATE OR REPLACE TABLE AS SELECT
UPDATE
INSERT
DELETE
MERGE
That pretty much covers it all at this point.
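If you want to sanity-check that list yourself, something along these lines runs against the glue catalog from the sketch above. The demo namespace and table names here are made up for illustration.

```python
# Create a namespace and table in the Glue catalog, then exercise each operation
spark.sql("CREATE NAMESPACE IF NOT EXISTS glue.demo")
spark.sql("CREATE TABLE IF NOT EXISTS glue.demo.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO glue.demo.orders VALUES (1, 9.99), (2, 19.99)")
spark.sql("UPDATE glue.demo.orders SET amount = 24.99 WHERE id = 2")
spark.sql("DELETE FROM glue.demo.orders WHERE id = 1")

# CTAS -- the statement the REST endpoint wouldn't let me run
spark.sql("""
    CREATE OR REPLACE TABLE glue.demo.big_orders
    USING iceberg
    AS SELECT * FROM glue.demo.orders WHERE amount > 10
""")

# MERGE back into the original table
spark.sql("""
    MERGE INTO glue.demo.orders t
    USING glue.demo.big_orders s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT *
""")
```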
Side Note - I’m looking forward to the day DuckDB’s Iceberg extension supports MERGE and CTAS. As of now, it supports basic inserts.
Now Let’s Turn It Up A Notch
The fact remains that Iceberg is still relatively “new” these days, and many orgs still have a large swath of their data sitting in either CSV or good ol’ Parquet files out on S3. So, can we enhance the config concoction above to also cover plain S3?
Well, after whacking my head against a desk a few times, yes we can. You can either become a paid subscriber here and subsidize my Advil prescription for the headaches, or continue to enjoy the recent articles free of charge…you pick 😁
Anyways…what does the full AWS Glue Iceberg + S3 Spark config concoction look like? Here it is in all its glory:
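Again, a sketch rather than the exact block: same packages and catalog settings as before, with four extra Hadoop/S3A settings at the bottom for reading plain Parquet and CSV straight off S3. The specific keys, versions, region, and paths are my placeholders to adjust.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("glue-iceberg-plus-s3")
    # Same package list as the earlier config -- nothing new added here
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
        "org.apache.iceberg:iceberg-aws-bundle:1.6.1,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://your-bucket/warehouse/")  # placeholder
    .config("spark.driver.host", "127.0.0.1")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.extraJavaOptions", "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED")
    # The four additions: plain S3 access via s3a://, using the default
    # credential chain so no keys ever land in code
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .config("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")  # placeholder region
    .config("spark.hadoop.fs.s3a.path.style.access", "false")
    .getOrCreate()
)

# e.g. read a plain Parquet dataset straight off S3 (path is a placeholder)
df = spark.read.parquet("s3a://your-bucket/raw/some_parquet_dataset/")
```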
Not much different from what I showed earlier in this article. No new packages were added, but we did add four more configs towards the bottom.
Pro Tip - Use the default credential chain to authenticate to AWS as much as possible. It mimics how things run in production on AWS, and you’re less likely to leak credentials when publishing content.
So How Can I Test and Run This?
As usual, I built a test harness script to handle running both the AWS Glue Iceberg and plain S3 paths. That can be found here.
Some Musings
The more time I spend writing Spark configs, the more I believe managed Spark services such as AWS Glue and Databricks are worth the $$$. You can spend hours on end fighting Spark configs and weird JVM runtime errors, or you can just call it a day and give Bezos your hard-earned cash.
Additionally, I’m looking forward to the day that DuckDB can accomplish the following, which my gut tells me is not too far off:
Support AWS Glue Iceberg CTAS, MERGE, DELETE, UPDATE
Support AWS Glue Non-Iceberg Catalog Stuff (e.g. parquet and CSV)
BTW - I did post an article recently where I came up with my own creative solve to allow Iceberg and non-Iceberg tables in AWS Glue to work together in DuckDB. That can be found here.
Scale to half a TB, and possibly more. I’ve been told by some folks at MotherDuck that it’s absolutely possible, but I can’t imagine what size rig you’d need to throw at it. I want to run that half a TB through a laptop.
I think after today’s article, I’m going to take a pause from Spark. As much as it has paved the way for the better part of the last two decades of data engineering, it’s an incredible hassle to configure and deal with.
Thanks for reading,
Matt