I’ve been on a never-ending quest for the holy grail of local testing: getting as close as possible to AWS production without having to go full-blown Docker bananas. When the AWS Glue Iceberg REST catalog was very subtly announced at re:Invent a year ago (circa December 2024), I still didn’t have a decent understanding of what the “REST catalog” hoopla was all about. However, I recently got to tinker with it, and, as any immature engineer would do, I made some quick assumptions that received quite a bit of pushback/backlash…or, better yet, this classic meme from the legendary Steve Jobs:
I’m honestly glad that people spoke up and educated me. Now that I’ve spent a decent amount of time with the Iceberg REST Catalog, I have a much different perspective. Because of that, I’d like to share with you an exhaustive test harness of the AWS Iceberg REST catalog using spark to illustrate its capabilities.
The Environment Config
As mentioned in the intro, my goal is to build a local environment that closely aligns with the AWS runtimes so that I don’t have to worry about compatibility issues.
According to Amazon’s docs on Glue 5.0 (their most recent rev of Glue Spark), they run the following setup:
Python 3.11
Spark 3.5.4
Scala 2.12.18
Java 17
Thus, we will do the same, and here’s the proof:
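(If you want to run the same check on your own setup, a quick script along these lines prints it all out; the _jvm lookups are just py4j asking the JVM for the Scala and Java versions.)

```python
# Quick sanity check that the local setup lines up with Glue 5.0
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print("Python:", sys.version.split()[0])                              # expecting 3.11.x
print("Spark: ", spark.version)                                       # expecting 3.5.4
print("Scala: ", sc._jvm.scala.util.Properties.versionString())       # expecting 2.12.x
print("Java:  ", sc._jvm.System.getProperty("java.version"))          # expecting 17.x
```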
And Now…The Spark Config
Every time I have to build out a spark config that has to do more than just interact with a local file system, I get this confused look on my face.
The spark config for AWS Glue Iceberg REST is no different. Luckily, AWS does provide this document that gives us a template. However, the packages referenced are a few years old, and we like to live dangerously and in the now…thus, this is what my pyspark config ended up looking like, in all its glory:
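(Reconstructed here as a runnable sketch rather than a screenshot: the region, account ID, catalog name, and package versions are placeholders you’d swap for your own, and the last few lines are the s3a extras I call out in the side note further down, wired up the way I’d expect them to look.)

```python
from pyspark.sql import SparkSession

# Placeholders -- swap in your own values
REGION = "us-east-1"
ACCOUNT_ID = "<your-aws-account-id>"   # the Glue REST endpoint uses your account ID as the warehouse
CATALOG = "glue_rest"

spark = (
    SparkSession.builder.appName("glue-iceberg-rest-local")
    # Iceberg runtime + AWS bundle, plus hadoop-aws for plain s3a:// reads (the "extra" package)
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,"
        "org.apache.iceberg:iceberg-aws-bundle:1.7.1,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # The Glue Iceberg REST catalog itself
    .config(f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{CATALOG}.type", "rest")
    .config(f"spark.sql.catalog.{CATALOG}.uri", f"https://glue.{REGION}.amazonaws.com/iceberg")
    .config(f"spark.sql.catalog.{CATALOG}.warehouse", ACCOUNT_ID)
    .config(f"spark.sql.catalog.{CATALOG}.rest.sigv4-enabled", "true")
    .config(f"spark.sql.catalog.{CATALOG}.rest.signing-name", "glue")
    .config(f"spark.sql.catalog.{CATALOG}.rest.signing-region", REGION)
    .config(f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    # The extra lines for reading raw parquet/csv straight off S3 (outside the catalog)
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)
```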
Dear Lord…I should charge people for setting up their spark configs so that they don’t have to…oh wait, Databricks already checked that box 😆.
For posterity, this is what it looks like to attach an AWS Glue Iceberg REST Catalog in DuckDB:
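(Something along these lines, using the duckdb Python client; the account ID and region are placeholders, and the ATTACH options come from the DuckDB iceberg extension docs, so double-check them against whatever release you’re on.)

```python
import duckdb

con = duckdb.connect()

# The iceberg extension handles the REST catalog; httpfs handles the S3 reads
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Let DuckDB pick up AWS credentials the same way the CLI does
con.execute("""
    CREATE SECRET glue_secret (
        TYPE s3,
        PROVIDER credential_chain,
        REGION 'us-east-1'
    )
""")

# The "warehouse" is just your AWS account ID
con.execute("""
    ATTACH '<your-aws-account-id>' AS glue_catalog (
        TYPE iceberg,
        ENDPOINT_TYPE glue
    )
""")

con.sql("SHOW ALL TABLES").show()
```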
To be fair, though, DuckDB’s implementation of the Iceberg REST catalog is more recent, so they had time to see what made more sense from a configuration standpoint. Additionally, DuckDB, as of last month, made its grand foray into supporting writes to Iceberg. I’ve done some testing and it’s definitely buggy, but I have no doubt that the duck will catch up quickly.
Side Note - I did add an additional package and two more config lines to my pyspark setup vs. what AWS provided in their docs. The additional package allows me to read files from S3, such as parquet and CSV, for situations where I’m not querying an Iceberg table.
So What Are We Testing?
We will be testing the following common/standard ETL SQL patterns that you will find in arguably 99% of all spark jobs:
Creating a Table
Creating a table from a select statement (CTAS)
Inserting values
Inserting from a SELECT query (which includes a join)
Updates
Updates with a join
Deletes
MERGE - The G.O.A.T.
Additionally, we won’t just test a simple merge; we will test a merge that has complex match predicates and deletes.
To make this testing easy, I built a series of SQL files for all these tests:
When we take a peek at one of the individual files, this is what we are looking at:
One might ask, “How can you execute spark.sql against a file that has multiple statements and variable placeholders?”
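(An illustrative stand-in rather than the exact file; the table names are placeholders, and the {curly brace} tokens are the substitution variables the helper below swaps in.)

```sql
-- merge_upsert_with_deletes.sql  (illustrative stand-in, not the exact file)

-- Statement 1: stage the incoming batch for this run
CREATE OR REPLACE TEMPORARY VIEW orders_batch AS
SELECT *
FROM {catalog}.{database}.orders_staging
WHERE load_date = DATE '{run_date}';

-- Statement 2: merge with extra match predicates and a delete branch
MERGE INTO {catalog}.{database}.orders AS t
USING orders_batch AS s
  ON  t.order_id = s.order_id
  AND t.order_date >= date_sub(DATE '{run_date}', 90)
WHEN MATCHED AND s.is_cancelled = TRUE THEN DELETE
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```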
Below is the helper function I built to handle these scripts. This is a common trick when I need spark to do more than just make data frames: the function reads in the file, splits it into its individual statements, and formats in the variable substitutions where applicable.
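(A trimmed-down sketch of that approach; the naive split on “;” assumes no semicolons are hiding inside string literals.)

```python
from pathlib import Path

def run_sql_file(spark, path: str, **params) -> None:
    """Execute every statement in a SQL file, with {placeholder} substitution."""
    raw = Path(path).read_text()

    # Swap in run-time values, e.g. catalog/database names or a run date
    script = raw.format(**params)

    # Split into individual statements and drop empty chunks
    statements = [s.strip() for s in script.split(";") if s.strip()]

    for stmt in statements:
        spark.sql(stmt)

# Usage:
# run_sql_file(
#     spark,
#     "sql/merge_upsert_with_deletes.sql",
#     catalog="glue_rest",
#     database="iceberg_tests",
#     run_date="2025-06-01",
# )
```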
Well, How Did It Go?
Every one of our tests worked…except for updates with a join (which can be handled with a MERGE statement) and CTAS. For me, CTAS has become a convenience, but it’s by no means a deal breaker. When I attempted to run the CTAS query, which looks like this:
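(Reconstructed with placeholder catalog/table names rather than my real ones.)

```python
# Roughly the shape of the CTAS that blew up
spark.sql("""
    CREATE TABLE glue_rest.demo_db.customer_totals
    USING iceberg
    AS
    SELECT customer_id, SUM(order_total) AS lifetime_total
    FROM glue_rest.demo_db.orders
    GROUP BY customer_id
""")
```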
…java/pyspark ended up erroring and barfing about 200 lines of the call stack, but at the top of it, there was this statement:
An error occurred while calling o57.sql.: org.apache.iceberg.exceptions.BadRequestException: Malformed request: Stage create is not supported at org.apache.iceberg.rest.ErrorHandlers$DefaultErrorHandler
According to the AWS docs for the Iceberg REST Catalog, they make it clear as daylight that this staged create is the call behind a CTAS, and that it isn’t supported (as seen in the screenshot below). Amazon states you can just create the table and run an insert statement as a workaround.
According to the Apache Iceberg docs, though, CTAS is absolutely a supported statement; thus, AWS just fell asleep at the wheel on this one. My hunch is that the engineering must have been a lot of extra work and they put it on the back burner.
I went ahead and did the simple workaround: pre-create the skeleton table, then read in a parquet file and do an insert/overwrite. That worked fine:
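(Again with placeholder table names and an illustrative S3 path.)

```python
# 1. Pre-create the skeleton table -- a plain CREATE TABLE works fine against the REST catalog
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_rest.demo_db.customer_totals (
        customer_id     BIGINT,
        lifetime_total  DECIMAL(18, 2)
    )
    USING iceberg
""")

# 2. Read the source parquet straight off S3 (this is where hadoop-aws earns its keep)
src = spark.read.parquet("s3a://my-bucket/exports/customer_totals/")

# 3. Insert/overwrite into the Iceberg table
src.createOrReplaceTempView("customer_totals_src")
spark.sql("""
    INSERT OVERWRITE glue_rest.demo_db.customer_totals
    SELECT customer_id, lifetime_total
    FROM customer_totals_src
""")
```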
As a second round of validation to ensure the rest of our tests actually “worked”, let’s pop open AWS Athena and see what we got:
Winner Winner Chicken Dinner!
Where Do We Go From Here?
I think the Iceberg REST catalog for local AWS Glue development provides a “good enough” setup to mimic how Glue 5.0 production would run. I plan to integrate this more into my workflows going forward, as it keeps me on a purer spark path while, at the same time, I don’t have to deal with EMR.
Thanks for reading,
Matt