DuckDB is one of my favorite data processing platforms, simply because of its ease of use; however, it can do more than just ETL. You can easily create databases, persist them to disk, and heck, now you can even build a lakehouse on the platform via DuckLake (see my previous article here). But how do you ultimately serve the data in the duck for downstream consumption? Here are some well-known options/patterns these days:
giant flattened tables - these work well until you want to do ad hoc analysis at a grain the table doesn’t hold
shared queries - just hope the next user you pass it to doesn’t accidentally edit the query in a way that produces inaccurate results
curated dashboards - these are costly to build and maintain, time-consuming, and you will find yourself constantly making changes based on the audience you are talking to
These options are okay and have chugged along in corporate America for decades, but they all suffer from a common problem:
Providing end users with custom queries or dashboards does not fundamentally solve the creation, ownership, and maintenance of business metrics. These all require manual upkeep and can exist in numerous places.
I can’t tell you how many times in my career I’d create a new dataset or dashboard, build the business metrics into it, and then 3-6 months later someone out in the ether comes up to me saying, “your numbers don’t match mine.” This starts a domino effect of meetings, distrust in the calculations, and distrust in the data; and 99% of the time, said user was simply not pulling the data the correct way, either because of batch process timing or because of specific filters needed to return only active records (hello, ACTV_FLAG = True). So how do you solve this nuance and finally put the kibosh on it? Enter the semantic layer…
Semantic Layer - What Is It?
A semantic layer is a business-friendly data source that provides dynamic calculation of metrics at whatever grain and filters you apply, with ZERO intervention from the data engineer or business user to create or update a custom query. I repeat: ZERO INTERVENTION. This sounds a little absurd when you first read it. You might be thinking, “But if we change from the zip code to the state level, I have to update the GROUP BY in the SQL, right???” No. In a good semantic layer, you define the metric once and only once, and then any downstream BI tool or “data sciency” tool can plug right into it. This concept is also known as a Single Source of Truth (SSOT). And what’s interesting is that the idea of a semantic layer is a few decades old, yet we are just now (circa June 2025) seeing some of the big players like Snowflake and Databricks implement their own incarnations, as seen below:
The first time I got a glimpse of a semantic layer was in SQL Server Analysis Services (SSAS), a product also known as a “data cube”. When I saw it in action, my mind immediately began to click, and I had the “aha” moment of “this is truly how one should serve business metrics for quick ad hoc analysis and formal reports.” But SSAS has its drawbacks. The architecture at the time required a full wipe/reload/rebuild (whatever you want to call it) every night to “cube out” all the metrics at every possible aggregation and grain. Basically, SSAS would run a job in the evening that took the metrics cube offline, calculated every possible aggregate combination, and stored the results on disk, so that when a user hopped on the next day, they could slice data fast; nothing had to be calculated on the fly, the cube simply looked up the pre-calculated result.
How Can We Implement a Modern Semantic Layer?
Fast forward to today, though, and things have changed dramatically. Cubes no longer need to be fully persisted, although you can still pre-calculate the metrics of your choice for added speed. In today’s world, we build virtual semantic layers using modern tools and simple markup such as YAML files. And this is where we get into the meat and potatoes of this article: building a semantic layer for DuckDB.
But first, let’s generate our source dataset in DuckDB. For this, I’m going to create a database called “city”, a table called “data”, and populate it with some dummy data like so:
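If you want to follow along, here’s a minimal sketch of that step, assuming a city.duckdb file on disk; the column names and values are my own placeholders, not necessarily the exact dummy data from the original:

```python
import duckdb

# Connect to (or create) the "city" database file on disk
con = duckdb.connect("city.duckdb")

# Build a small dummy "data" table; columns are illustrative placeholders
con.execute("""
    CREATE OR REPLACE TABLE data AS
    SELECT *
    FROM (VALUES
        ('Austin',  'TX', '78701',  950000),
        ('Dallas',  'TX', '75201', 1300000),
        ('Denver',  'CO', '80202',  715000),
        ('Boulder', 'CO', '80301',  110000)
    ) AS t(city_name, state_abrv, zip_code, population)
""")

# Sanity check
print(con.execute("SELECT * FROM data").fetchall())
con.close()
```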
Now that we have our dummy dataset, we can move on to our semantic layer. We will be using a semantic modeling tool called Cube (the cube.dev project, sometimes still referred to as Cube.js). It’s open source, very easy to implement, and fast. Here’s a link to their docs:
To create a semantic layer in Cube, we need to create just a couple of files:
A docker-compose file, which pulls in the Cube service and points it at our DuckDB database
A YAML file that defines our cube structure (measures and dimensions)
The Docker Compose File
This literally is all it is…not joking:
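For reference, a compose file along these lines should be all you need; treat it as a sketch rather than the exact file in the screenshot (the volume mount and the city.duckdb path are my own assumptions):

```yaml
services:
  cube:
    image: cubejs/cube:latest
    ports:
      - "4000:4000"    # dev UI / playground
      - "15432:15432"  # Postgres-compatible SQL endpoint for BI tools
    environment:
      - CUBEJS_DEV_MODE=true
      - CUBEJS_DB_TYPE=duckdb
      - CUBEJS_DB_DUCKDB_DATABASE_PATH=/cube/conf/city.duckdb
    volumes:
      - .:/cube/conf   # project folder holding the DuckDB file and the model YAML
```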
No fancy Dockerfile pulling in a bunch of packages; just this. Let’s step through it:
The ports section has two ports:
4000 - this port gives us a UI to view and QA the cube, which is nice
15432 - this port lets us hook the cube up to the BI tool of our choice. When Cube runs, it exposes a Postgres-compatible endpoint on this port that we can connect to from tools such as Power BI and Tableau, simply using their Postgres connector
The environment section
The first flag, CUBEJS_DEV_MODE=true, enables the UI for us, as mentioned above
CUBEJS_DB_TYPE=duckdb - this simply tells Cube that our data source is DuckDB
CUBEJS_DB_DUCKDB_DATABASE_PATH=…
this indicates where our DuckDB database file lives
Again, that’s literally it for Docker.
The Cube Definition File
For our cube definition, I put together a simple YAML file with 3 dimensions and 2 measures. Below is a screenshot:
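If you’d rather copy-paste than squint at a screenshot, a cube definition roughly along these lines matches what’s shown; the cube name and column names are my own placeholders, carried over from the dummy-data sketch above:

```yaml
cubes:
  - name: city_data
    sql_table: data

    measures:
      - name: count
        type: count
      - name: total_population
        sql: population
        type: sum

    dimensions:
      - name: city_name
        sql: city_name
        type: string
      - name: state_abrv
        sql: state_abrv
        type: string
      - name: zip_code
        sql: zip_code
        type: string
```

In the Docker setup above, a file like this would typically live in the project’s model (or schema) folder, e.g. model/city_data.yml, so Cube picks it up on startup.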
It’s pretty easy to get the hang of the syntax in this file: you simply provide the name of the DuckDB table you want to hit, then define each measure and dimension with a datatype, and that’s it. You can do a lot more customization on the SQL side for each measure, but we will keep this intro demo simple.
Let’s Launch This Thing
Here we go -
docker compose up
Alright, that should take a few seconds. Then, as previously mentioned, the UI for this cube is exposed on port 4000, so let’s go to localhost:4000 and see what we’ve got:
As you can see, we have our cube up and running. Let’s click on the state_abrv, total_population, and count and then “Run Query”:
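Under the hood, Cube pushes the aggregation down to DuckDB for us; the generated query is roughly equivalent in shape to something like this (an illustration based on my cube sketch above, not the exact SQL Cube emits):

```sql
SELECT
    state_abrv,
    SUM(population) AS total_population,  -- the total_population measure
    COUNT(*)        AS record_count       -- the count measure
FROM data
GROUP BY state_abrv;
```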
Wow, that was fast and easy. And this is literally the tip of the iceberg. One thing I found while working with Cube is that you can adjust the cube YAML file while the container is running, simply refresh the browser, and the new measures and/or dimensions show up immediately. This makes building and QA fast and fluid.
But You Said I Could Hook This Thing Into Any BI Tool?
You are correct; as long as the BI tool supports connecting to Postgres, it can be done. For today’s demo, since Power BI doesn’t exist on a Mac and I don’t want to pay for a Tableau Desktop license, I’ll instead demo hooking the cube up in a simple notebook. Below is how to do it:
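Here’s a minimal sketch of what that notebook cell can look like, using psycopg2 and pandas against Cube’s Postgres-compatible SQL endpoint; the cube name city_data comes from my YAML sketch above, and the credentials are the dev-mode placeholders described below:

```python
import pandas as pd
import psycopg2

# Cube's SQL API speaks the Postgres wire protocol on port 15432
conn = psycopg2.connect(
    host="localhost",
    port=15432,
    dbname="test",        # dev-mode database name
    user="username",      # placeholder dev credentials
    password="password",
)

# MEASURE() asks Cube to aggregate a measure at the requested grain
df = pd.read_sql(
    """
    SELECT state_abrv,
           MEASURE(total_population) AS total_population,
           MEASURE(count) AS record_count
    FROM city_data
    GROUP BY 1
    """,
    conn,
)
print(df)
conn.close()
```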
As you can see, it’s pretty easy. In dev mode, the database that the Postgres endpoint places the cube in is called “test”, and the username and password are just username/password. You can customize all of that later for production use cases. And that’s all, folks.
What’s Next?
In a future article (or two), I might go into these Cube topics:
creating custom folders to group measures and dimensions
joining multiple tables together in a cube
doing more complex measure calculations involving custom SQL and filters
Hope this article gave you some ideas and inspired you to try it yourself.
Thanks for reading,
Matt