Throughout my career, I've had to, at times, pull government and/or public datasets for analytics. For example, when I worked as an analyst in transportation and logistics, each month we would go to the Department of Energy website and pull in fuel forecasts to help us understand where our budget would potentially land if the price of fuel went up. But back when I was doing this, we didn't have the luxury of the Polars/DuckDBs of the world making our job easy. It was a lot of click/download/copy/"pasta" action.
Fast forward to today, and pretty much any CSV or Excel file the government publishes can be pulled into a data pipeline relatively easily for processing and charting. For the rest of this article, I will demonstrate just how brain-dead easy this is with DuckDB.
Pro Tip - The best data engineers I ever worked with did one simple thing before certifying and publishing a new dataset: they would plot the data in a chart and check for outliers and anomalies. And if they found one, they would send the chart plus the raw data to their business partner and ask whether it was expected or whether the garbage data needed to be tossed.
The Dataset
The dataset we will pull in is power plant information from the EPA website. I've always been curious about just how many coal, natural gas, and nuclear plants we have in the United States. To find out, I went hunting around on Google until I found this link:
When you download the file, you will see it's in XLSX format; I browsed through a few tabs until I found the one that had a list of all power plants along with their latitude/longitude so we could plot them on a map.
Honestly, you gotta hand it to the EPA. They actually include this supplemental PDF that explains all the data.
Wrangling the Data
Wrangling the data was straightforward. Since it's at an HTTPS location and in XLSX format, I first installed a couple of extensions in DuckDB.
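Something along these lines works (a minimal sketch from a Python notebook, assuming the httpfs extension for reading over HTTPS and the excel extension for XLSX support):

```python
import duckdb

con = duckdb.connect()

# httpfs lets DuckDB read files directly over https;
# the excel extension provides the read_xlsx() table function
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("INSTALL excel")
con.execute("LOAD excel")
```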
Next, to query it, I wrote the following:
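Here's a sketch of the shape of that query; the URL, sheet name, and column names are placeholders, so swap in the actual EPA download and the tab that lists every plant:

```python
# NOTE: the URL, sheet name, and column names below are placeholders --
# substitute the actual EPA file and the tab with the full plant list
query = """
    SELECT
        "Plant name"                    AS plant_name,
        "Plant primary fuel category"   AS fuel_type,
        try_cast("Latitude"  AS DOUBLE) AS lat,
        try_cast("Longitude" AS DOUBLE) AS lon
    FROM read_xlsx('https://www.epa.gov/.../power_plants.xlsx', sheet = 'PLNT')
    -- the second row is extra header text, so keep only rows where the
    -- latitude actually casts to a number
    WHERE try_cast("Latitude" AS DOUBLE) IS NOT NULL
"""
con.sql(query).show()
```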
You'll notice I put a try_cast in the WHERE clause; the second row of the spreadsheet had more header text, and I didn't see an option on the excel extension to skip N header rows. The try_cast technique has been a mainstay in my SQL arsenal for a while as a quick way to filter out garbage rows.
To limit the number of hits I was doing on the government website, I decided to make the data a local table in DuckDB.
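That's just a CREATE TABLE ... AS wrapped around the same query, something like:

```python
# Materialize the remote data into a local DuckDB table so repeated
# queries (and chart tweaks) don't keep hitting the EPA website
con.execute(f"CREATE OR REPLACE TABLE power_plants AS {query}")
```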
Let's take a peek at the data:
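For instance, a sample of rows and a count by fuel type from the power_plants table we just created:

```python
# Quick sanity checks on the local table
con.sql("SELECT * FROM power_plants LIMIT 10").show()
con.sql("""
    SELECT fuel_type, count(*) AS plants
    FROM power_plants
    GROUP BY fuel_type
    ORDER BY plants DESC
""").show()
```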
Alright, now that we have it, how do we visualize it?
Enter Matplotlib
I feel that matplotlib doesn't get enough praise these days in the data engineering world. Most of the time, when data engineers want to plot data, they either copy/paste into Excel or hook in a BI tool such as Tableau or Power BI. But matplotlib is a Python package that is good enough on its own and can easily plot your data on a variety of charts, including bars, lines, and maps. This took a little elbow grease on my part and, I will not lie, a little bit of Copilot to help get it done quickly. But in the end, to plot all these power plants on a map, this is roughly what the code boils down to.
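In essence, it's a scatter of longitude vs. latitude colored by fuel type. Here's a simplified sketch; the fuel_type codes are assumptions carried over from the placeholder query above:

```python
import matplotlib.pyplot as plt

# Colors per fuel category -- wind, solar, and hydro are commented out
# because there are so many of them the map becomes unreadable
fuel_colors = {
    "COAL": "black",
    "GAS": "orange",
    "NUCLEAR": "red",
    # "WIND": "green",
    # "SOLAR": "gold",
    # "HYDRO": "blue",
}

df = con.sql("SELECT fuel_type, lat, lon FROM power_plants").df()

fig, ax = plt.subplots(figsize=(12, 7))
for fuel, color in fuel_colors.items():
    subset = df[df["fuel_type"] == fuel]
    ax.scatter(subset["lon"], subset["lat"], s=8, c=color,
               label=f"{fuel} ({len(subset)})")

ax.set_title("US power plants by primary fuel")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend()
plt.show()
```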
You'll notice that I ended up commenting out wind, solar, and hydro; there are so many across the country that they basically render the map useless. Now onto the fun part: how does this look plotted?
I mean, it's not Tableau-level fancy, but it helps illustrate the main goal. However, I'm noticing that I can hardly see the 54 nuclear plants on the map. Maybe we should do a bar chart as well?
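Counting by fuel type and handing the result to matplotlib only takes a few more lines (same assumed columns as before):

```python
# Count plants per fuel type for the three categories we care about
counts = con.sql("""
    SELECT fuel_type, count(*) AS plants
    FROM power_plants
    WHERE fuel_type IN ('COAL', 'GAS', 'NUCLEAR')
    GROUP BY fuel_type
    ORDER BY plants DESC
""").df()

fig, ax = plt.subplots()
ax.bar(counts["fuel_type"], counts["plants"])
ax.set_ylabel("Number of plants")
ax.set_title("US power plant counts by primary fuel")
plt.show()
```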
And here’s how that looks:
Alright, we can definitely see the stark contrast between natural gas power plants (a whopping 1,993 of them) and coal and nuclear. And finally, what analysis would be complete without a pie chart 😂?
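Reusing the counts from the bar chart:

```python
# Pie chart of the same three fuel categories
fig, ax = plt.subplots()
ax.pie(counts["plants"], labels=counts["fuel_type"], autopct="%1.0f%%")
ax.set_title("Share of coal, gas, and nuclear plants")
plt.show()
```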
Conclusion
There you have it, folks: we were able to demonstrate pulling in a public/government dataset and doing some fun stuff with charts, all in a Python notebook.
Until next time…
Thanks for reading,
Matt