I often find that when I’m working with log files and need to match the data up to transactional datasets, I have to use some form of fuzzy matching logic. Usually it’s a combination of hard IDs plus timestamps with some slack on the timestamp range between the two tables, since logging events are asynchronous and won’t always line up exactly with the transactional tables they are logging against. As an example, let’s pretend we have the following datasets (using polars to build the data frames out):
Now, let’s assume we want to join the two tables together. Our join criteria will be as follows:

- txn_id = txn_id (the hard match)
- txn_dt within +/- 2 days (this is the fuzzy logic)
How would one do this, you might ask? In polars, it’s actually pretty simple. You perform an inner join on txn_id, then add a filter predicate on txn_dt that applies the +/- 2 day window described above. This looks as follows:
Conclusion
This was probably one of the shortest articles I’ve ever written, but fuzzy matching will help you throughout your career whenever you need to combine datasets that don’t align cleanly. You can extend the example I provided well beyond dates, checking things like approximate string matches instead. Hope this helps.
Thanks for reading,
Matt