
Preventing common analytical errors: Join duplicates and join misses

Baron Schwartz gives two great examples of just how common - yet profoundly consequential - semantic errors in SQL can be. Seemingly trivial differences in syntax can produce results that look like they’re correct - that is, what the user intended to produce - but in actuality are not. These bugs do not show up at compilation or run-time, but rather in meetings when colleagues or clients scratch their heads over numbers that “don’t look right.” Semantic errors are especially insidious because they can remain undetected for months until people happen to look in the right places, at which point the damage may already be done. Could you imagine trying to manage your cash flow based on a completely misreported balance sheet?

Today I will cover two more major semantic errors, both of which occur during SQL joins:

  1. Join duplications
  2. Join misses

Finally, I’ll cover a few steps you can (and always should) take to mitigate the risk of committing these errors.

Join duplications #

A common misconception about SQL joins is that they give you exactly as many rows as the join specifies - for example, that a left join will give you the number of rows in the left table, or that an inner join will give you the number of rows common to both tables.

This is not true: SQL joins will give you at least as many rows as the join specifies. For example, if you have a left table with 10 rows, you are guaranteed to have at least 10 rows after the join, but you may also have 20 or 100 depending on what you are joining to.

This is best demonstrated by example - consider the two following tables:

sales

| date | product | customer | quantity |
| --- | --- | --- | --- |
| 2017-07-01 | Stapler | Alice | 1 |
| 2017-07-01 | Tissues | James | 3 |
| 2017-07-05 | Soap | Kevin | 2 |
| 2017-07-06 | Tissues | Kelly | 2 |
| 2017-07-06 | Pack of pens | Richard | 3 |

products

| product | price | creation_date_utc |
| --- | --- | --- |
| Stapler | 4.50 | 2017-07-01 |
| Tissues | 2.75 | 2017-07-01 |
| Soap | 3.50 | 2017-07-01 |
| Pack of pens | 2.50 | 2017-07-01 |
| Tissues | 3.00 | 2017-08-01 |

Let’s say I want to join the `products` table to our log of sales data (the `sales` table) in order to calculate total revenue, being careful not to drop any sales. In this case I would use a LEFT JOIN because, no matter what, I want to make sure I have the same number of sales transactions before and after the join. The SQL looks like:

```sql
SELECT *, (s.quantity * p.price) AS revenue
FROM sales s
LEFT JOIN products p ON s.product = p.product
```

I expect to get a result set like this:

| date | product | customer | quantity | product | price | creation_date_utc | revenue |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2017-07-01 | Stapler | Alice | 1 | Stapler | 4.50 | 2017-07-01 | 4.50 |
| 2017-07-01 | Tissues | James | 3 | Tissues | 2.75 | 2017-07-01 | 8.25 |
| 2017-07-05 | Soap | Kevin | 2 | Soap | 3.50 | 2017-07-01 | 7.00 |
| 2017-07-06 | Tissues | Kelly | 2 | Tissues | 2.75 | 2017-07-01 | 5.50 |
| 2017-07-06 | Pack of pens | Richard | 3 | Pack of pens | 2.50 | 2017-07-01 | 7.50 |

Total revenue: 4.50 + 8.25 + 7.00 + 5.50 + 7.50 = $32.75

Instead, however, I get this:

| date | product | customer | quantity | product | price | creation_date_utc | revenue |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2017-07-01 | Stapler | Alice | 1 | Stapler | 4.50 | 2017-07-01 | 4.50 |
| 2017-07-01 | Tissues | James | 3 | Tissues | 2.75 | 2017-07-01 | 8.25 |
| 2017-07-01 | Tissues | James | 3 | Tissues | 3.00 | 2017-08-01 | 9.00 |
| 2017-07-05 | Soap | Kevin | 2 | Soap | 3.50 | 2017-07-01 | 7.00 |
| 2017-07-06 | Tissues | Kelly | 2 | Tissues | 2.75 | 2017-07-01 | 5.50 |
| 2017-07-06 | Tissues | Kelly | 2 | Tissues | 3.00 | 2017-08-01 | 6.00 |
| 2017-07-06 | Pack of pens | Richard | 3 | Pack of pens | 2.50 | 2017-07-01 | 7.50 |

Total revenue: 4.50 + 8.25 + 9.00 + 7.00 + 5.50 + 6.00 + 7.50 = $47.75

What happened?

We can see that the two rows for “Tissues” in the left table got duplicated - we now have four of them! This is because, when joining on the `product` column, the join condition (or “join-predicate”) is true for multiple rows. In other words, “Tissues” in the left table equals both “Tissues” rows in the right table, so we get two rows for which the condition is true. This happens twice, once for each “Tissues” row in the left table, yielding two duplicated rows.

One can imagine how problematic this is: we have magically grown our sales without ever completing another transaction! Join duplications cause us to overstate our results.

A join duplication like the one shown above seems like an easy mistake to catch: you can clearly see the duplication in the data. But is it as easy to catch if this is your query?

```sql
SELECT SUM(s.quantity * p.price) AS revenue
FROM sales s
LEFT JOIN products p ON s.product = p.product
```

Suddenly your result set looks like this:

| revenue |
| --- |
| 47.75 |

Not so easy to spot anymore!

Without getting too deep into the theory, this shouldn’t happen if you have a correctly specified data model with proper normalization. You should be performing joins through foreign keys, which link to primary keys, which must be unique, thereby eliminating the risk of join duplications. Alas if only real-world data were so simple! The reality is that data models are rarely ideal, meaning you have to prepare for unexpected results when working with them.

Before we jump to how to fix this, let’s cover the other common semantic error with joins.

Note: join duplications can occur with every type of join, including inner joins.

Join misses #

I call these join misses because I think of them intuitively like cache misses: you’d like to have your result pulled from a cache, but it wasn’t there so you got nothing instead. With join misses, we expected to join on a certain key but because the condition is not true (for various reasons), we get nothing instead.

Take this example:

sales

| date | product | product_cost_code | quantity |
| --- | --- | --- | --- |
| 2017-07-01 | Stapler | J | 1 |
| 2017-07-01 | Tissues | T | 3 |
| 2017-07-05 | Soap | R | 2 |
| 2017-07-06 | Tissues | T | 2 |
| 2017-07-06 | Pack of pens | X | 3 |

product_costs

| product_cost_code | product_cost |
| --- | --- |
| J | 1.80 |
| T | .60 |
| R | 2.10 |

Say we want to find the total costs associated with each of our sales - to do this, we join the `product_costs` table to the `sales` table. Again, we know we never want to drop sales data, so we use a LEFT JOIN. Our query looks like:

```sql
SELECT *, (quantity * product_cost) AS total_cost
FROM sales s
LEFT JOIN product_costs c ON s.product_cost_code = c.product_cost_code
```

Our result set looks like:

| date | product | product_cost_code | quantity | product_cost_code | product_cost | total_cost |
| --- | --- | --- | --- | --- | --- | --- |
| 2017-07-01 | Stapler | J | 1 | J | 1.80 | 1.80 |
| 2017-07-01 | Tissues | T | 3 | T | .60 | 1.80 |
| 2017-07-05 | Soap | R | 2 | R | 2.10 | 4.20 |
| 2017-07-06 | Tissues | T | 2 | T | .60 | 1.20 |
| 2017-07-06 | Pack of pens | X | 3 | NULL | NULL | NULL |

Our total cost across all sales is: 1.80 + 1.80 + 4.20 + 1.20 = $9.00.

But is it really? Surely the pack of pens costs something? Just because we don’t have a record in the `product_costs` table for it doesn’t mean there’s really no cost, it simply means it’s not in the database. This join miss merits more investigation. Perhaps the pack of pens is an invalid transaction; perhaps it should have a record in the `product_costs` table. It’s hard to tell, but we can be fairly sure that our total cost isn’t $9.00.

Again, this seems like it should be pretty obvious to spot - after all, there’s an entire row of NULLs from the right table!

How about now?

```sql
SELECT SUM(quantity * product_cost) AS total_cost
FROM sales s
LEFT JOIN product_costs c ON s.product_cost_code = c.product_cost_code
```

Results:

| total_cost |
| --- |
| 9.00 |

Suddenly it’s not so clear if the $9.00 is correct or not.

Unlike join duplications, which overstate our results, join misses cause us to understate our results. 

The solution: be a data skeptic #

Perhaps the most valuable skill in data analysis is skepticism: what are all the ways my analysis could be wrong? It is all too easy to fit a hypothesis post hoc to explain why your numbers make sense - it is far harder to sanity check along the way and confirm that each data transformation matches what you expected a priori. Preventing join duplications and join misses requires a high degree of skepticism.
 

Join duplications

Join duplications have a simple solution: always count the number of rows before and after a join. If you start with 100 transactions and after joining a table you have 140, you need to investigate.

In SQL, this means running a COUNT(*) after each join. If you join several tables (e.g. 5+) without taking any counts along the way, you are taking a large risk. Better to join, check for join duplications, check for join misses, and only then continue joining incrementally.

In Python pandas, this is easy to do with a decorator. I use a decorator around pandas.merge() called @guarantee, which guarantees that the number of rows going into the join equals the number of rows coming out - otherwise it throws an error.
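A minimal sketch of what such a decorator might look like - the name @guarantee comes from the text above, but the exact check (that the left table’s row count is preserved, as you’d expect from a clean left join) is my assumption:

```python
import functools

import pandas as pd


def guarantee(merge_func):
    """Fail loudly if a merge changes the number of rows.

    Illustrative sketch only: assumes a left-join-style merge where the
    left table's row count should be preserved exactly.
    """
    @functools.wraps(merge_func)
    def wrapper(left, right, *args, **kwargs):
        rows_in = len(left)
        result = merge_func(left, right, *args, **kwargs)
        rows_out = len(result)
        if rows_in != rows_out:
            raise ValueError(
                f"Join changed row count: {rows_in} rows in, {rows_out} rows out"
            )
        return result
    return wrapper


# Wrap pandas.merge once and use the guarded version for every join
guarded_merge = guarantee(pd.merge)
# joined = guarded_merge(sales, products, on="product", how="left")
```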

If this guarantee fails, you need some selection mechanism to return just one row per key from the right table (e.g. the record with the most recent date, or the lowest/highest ID number). Whatever you do, the values in the right table’s join key should be unique. If you don’t reduce the right table to unique rows, you’ll end up with extra rows and inflated results.
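For example, assuming the business rule is “the most recent price per product wins” (that rule is an assumption for illustration - your data may call for something else), you could deduplicate the right table before joining:

```python
import pandas as pd

# DataFrames mirroring the `sales` and `products` tables above
sales = pd.DataFrame({
    "date": ["2017-07-01", "2017-07-01", "2017-07-05", "2017-07-06", "2017-07-06"],
    "product": ["Stapler", "Tissues", "Soap", "Tissues", "Pack of pens"],
    "quantity": [1, 3, 2, 2, 3],
})
products = pd.DataFrame({
    "product": ["Stapler", "Tissues", "Soap", "Pack of pens", "Tissues"],
    "price": [4.50, 2.75, 3.50, 2.50, 3.00],
    "creation_date_utc": ["2017-07-01", "2017-07-01", "2017-07-01", "2017-07-01", "2017-08-01"],
})

# Reduce the right table to one row per join key: most recent price wins
latest_products = (
    products.sort_values("creation_date_utc")
            .drop_duplicates(subset="product", keep="last")
)

joined = sales.merge(latest_products, on="product", how="left")
assert len(joined) == len(sales), "join duplication slipped through"
```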
 

Join misses

Join misses are a little trickier to debug because some NULL values rightly belong in your data, while others are simply introduced by join misses.

You need to investigate these manually. My strategy is to pick a column in the right table that you expect to be highly populated and count the number of NULLs in that column before and after the join. If the counts don’t match, you have produced join misses and need to investigate whether they are legitimate.

In pandas I typically use Series.value_counts(dropna=False) to check the distribution of data in a column before and after joins. If this looks like exploratory data analysis (EDA), that’s because it is. You should always be doing EDA even after you initially vet your data set - this is part of the sanity checking process.
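Here is a sketch of that check using the join-miss example above (the DataFrame construction is only there to make the snippet self-contained):

```python
import pandas as pd

# DataFrames mirroring the `sales` and `product_costs` tables above
sales = pd.DataFrame({
    "product": ["Stapler", "Tissues", "Soap", "Tissues", "Pack of pens"],
    "product_cost_code": ["J", "T", "R", "T", "X"],
    "quantity": [1, 3, 2, 2, 3],
})
product_costs = pd.DataFrame({
    "product_cost_code": ["J", "T", "R"],
    "product_cost": [1.80, 0.60, 2.10],
})

joined = sales.merge(product_costs, on="product_cost_code", how="left")

# Distribution of the cost column, NaNs included, after the join
print(joined["product_cost"].value_counts(dropna=False))

# The right table had no missing costs, so any NaN here is a join miss
missed = joined[joined["product_cost"].isna()]
print(f"{len(missed)} of {len(joined)} rows found no match")
```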

Conclusion #

A prudent analyst must always be aware of the potential for join duplications and misses. You may get join misses (say, in pandas) if your two key columns have different data types - the values may look the same, but they won’t compare as equal. You may get join duplicates (in SQL) if your comparison is case-insensitive: an “A” is treated the same as an “a”, so two seemingly distinct keys both match, violating the uniqueness we require on the join key.

One thing I am always surprised to see omitted from introductory SQL courses is an admission of just how tricky joins are. Unless you can safely assume that your data model is perfect (i.e. never), joins are extremely prone to introducing errors. If you join data but join it incorrectly (and produce inaccurate results), you are better off not having joined the data at all. Joins should not be used unless you know the risks that come with them.

SQL joins are not easy, nor are they meant to be intuitive. They require critical thought and foresight. If there’s one thing I constantly remind myself of when analyzing data, it’s to take my time and think about all the ways I could be wrong.