Why data analysts should exercise caution when taking averages (Part 1)

This is Part 1 in a 2-part series on miscalculating averages and its implications.

Few statistics rival the simplicity and explanatory power of those which are the most basic: the percentage, sum and average, among others. As a result they crop up everywhere, from exploratory data analysis to data dashboards and management reports.

But one of these, the arithmetic mean, is unusually problematic: while it is conventionally taught with one-dimensional data in mind, calculating it for multidimensional data requires a fundamentally different process. In fact, calculating the arithmetic mean as if your data were one-dimensional will produce grossly incorrect figures, sometimes orders of magnitude off from what was intended. For me, this was a humbling realization: the arithmetic mean deserves just as much double- and triple-checking as any other calculation.

Back to basics

The arithmetic mean is defined as:

SUM(all observations) / COUNT(number of observations)

We can see this through a simple example:

first_name    num_apples
Katie         4
Alan          8
John          10
Tess          8
Jessica       5

In this example, there is one categorical dimension (first_name), one numerical measure (num_apples) and five observations, also called records, rows, measurements or facts. The grain of fact is what makes each individual record unique: in this case, the grain is composed only of the (first_name) dimension. If Katie appeared twice in the name column, we would know that (first_name) could not be the grain because it is no longer composed of unique observations.

Using the formula above, we can calculate the arithmetic mean:

(4 + 8 + 10 + 8 + 5) / 5 = 35 / 5 = 7

In SQL, we would write:

SELECT AVG(num_apples) FROM apples
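If you'd like to check this end to end, here is a minimal sketch using Python's built-in sqlite3 module (the in-memory database and scaffolding are my own; the query is the one above):

```python
import sqlite3

# Build the one-dimensional example table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apples (first_name TEXT, num_apples INTEGER)")
conn.executemany(
    "INSERT INTO apples VALUES (?, ?)",
    [("Katie", 4), ("Alan", 8), ("John", 10), ("Tess", 8), ("Jessica", 5)],
)

# AVG computes SUM / COUNT for us.
(avg,) = conn.execute("SELECT AVG(num_apples) FROM apples").fetchone()
print(avg)  # 7.0
```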

Adding complexity: two-dimensional data

first_name    last_name    num_apples
Katie         Smith        4
Alan          Johnson      8
Alan          Howard       10
Tess          O'Neal       8
Katie         Powers       5

Now there are two dimensions (first_name, last_name), one measure (num_apples) and again five observations. Notably, the grain is no longer just (first_name), because Katie and Alan each appear twice; the grain is now the combination of two columns: (first_name, last_name).

We can also say the grain is (effectively) equal to the primary key of the table. The primary key is the set of one or more columns whose values must be unique for a given table; in practice, every well-designed table should have one. You can think of the grain of fact as the semantic description of what makes a record unique, while the primary key is the database implementation of that description. I say "effectively" equal because while the grain and primary key may not contain exactly the same set of columns, they will almost always be one-to-one (i.e. each grain of fact corresponds to only one unique record as defined by the primary key, and vice versa).

We can now calculate a few different means. "What is the average number of apples across people?":
SELECT AVG(num_apples) FROM apples

This evaluates to 7, just as before.

How about: "What's the average number of apples across people with unique first names?" We can see that there are 3 unique first names (Katie, Alan, Tess) and the same total as before (35), so the answer is 35 / 3 ≈ 11.67. In SQL we would write:

SELECT AVG(num_apples) FROM (
    SELECT first_name, SUM(num_apples) AS num_apples
    FROM apples
    GROUP BY first_name
) AS t

Notice how by changing the business question, we decreased the denominator from 5 to 3. We did this by reducing the dimensionality of the data (also known as collapsing or rolling up). Our grain was composed of two dimensions (first_name, last_name) but our analysis concerned only one dimension (first_name), so we summed up all the data beneath that one dimension (first_name) and calculated the average across it. The point here is that we specified the count of items in the denominator by collapsing our data onto a single dimension.
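To see the change in denominator concretely, the sketch below runs both queries against the two-dimensional table, again using Python's sqlite3 (the scaffolding is mine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apples (first_name TEXT, last_name TEXT, num_apples INTEGER)")
conn.executemany(
    "INSERT INTO apples VALUES (?, ?, ?)",
    [("Katie", "Smith", 4), ("Alan", "Johnson", 8), ("Alan", "Howard", 10),
     ("Tess", "O'Neal", 8), ("Katie", "Powers", 5)],
)

# Average across all five records: denominator of 5.
(avg_all,) = conn.execute("SELECT AVG(num_apples) FROM apples").fetchone()

# Collapse onto first_name first: denominator drops to 3.
(avg_first,) = conn.execute("""
    SELECT AVG(num_apples) FROM (
        SELECT first_name, SUM(num_apples) AS num_apples
        FROM apples
        GROUP BY first_name
    ) AS t
""").fetchone()

print(avg_all, avg_first)  # avg_all is 7.0; avg_first is 35/3, roughly 11.67
```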

Summarizing where we are so far

We know that the combination of one or more dimensions composes the grain of fact (and, effectively, the primary key) of the table. The business question asks for an average on one dimension, which is fewer than the number of dimensions composing the grain of fact; therefore we have to collapse the data onto that one dimension before taking the average. By collapsing the data, we control the number of records in the denominator, so that it includes not all the records in our table (given by the primary key) but only those records we care about (given by the soon-to-be-discussed "collapsing key").

When calculating the arithmetic mean, there are two questions you need to answer:

  1. What exactly is the business question asking? On which dimension are we taking the average?
  2. How do we translate this business question into SQL?

For example, "the average number of items sold across days" implies that, no matter how many records are in our table, we need to collapse the data onto the dimension "day" alone and take the average of the measure "number of items," which ultimately yields a single number.

Why did I write "across days" instead of "by days"? Because grammar can make things confusing...

GROUP BYs and averages

date_picked    grade       first_name    last_name    num_apples
2017-01-01     freshman    Katie         Smith        4
2017-01-04     junior      Alan          Johnson      8
2017-01-04     junior      Alan          Howard       10
2017-01-04     freshman    Tess          O'Neal       8
2017-01-01     freshman    Katie         Powers       5

Here we have four dimensions (date_picked, grade, first_name, last_name), one measure (num_apples) and, as usual, five observations. At this point it's not entirely clear what the grain is: it could be (first_name, last_name), but it could also be (date_picked, grade, first_name, last_name). In other words, if (Katie, Smith) appeared twice, with differing values of grade or date_picked, should those two records actually be combined into one? Or are grade and date_picked part of what makes a record unique? Fortunately, as we will see, knowing the primary key is not necessary for calculating the arithmetic mean (though you could always check the actual implementation in the database).

We can calculate some more averages - for example, "what's the average number of apples picked across days?" There are two unique days (2017-01-01, 2017-01-04) so we know that's our denominator and our total, as before, is 35. The answer is 35 / 2 = 17.5 apples picked per day, on average. In SQL:
SELECT AVG(num_apples) FROM (
    SELECT date_picked, SUM(num_apples) AS num_apples
    FROM apples
    GROUP BY date_picked
) AS t

Now another average: "what's the average number of apples picked across days by grade?"

So we'll look at each grade one-by-one: first there is "freshman". Within "freshman", we have two unique dates (2017-01-01 and 2017-01-04), so 2 is our denominator. The numerator is the sum of those three records (4 + 8 + 5) = 17. The average apples picked per day for freshmen then is 17 / 2 = 8.5. Next is "junior". Here we have only one unique date (2017-01-04) and two records, the sum of which is (8 + 10) = 18. The average apples picked per day for juniors is 18 / 1 = 18. In SQL:

SELECT grade, AVG(num_apples) FROM (
    SELECT date_picked, grade, SUM(num_apples) AS num_apples
    FROM apples
    GROUP BY date_picked, grade
) AS t
GROUP BY grade

In the results grid, your SQL output would look like:

grade       AVG(num_apples)
freshman    8.5
junior      18

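As a sanity check, the nested query can be run end to end; here is a sketch using Python's sqlite3 (the setup code is my own, the query matches the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE apples
    (date_picked TEXT, grade TEXT, first_name TEXT, last_name TEXT, num_apples INTEGER)""")
conn.executemany(
    "INSERT INTO apples VALUES (?, ?, ?, ?, ?)",
    [("2017-01-01", "freshman", "Katie", "Smith", 4),
     ("2017-01-04", "junior", "Alan", "Johnson", 8),
     ("2017-01-04", "junior", "Alan", "Howard", 10),
     ("2017-01-04", "freshman", "Tess", "O'Neal", 8),
     ("2017-01-01", "freshman", "Katie", "Powers", 5)],
)

# Inner query collapses onto (date_picked, grade);
# outer query averages across days within each grade.
rows = conn.execute("""
    SELECT grade, AVG(num_apples) FROM (
        SELECT date_picked, grade, SUM(num_apples) AS num_apples
        FROM apples
        GROUP BY date_picked, grade
    ) AS t
    GROUP BY grade
""").fetchall()
print(dict(rows))  # {'freshman': 8.5, 'junior': 18.0}
```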
What would happen if I took the same question as above, but reworded it as "what's the average number of apples picked by day by grade?" Suddenly it's not so clear what exactly I'm grouping on (grade) and what I'm averaging across (days) because I use "by" to mean both. Confusing!

If the business question isn't clear, it's your job as the data analyst to figure out how the words translate into SQL.

Formalizing the pattern

Understanding how to average multidimensional data is best tackled with a bit of terminology. There are three essential elements:

  • Collapsing key: the key whose dimensions we explicitly specify to be used for either grouping or averaging the data. This key is specified by the inner query's GROUP BY clause.
  • Grouping key: the key whose dimensions specify the groups for which we will calculate averages. This (optional) key is specified by the outer query's GROUP BY clause.
  • Averaging key: the key across which we ultimately want to evaluate the metric.
    • This key is not explicitly specified: rather, it equals the collapsing key minus the grouping key. Any dimensions that are in the collapsing key but not in the grouping key constitute the averaging key.
    • Most business questions are best understood with an averaging key of one dimension.

Let's put these terms into an actual SQL query, recycling the example from above:

SELECT grade, AVG(num_apples) FROM (
   SELECT date_picked, grade, SUM(num_apples) AS num_apples
   FROM apples
   GROUP BY date_picked, grade -- COLLAPSING KEY
) AS t
GROUP BY grade -- GROUPING KEY

-- AVERAGING KEY (IMPLIED): [date_picked, grade] - [grade] = [date_picked]

For example, if the collapsing key has 4 dimensions and the grouping key has 2, the averaging key will have 2 (difficult to interpret). If the collapsing key has 1 dimension and there is no grouping key, the collapsing key equals the averaging key. 

Why is the primary key left out of all of this? Because without collapsing, you never know how many records you are unwittingly including in the denominator of your calculation (at worst, every record in the table, as given by the primary key). As a result, you must explicitly define a collapsing key. You can think of the primary key as what defines an individual fact for your table, while the collapsing key is what defines an individual fact for your business question.

Finally, notice what gives rise to the common mistake: we never explicitly declare the most important piece of our business question, the averaging key! Here we see "the pitfall of multidimensional averages": the averaging key is implicit, not explicit.

The takeaway from this section: collapsing key - grouping key = averaging key
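The pattern can even be written down once as a small helper. Below is an illustrative pure-Python sketch (the function name and structure are my own invention, not a standard API): it collapses rows onto the collapsing key, buckets the collapsed totals by the grouping key, and the averaging key falls out as whatever is left over.

```python
from collections import defaultdict

def multidim_avg(rows, measure, collapsing_key, grouping_key=()):
    """Collapse, then group, then average: averaging key = collapsing key - grouping key."""
    # Step 1 (collapsing key): sum the measure for each unique collapsing-key tuple.
    collapsed = defaultdict(float)
    for row in rows:
        collapsed[tuple(row[d] for d in collapsing_key)] += row[measure]
    # Step 2 (grouping key): bucket the collapsed totals by the grouping-key dimensions.
    # Whatever is left over -- the averaging key -- is what each mean is taken across.
    idx = [collapsing_key.index(d) for d in grouping_key]
    grouped = defaultdict(list)
    for key, total in collapsed.items():
        grouped[tuple(key[i] for i in idx)].append(total)
    # Step 3: one arithmetic mean per group.
    return {g: sum(v) / len(v) for g, v in grouped.items()}

# The apples-by-grade example from above, as a list of dicts.
rows = [
    {"date_picked": "2017-01-01", "grade": "freshman", "num_apples": 4},
    {"date_picked": "2017-01-04", "grade": "junior",   "num_apples": 8},
    {"date_picked": "2017-01-04", "grade": "junior",   "num_apples": 10},
    {"date_picked": "2017-01-04", "grade": "freshman", "num_apples": 8},
    {"date_picked": "2017-01-01", "grade": "freshman", "num_apples": 5},
]

# "Average across days by grade": collapse on (date_picked, grade), group on (grade,).
by_grade = multidim_avg(rows, "num_apples", ("date_picked", "grade"), ("grade",))
print(by_grade)  # {('freshman',): 8.5, ('junior',): 18.0}
```

With an empty grouping key, the collapsing key equals the averaging key, matching the "average across days" query earlier (17.5).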

A final example: tying it all together

date          product        state    purchaser          quantity
2016-12-23    vacuum         NY       Brian King         1
2016-12-23    stapler        NY       Brian King         3
2016-12-23    printer ink    NY       Brian King         2
2016-12-23    vacuum         MA       Lauren Mills       1
2016-12-23    printer ink    MA       John Smith         5
2016-12-24    vacuum         MA       Lauren Mills       1
2016-12-24    keyboard       NY       Brian King         2
2016-12-24    stapler        NY       Trevor Campbell    1
2016-12-25    keyboard       MA       Tom Lewis          4

Knowing what we know now, we won't concern ourselves with what the primary key is. We can jump right into some business questions.

"What is the daily average number of items sold by state and purchaser?"
SELECT state, purchaser, AVG(quantity) FROM (
   SELECT state, purchaser, date, SUM(quantity) AS quantity
   FROM transactions
   GROUP BY state, purchaser, date
) AS t
GROUP BY state, purchaser

-- AVERAGING KEY: [state, purchaser, date] - [state, purchaser] = [date]

"What is the daily average quantity sold by product in NY?"
SELECT product, AVG(quantity) FROM (
   SELECT product, date, SUM(quantity) AS quantity
   FROM transactions
   WHERE state = 'NY'
   GROUP BY product, date
) AS t
GROUP BY product

-- AVERAGING KEY: [product, date] - [product] = [date]
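As a closing sanity check, here is a sketch running the NY query end to end with Python's sqlite3 (the scaffolding is mine; the query matches the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
    (date TEXT, product TEXT, state TEXT, purchaser TEXT, quantity INTEGER)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [("2016-12-23", "vacuum", "NY", "Brian King", 1),
     ("2016-12-23", "stapler", "NY", "Brian King", 3),
     ("2016-12-23", "printer ink", "NY", "Brian King", 2),
     ("2016-12-23", "vacuum", "MA", "Lauren Mills", 1),
     ("2016-12-23", "printer ink", "MA", "John Smith", 5),
     ("2016-12-24", "vacuum", "MA", "Lauren Mills", 1),
     ("2016-12-24", "keyboard", "NY", "Brian King", 2),
     ("2016-12-24", "stapler", "NY", "Trevor Campbell", 1),
     ("2016-12-25", "keyboard", "MA", "Tom Lewis", 4)],
)

# Daily average quantity sold by product in NY:
# collapse onto (product, date), then group by product and average across days.
rows = conn.execute("""
    SELECT product, AVG(quantity) FROM (
        SELECT product, date, SUM(quantity) AS quantity
        FROM transactions
        WHERE state = 'NY'
        GROUP BY product, date
    ) AS t
    GROUP BY product
""").fetchall()
print(dict(rows))  # stapler averages 2.0 across its two NY days; vacuum 1.0 across one
```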

I will be the first to say that answering these questions requires a double-take - what exactly is the question asking? Then, how can I translate it into SQL? But that's precisely the point - calculating multidimensional averages requires considerable thought.

Wrapping up

Business questions don't come in the form of database logic or program code. Rather, they are formulated using natural language and must be translated into data language. Again, "calculate the average number of travelers by day by city" is not entirely clear. As a data analyst, you must clarify: "what exactly are we taking the average of?" Here it's helpful to think in terms of collapsing, grouping and averaging keys, especially when conceptualizing which records are going into your denominator.

This problem isn't confined to SQL, but rather any store of relational data, such as pandas DataFrames or R data tables. If you're anything like me, you'll go poring over your old code in search of averages and wonder, "just what exactly am I averaging here?"