Why data analysts should exercise caution when taking averages (Part 2)

This is Part 1 in a 2-part series on miscalculating averages and its implications.

In Part 1 of this series, I explained the fundamental problem with calculating multidimensional averages: you can never be sure what a single observation represents in the database (given by the primary key), and as a result, you need to "guarantee" what an observation is by specifying a collapsing key. While Part 1 emphasized how to structurally think about the problem of multidimensional averages, this post covers some of its implications.


1. When it comes to averages, what makes a good question?

While every analysis begins with a good question, parsing that question linguistically can sometimes be challenging.

As a recap of my first post, every question with an average should contain at least two elements and sometimes three:

  1. Measure
  2. Averaging key (specifies the denominator)
  3. Grouping key (optional)

For example, "what is the average daily quantity of products sold?" has a measure of "quantity" and an averaging key of "day". Similarly, "How many texts does the average person send per day?" has a measure of "number of texts", a grouping key of "day" and an averaging key of "person". I think it often helps to rephrase the question like this: "the average [measure] across [averaging key] by [grouping key]."

In contrast, "what is the daily average?" is an incomplete question because there is no measure. Similarly, "what is the average quantity purchased" is incomplete because there is no averaging key (that is, average quantity purchased across people, across cohorts, over time, etc.?).

Thinking in terms of grouping and averaging keys can help parse linguistically similar questions:

  • "What is the average daily share price over the last 20 days?"
    • Output: a single value, the average, calculated by summing each day's closing share price over the last 20 days divided by 20.
  • "What is the average share price by day for the last 20 days?
    • Output: a list of 20 days (ie. "for each day"), each with its average share price over the day. For example, if your data comes at the minute-level (ie. averaging key is "across minutes"), each average would be sum of all prices that trading day divided by the count of minutes.
  • "What is the average daily share price by security in a portfolio over the last 20 days?
    • Output: a list of securities, (ie. "for each security"), each with a daily average over the last 20 days.
  • "What is the average share price by security by day?"
    • Output: [ambiguous - grouping key and averaging key are indiscernible]

To reiterate, the key here is knowing that the averaging key defines how many records will be in the denominator. If you are unsure what your denominator is, or conversely, how to interpret your averaging key, you will have trouble answering your initial question.


2. What aggregation functions besides the mean are susceptible to the pitfall?

Unfortunately, while the difficulty of working with multidimensional data may be most salient with the arithmetic mean, it is not confined to it. In fact, many other aggregation functions, such as the standard deviation (also uses n as the denominator), median, mode, maximum, minimum, first, last, count and count may all be similarly affected.

We can see this with an example.

  Example 1
date product state purchaser quantity
2016-12-23 vacuum NY Brian King 1
2016-12-23 stapler NY Brian King 3
2016-12-23 printer ink NY Brian King 2
2016-12-23 vacuum MA Lauren Mills 1
2016-12-23 printer ink MA John Smith 5
2016-12-24 vacuum MA Lauren Mills 1
2016-12-24 keyboard NY Brian King 2
2016-12-24 stapler NY Trevor Campbell 1
2016-12-25 keyboard MA Tom Lewis 4

"For each day, what is the max quantity purchased by any individual?"

Since the primary key is more granular than the key on which we are analyzing - that is, (purchaser) - we again will have to collapse the data. In SQL:

SELECT date, MAX(quantity) FROM (
    SELECT date, purchaser, SUM(quantity)
    FROM transactions
    GROUP BY date, purchaser // COLLAPSING KEY
) AS t

// MAX KEY (IMPLIED): [date, purchaser] - [date] = [purchaser]

From this example, it's clear that while there is no denominator to be affected, aggregating the actual values themselves changes what the answer will be. We can imagine how median, mode and the other calculations are all similarly affected when the underlying data is aggregated into new values.


3. Can you interpret an averaging key with more than one dimension?

Yes, the averaging key can be a composite key (ie. composed of one or more dimensions). However, the composite key has to make sense. 

  Example 2
first_name last_name num_apples
Katie Smith 4
Alan Johnson 8
Alan Howard 10
Tess O'Neal 8
Katie Powers 5

For example, the composite key of (first_name, last_name) makes sense because it has a natural interpretation: full_name. Full name corresponds to a single individual, and it makes sense to take averages on the level of individuals - as a result, (first_name, last_name) would be a legitimate averaging key. Many times, however, the averaging key will not make sense with two or more dimensions - this is an indication that more aggregation (ie. the collapsing key) should have been performed.


4. How far off can the calculated average be with the wrong averaging key?

By way of example, we can evaluate how the calculated average directionally changes as we change the length of the averaging key.

Looking at Example 2 above, what is the average when the averaging key is (first_name, last_name)? It is SUM(4 + 8 + 10 + 8 + 5) / 5 = 35. What is the average when the averaging key is (first_name)? It is SUM(4 + 8 + 10 + 8 + 5) / 3 = 11.66.

As you can tell, the more granular the the averaging key, the more records will be in the denominator (ie. higher) and the smaller the average will be. Conversely, the more collapsed the averaging key, the fewer records will be in the denominator (ie. lower) and the larger the average will be. For highly dimensional datasets (more granularity), this means averages can be significantly lower than what makes sense due to inadequate aggregation.