2017-04-13

Why a simple average is the most frequently miscalculated statistic (Part 2)

This is Part 2 in a 2-part series on miscalculating averages and its implications.

In Part 1 of this series, I explained the fundamental problem with calculating multidimensional averages: you can never be sure what a single observation (i.e., a row, as defined by the primary key) represents, and as a result, you need to “guarantee” what an observation is by specifying a collapsing key. While Part 1 emphasized how to structurally think about the problem of multidimensional averages, this post covers some of its implications.

1. When it comes to averages, what makes a good question?

While every analysis begins with a good question, parsing that question linguistically can sometimes be challenging.

As a recap of my first post, every question with an average should contain at least two elements and sometimes three:

  1. Measure
  2. Averaging key (specifies the denominator)
  3. Grouping key (optional)

For example, “what is the average daily quantity of products sold?” has a measure of “quantity” and an averaging key of “day”. Similarly, “How many texts does the average person send per day?” has a measure of “number of texts”, a grouping key of “day” and an averaging key of “person”. I think it often helps to rephrase the question like this: “the average [measure] across [averaging key] by [grouping key].”

In contrast, “what is the daily average?” is an incomplete question because there is no measure. Similarly, “what is the average quantity purchased?” is incomplete because there is no averaging key (that is, is it the average quantity purchased across people, across cohorts, over time, etc.?).
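To make the parsing concrete, here is a minimal SQL sketch of the first question - “the average [quantity] across [days]” - assuming a transactions table shaped like the one in the next section, with products_sold standing in for the quantity measure:

SELECT AVG(daily_quantity)
FROM (
    -- collapse to one row per day, then average across days
    SELECT date, SUM(products_sold) AS daily_quantity
    FROM transactions
    GROUP BY date  -- COLLAPSING KEY = AVERAGING KEY: date
) AS t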

Thinking in terms of grouping and averaging keys can also help parse linguistically similar questions. For example, “the average number of texts sent per person per day” could mean the average across people for each day (averaging key: person, grouping key: day) or the average across person-days (averaging key: (person, day)) - and the two readings generally give different numbers.

To reiterate, the key here is knowing that the averaging key defines how many records will be in the denominator. If you are unsure what your denominator is, or conversely, how to interpret your averaging key, you will have trouble answering your initial question.
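One way to sanity-check the denominator is to compute it directly: it should equal the number of distinct values of the averaging key. As a sketch, again assuming the transactions table from the next section and purchaser as the averaging key:

-- The denominator is the number of rows that survive the collapse,
-- i.e. the number of distinct averaging-key values (purchasers).
SELECT
    SUM(products_sold) * 1.0 / COUNT(DISTINCT purchaser) AS avg_quantity_per_purchaser
FROM transactions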

2. What aggregation functions besides the mean are susceptible to the pitfall?

Unfortunately, while the difficulty of working with multidimensional data may be most salient with the arithmetic mean, it is not confined to it. In fact, many other aggregation functions - such as the standard deviation (which also uses n in its denominator), median, mode, maximum, minimum, first, last, and count - are similarly affected.

We can see this with an example.

transactions

date        product      state  purchaser        products_sold
2016-12-23  vacuum       NY     Brian King       1
2016-12-23  stapler      NY     Brian King       3
2016-12-23  printer ink  NY     Brian King       2
2016-12-23  vacuum       MA     Lauren Mills     1
2016-12-23  printer ink  MA     John Smith       5
2016-12-24  vacuum       MA     Lauren Mills     1
2016-12-24  keyboard     NY     Brian King       2
2016-12-24  stapler      NY     Trevor Campbell  1
2016-12-25  keyboard     MA     Tom Lewis        4

“For each day, what is the max quantity purchased by any individual?”

Since the primary key is more granular than the key we are analyzing on - that is, (purchaser) within each day - we again have to collapse the data, this time to (date, purchaser). In SQL:

SELECT 
    date, 
    MAX(quantity) 
FROM (
    SELECT date, purchaser, SUM(products_sold) AS quantity
    FROM transactions
    GROUP BY date, purchaser  -- COLLAPSING KEY
) AS t
GROUP BY date  -- GROUPING KEY

-- MAX KEY (IMPLIED): [date, purchaser] - [date] = [purchaser]

From this example, it’s clear that even though there is no denominator to be affected, aggregating the underlying values first changes what the answer will be. We can imagine how the median, mode and other calculations are all similarly affected when the underlying data is collapsed into new values.
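To see the effect concretely, here is a sketch of the naive version of the same query, which skips the collapse entirely; on the sample data above it would return 5 for 2016-12-23 (John Smith’s single printer ink purchase) rather than the correct per-person maximum of 6 (Brian King’s three purchases combined):

SELECT
    date,
    MAX(products_sold)  -- max over individual rows, not per-purchaser daily totals
FROM transactions
GROUP BY date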

3. Can you interpret an averaging key with more than one dimension?

Yes, the averaging key can be a composite key (i.e., composed of two or more dimensions). However, the composite key has to make sense.

first_name  last_name  num_apples
Katie       Smith      4
Alan        Johnson    8
John        Howard     10
Tess        O’Neal     8
Jessica     Powers     5

For example, the composite key (first_name, last_name) makes sense because it has a natural interpretation: full_name. A full name corresponds to a single individual, and it makes sense to take averages at the level of individuals - as a result, (first_name, last_name) is a legitimate averaging key. Many times, however, an averaging key with two or more dimensions will not make sense - this is an indication that more aggregation (i.e., a coarser collapsing key) should have been performed.
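As a rough sketch, averaging on the composite key might look like the following, assuming the table above is called people (a hypothetical name) and that, in general, it could contain multiple rows per person:

SELECT AVG(total_apples)
FROM (
    -- collapse to one row per full name, then average across individuals
    SELECT first_name, last_name, SUM(num_apples) AS total_apples
    FROM people                         -- hypothetical table name
    GROUP BY first_name, last_name      -- COLLAPSING KEY = AVERAGING KEY
) AS t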

4. How far off can the calculated average be with the wrong averaging key?

By way of example, we can evaluate how the calculated average directionally changes as we change the length of the averaging key.

Looking at Example 2 above (the apples table), what is the average when the averaging key is (first_name, last_name)? It is SUM(4 + 8 + 10 + 8 + 5) / 5 = 7, since five distinct full names end up in the denominator. What is the average when the averaging key is (first_name)? The denominator is now the number of distinct first names; in the table as shown all five happen to be distinct, but if only three first names were distinct, the average would be SUM(4 + 8 + 10 + 8 + 5) / 3 ≈ 11.67.
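In SQL, the coarser averaging key corresponds to collapsing by first_name alone; again assuming the table is called people (a hypothetical name):

SELECT AVG(total_apples)
FROM (
    -- one row per distinct first name survives the collapse,
    -- so the denominator shrinks whenever first names repeat
    SELECT first_name, SUM(num_apples) AS total_apples
    FROM people          -- hypothetical table name (same as above)
    GROUP BY first_name  -- COLLAPSING KEY = AVERAGING KEY
) AS t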

As you can tell, the more granular the averaging key, the more records end up in the denominator and the smaller the average will be. Conversely, the more collapsed the averaging key, the fewer records end up in the denominator and the larger the average will be. For highly dimensional datasets (more granularity), this means averages can come out significantly lower than intended simply because of inadequate aggregation.