How NOT To Lie With Statistics “On Average”

One cannot get through the day without coming across the phrase “on average”. It is the most commonly used descriptive statistic, showing up in sports, customer reviews, course grades, and travel-time estimates. Yet, few understand what it really means (pun intended). For example, it does NOT always imply that half the sample is above this value and half below. It is NOT always the most common value in the sample.

In his iconic book How To Lie With Statistics, Darrell Huff says

My trick was to use a different kind of average each time, the word “average” having a very loose meaning. It is a trick commonly used, sometimes in innocence but often in guilt, by fellows wishing to influence public opinion or sell advertising space. When you are told that something is the average you still don’t know very much about it unless you can find out which of the common kinds of average it is- mean, median, or mode.

Darrell Huff, How To Lie With Statistics

When you come face to face with an average figure, here are a few questions to ask:

  • Average of what?
  • Was it the appropriate type of average?
  • Was the average influenced by any outliers?
  • Was the data bell-shaped? Skewed?
  • How many observations were in the dataset (sample size)?

The discussion below will help evaluate these questions.

Average vs Mean

What is generally called the average or mean and defined as “the sum of all the observations divided by the number of observations” is actually the arithmetic mean. While this is the most commonly used “average”, there are other types: geometric mean, harmonic mean, median, and mode.

So strictly speaking, average is any statistical measure that summarizes the data down to one value and tells something about a typical value from that distribution (the central tendency).

Arithmetic Mean

Leveling Approach: Beyond the “add all and divide”, one way to think of the arithmetic mean is the leveling method. Think of your data values as columns of blocks. Now if you were to even off all the columns to be the same height, the height of the leveled-off columns is the arithmetic mean.

In other words, replacing every value in the dataset with the arithmetic mean will give the same SUM as the original dataset.

It is possible that this number ends up being the most common observation in the data set, but it is also possible that this number is not a part of the existing dataset. 

Find the balance: If you were to balance your data distribution on the tip of your finger, the arithmetic mean would be that point of balance. Therefore, arithmetic mean can be thought of as the center of mass of the data (distribution).

Do not let the word “center” imply that half of the data lies on each side. This is only guaranteed for symmetric distributions. If the distribution has a long right tail (right skewed), the balance is found to the right of the peak. If the distribution has a long left tail (left skewed), the balance is found to the left of the peak.

Understanding the arithmetic mean

Let's say the n observations in your dataset represent the sides of an n-sided polygon, with all sides adding up to p (the perimeter). If you were to redraw the polygon such that all sides were equal in length and the perimeter remained the same, that length would be the arithmetic mean. For example, for a 5-sided polygon (pentagon) with sides 2, 3, 1, 2, 7 the perimeter is 15. A pentagon with all sides equal and perimeter 15 would have to have a side length of 3, which is the arithmetic mean.
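The leveling, balancing and polygon views all describe the same number. A minimal sketch using Python's standard statistics module and the pentagon sides from the example (the variable names are only illustrative):

```python
from statistics import mean

sides = [2, 3, 1, 2, 7]            # pentagon sides from the example above
level = mean(sides)                # 15 / 5

# Leveling: replacing every value with the mean keeps the sum (perimeter).
leveled = [level] * len(sides)
print(sum(sides), sum(leveled))    # both are 15

# Balance point: deviations above and below the mean cancel out.
deviations = [x - level for x in sides]
print(deviations, sum(deviations)) # the deviations sum to 0
```

Note that the single large side (7) is balanced by three sides that fall below the mean, which is why the mean does not have to split the data in half.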

When to use arithmetic mean?

It is appropriate for observations that are normally distributed or when data has a narrow range. It is especially useful when comparing two similar and not-skewed datasets.

Geometric Mean

Geometric mean is the nth root of the product of n numbers. In other words, to calculate the geometric mean of n observations, multiply all the observations together and then take the nth root. E.g. to calculate the geometric mean of 2, 6, 18, multiply the numbers (2 x 6 x 18 = 216) and find the 3rd root (cube root of 216 = 6). So, the geometric mean is 6.

Replacing every value in the dataset with the geometric mean will give the same PRODUCT as the original dataset. In this example, replacing all three observations with 6 gives the same product of 216.

Another way to think of the geometric mean is to treat a dataset with n values as an n-dimensional box (not an n-sided shape) whose edge lengths are the data values. If you were to squish this box into one with all edges equal while keeping its volume the same, the geometric mean is the edge length of that shape. The geometric mean of two numbers, 2 and 18, is the side of the square (6) with the same area as a 2 x 18 rectangle. The geometric mean of 2, 6, 18 is the side of the cube (6) with the same volume as a 2 x 6 x 18 box.
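The product-preserving property can be checked in a few lines. This is a small sketch using geometric_mean and prod from the Python standard library (Python 3.8 or later); because the result is a floating-point number, it is compared after rounding rather than exactly:

```python
from math import isclose, prod
from statistics import geometric_mean

values = [2, 6, 18]
gm = geometric_mean(values)          # cube root of 2 * 6 * 18 = 216

# Replacing every value with the geometric mean keeps the product,
# up to floating-point rounding.
product_original = prod(values)      # 216
product_replaced = gm ** len(values)
print(round(gm, 6), product_original, round(product_replaced, 6))
print(isclose(product_replaced, product_original))
```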

When to use geometric mean?

Because the geometric mean is equivalent to averaging the logarithms of the data and then transforming back, it dampens the effect of very large values. It is therefore a better fit than the arithmetic mean for positive-valued observations that are right skewed or measured on different scales. It should also be used to get the average value of observations that combine multiplicatively or are derived from other values (e.g. percentages, rates, ratios). Note that it is only defined when all the observations are positive.

Note: When calculating the geometric mean of percentages or growth rates, convert them into growth factors first (1 plus the rate written as a decimal). For example, if a bacterial colony grows by 12% on day 1, 4% on day 2 and 2% on day 3, the geometric mean is the cube root of (1.12 x 1.04 x 1.02), which is about 1.059. This tells us that the average daily growth rate over the 3 days is about 5.9%.
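The bacterial colony example can be worked through as follows. This is only a sketch with illustrative variable names; it also shows why the arithmetic mean of the three rates (6%) is not the right summary for compounding growth:

```python
from statistics import geometric_mean, mean

daily_growth = [0.12, 0.04, 0.02]              # 12%, 4%, 2% from the note above
factors = [1 + r for r in daily_growth]        # growth factors 1.12, 1.04, 1.02

avg_factor = geometric_mean(factors)
avg_rate = avg_factor - 1
print(f"geometric average daily growth: {avg_rate:.1%}")

# Compounding the average factor for 3 days reproduces the actual total growth.
actual_total = factors[0] * factors[1] * factors[2]
print(round(avg_factor ** 3, 6), round(actual_total, 6))

# The arithmetic mean of the rates (6%) slightly overstates the total.
naive_total = (1 + mean(daily_growth)) ** 3
print(round(naive_total, 6))
```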

Harmonic Mean

Harmonic mean is calculated as the reciprocal of the arithmetic mean of the reciprocals of the observations. Find the reciprocal of all the observations in the dataset. Then calculate the arithmetic mean and take the reciprocal.
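The three steps can be followed by hand on the values 2, 6, 18 (reused here from the geometric mean example purely for illustration). The sketch below uses exact fractions so each intermediate value is visible, then compares the result with harmonic_mean from the standard statistics module:

```python
from fractions import Fraction
from statistics import harmonic_mean

values = [2, 6, 18]   # reused from the geometric mean example

# Step 1: take the reciprocal of every observation.
reciprocals = [Fraction(1, v) for v in values]          # 1/2, 1/6, 1/18
# Step 2: take the arithmetic mean of the reciprocals.
mean_of_reciprocals = sum(reciprocals) / len(values)    # 13/54
# Step 3: take the reciprocal of that mean.
hm = 1 / mean_of_reciprocals                            # 54/13

print(hm, round(float(hm), 4))
print(round(harmonic_mean(values), 4))                  # same value from the library
```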

When to use harmonic mean?

Use it to calculate the average of rates or ratios when the numerator is held fixed and the denominator varies. E.g. if you drove to work at 20 mph and returned from work at 30 mph, the average speed is the harmonic mean because the distance (the miles in mph) is the same each way but the time (the hours in mph) varies. So, the average speed for the round trip is 24 mph, not the arithmetic mean of 25 mph. The harmonic mean automatically takes into account the different amounts of time spent traveling at each speed.

If the situation was that you drove to work for an hour at 20 mph and then at 30 mph for another hour, the average speed would be calculated using the arithmetic mean. This is because the time (hour of mph) you drove at a certain speed is the same.
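Both driving scenarios can be verified from first principles, since average speed is simply total distance divided by total time. In this sketch the helper average_speed and the 10-mile commute are assumptions made for illustration; any one-way distance gives the same round-trip answer:

```python
from statistics import harmonic_mean, mean

def average_speed(legs):
    """legs is a list of (distance_miles, speed_mph) pairs.
    Returns total distance divided by total time."""
    total_distance = sum(d for d, _ in legs)
    total_time = sum(d / s for d, s in legs)
    return total_distance / total_time

# Same distance each way (10 miles assumed): the slower leg takes longer.
round_trip = average_speed([(10, 20), (10, 30)])
print(round(round_trip, 6), round(harmonic_mean([20, 30]), 6))   # both about 24

# Same time at each speed: 1 hour at 20 mph is 20 miles, 1 hour at 30 mph is 30 miles.
equal_time = average_speed([(20, 20), (30, 30)])
print(round(equal_time, 6), mean([20, 30]))                      # both 25
```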

Relationship Between The Three Means

  • Harmonic mean is the arithmetic mean computed on the reciprocals of the data, then transformed back by taking the reciprocal
  • Geometric mean is the arithmetic mean computed on the logarithms of the data, then transformed back by exponentiating
  • For positive data, harmonic mean ≤ geometric mean ≤ arithmetic mean, with equality only when all the observations are identical
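These relationships can be checked numerically. The following sketch reuses the values 2, 6, 18 and adds a made-up list of identical values to show the case where all three means coincide:

```python
from math import exp, log
from statistics import geometric_mean, harmonic_mean, mean

values = [2, 6, 18]
am = mean(values)
gm = geometric_mean(values)
hm = harmonic_mean(values)
print(round(hm, 3), round(gm, 3), round(am, 3))   # harmonic < geometric < arithmetic

# Geometric mean: average the logs, then exponentiate.
gm_via_logs = exp(mean([log(v) for v in values]))
# Harmonic mean: average the reciprocals, then take the reciprocal.
hm_via_reciprocals = 1 / mean([1 / v for v in values])
print(round(gm_via_logs, 6), round(hm_via_reciprocals, 6))

# When every observation is identical, the three means agree.
same = [5, 5, 5]
print(mean(same), round(geometric_mean(same), 6), round(harmonic_mean(same), 6))
```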

Median

To find the median, arrange your data in ascending order. The middle observation is the median. If there is an even number of items in the dataset, the median is the arithmetic mean of the two middle observations.

So, the median is truly the middle value of the dataset, with half the observations lying at or above it and half at or below. Because it depends only on the middle of the sorted data, it is barely affected by outliers. It is also an intuitive measure to understand and easy to calculate.
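A short sketch shows how little the median moves when an outlier appears, compared with the arithmetic mean. The commute times below are made up for illustration:

```python
from statistics import mean, median

commutes = [30, 32, 35, 38, 40]          # made-up commute times in minutes
with_outlier = [30, 32, 35, 38, 400]     # same data, one extreme value

print(median(commutes), mean(commutes))          # 35 and 35
print(median(with_outlier), mean(with_outlier))  # median stays 35, mean jumps to 107

# Even number of observations: average the two middle values (32 and 35).
print(median([30, 32, 35, 38]))                  # 33.5
```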

When to use median?

If the dataset includes a lot of outliers or is strongly skewed, it may be appropriate to use the median. It should also be used for ranked values (ordinal data), e.g. a scale of important, neutral, not important.

Relationship Between Arithmetic Mean, Median and Mode:

  • For a positively skewed distribution (long right tail), typically mean > median > mode, where the mode is the most frequently occurring value
  • For a negatively skewed distribution (long left tail), typically mean < median < mode
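The ordering can be seen on a small example. The data below is made up to have a long right tail, and its mirror image gives a long left tail. This is a rule of thumb for typical single-peaked data rather than a guarantee, so it is worth checking on your own dataset:

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]      # made-up data, long right tail
left_skewed = [16 - x for x in right_skewed]        # mirror image, long left tail

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

For the right-skewed list the mean is 4.6, the median is 3 and the mode is 2. For the mirrored list the order reverses: mean 11.4, median 13, mode 14.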

Conclusion

When calculating any type of average value, the goal is generally to understand the data by summarizing it down to one number. There are many different ways to get that number. It is perfectly okay to use the arithmetic mean as your average value if it makes sense, but not simply out of habit. Conversely, when you come across a reported “average”, think about which calculation was applied and whether it was appropriate.

Lastly, you should never rely on just one number to describe and understand a dataset. It is often helpful to look at how different average values of the same dataset behave in relation to one another. It is also important to look at measures of spread like range, interquartile range, standard deviation and variance.