What Use is an Average?

March 11, 2014

What’s the point of an average, anyway? Common knowledge (the 1967 Children’s World Book Encyclopedia) states that an average has something to do with an arithmetic mean and/or a “central tendency”. I’m pretty sure that a “central tendency” is a sly sexual reference, so we’ll go back over to mathematics being mean.

What’s an Average?

An average is the sum of all numbers in a sequence divided by the count of all numbers in a sequence. Remember when credit cards used your average balance to compute interest every month instead of the ending balance? Pay that bad boy off at the beginning of the bill cycle, buy a Cadillac on Tuesday, and then sit around until the end of the month before returning your Cadillac. Free Cadillac, I say. Anyway… math. Averages are an average way of measuring things. By average I mean average in the disappointed way that your parents looked at you when you brought home average grades in junior high school. By average I mean barely acceptable. Why do averages suck? Look at this “sequence” of numbers: [1000, 1000, 1000, 1000, 96000] - the average is 20,000. If you have a street with four people squatting in houses and one person selling drugs from a Contoso FroYo van, you could say that the average income of people on that street is $20,000 a year. That’s a fair assessment. After all, none of these people live below the poverty line, on average, so we can keep pesky do-gooders out of our neighborhood and get back to selling “yogurt”. What if those numbers record the duration of a query in microseconds? What if that query should generate an alert when the query takes longer than 50 milliseconds to execute. Statistically speaking, we’re okay, right? On average the query is great.

More Than An Average

We need something more than an average, clearly. On average, my arguments are much better than this… What’s better than an average? Maybe one of these other funny words that the 1967 Children’s World Book Encyclopedia contains… It says that “the mode is the value that appears most often in a set of data (citation needed)”. That seems like a better approach, right?

In the case of the “mode” (not to be confused with pie a la mode), the mode of our query would be 0ms. VICTORY! No further tuning is required. Modes can obscure meaningful data, too, but in different ways. So if averages and modes are out, what’s left?

The Berm

I could tell you about using the berm to collect statistics… While the berm may be a common barrier, level shelf, or space between two areas, it’s not a great statistical measure. There’s a thing called a median, which I could have sworn was a berm, that might be useful. The median is a number that sits in the middle of the distribution of the data. It’s like a berm, but it has nothing to do with traffic… I think. Now, the problem with the median is it’s the ACTUAL value in the middle of a set of numbers. So with a set that looks like {0, 0, 0, 0, 100000} the median value is still 0. (Did you know that sets of numbers should have curly brackets and not square brackets? Me neither! I want curly fries…) P.S. That definition of the median is sorta wrong - if there’s no clear middle, then you take the average of the two numbers closest to the middle. In this case, I think the median would be 50,000. The 1967 Children’s World Book Encyclopedia has some pages missing. It also talks a lot about non-Euclidean geometry and trying to fit rabbits inside a sphere. It’s a weird encyclopedia.

Clearly, Statistics Are Worthless

You said it, headline! Statistics are useless. Wait, that’s not right… Statistics are only as useful as the statistics we compute. I think that’s more appropriate. The problem with applying computations to a set of numbers is that any single computation can’t really represent the numbers. This is exactly like how you voted for that one politician but they don’t agree with you. Numbers: the pork belly spending of the mathematical world. What are we to do if there’s no one number that can represent our data effectively?

Let’s Start A Gang!

There’s strength in numbers, and what better way to get strength from a variety of numbers than to have a lot of them. (That sounded better in my head.) If we really want to see what’s going on inside our SQL Server - or any system - we need to gather more than one metric. We need to gather more than just the average and the median. We need values that show us how the information is spread out. To get an accurate measure we need to look at a number of different data points - this doesn’t mean we need to capture and store every data point about our SQL Server, but we do need to look at something more significant. To start with, perform regular aggregation at intervals; collect the average and median and combine them with different percentiles. Ideally we want to grab the 50th, 75th, 90th, 95th, 99th, and the 99.5th percentile. Collecting this gang of information gives us the ability to figure out if we really are looking at a 20 millisecond query duration or if we’re looking at queries that typically take 1 millisecond to execute with a 96 millisecond outlier.

A Thank You

Michael Swart challenged me to write something crazier than my previous attempt at T-SQL Tuesday. I’m sad to say that I couldn’t dig up anything equally as crazy this time around. Hopefully, though, this is just crazy enough…