Statistical Principals for the Performance Tester

Over the years I found that members of software development teams, developers, testers, administrators and managers alike have an insufficient grasp on how to apply mathematics or interpret statistical data on the job.
As performance testers, we must know and be able to apply certain mathematical and statistical concepts.
Exemplar Data Sets
This section refers to three exemplar data sets for the purposes of illustration.
Data Sets Summary
The following is a summary of Data Set A, B, and C.
Sample Size
Min.
Max.
Avg
Median
Normal
Mode
95th %
Std Dev.
Data Set A
100
1
7
4
4
4
4
6
1.5
Data Set B
100
1
16
4
1
3
1
16
6.0
Data Set C
100
0
8
4
4
1
3
8
2.6
Data Set A
100 total data points, distributed as follows:
●      5 data points have a value of 1.
●      10 data points have a value of 2.
●      20 data points have a value of 3.
●      30 data points have a value of 4.
●      20 data points have a value of 5.
●      10 data points have a value of 6.
●      5 data points have a value of 7.
Data Set B

100 total data points, distributed as follows:

●      80 data points have a value of 1.
●      20 data points have a value of 16.
Data Set C
  100 total data points, distributed as follows:
●      11 data points have a value of 0.
●      10 data points have a value of 1.
●      11 data points have a value of 2.
●      13 data points have a value of 3.
●      11 data points have a value of 4.
●      11 data points have a value of 5.
●      11 data points have a value of 6.
●      12 data points have a value of 7.
●      10 data points have a value of 8.
Averages
Also known as arithmetic mean, or mean for short, the average is probably the most commonly used and most commonly misunderstood statistic of them all. Just add up all the numbers and divide by how many numbers you just added-what could be simpler?
When it comes to performance testing, in this example, Data Sets A, B, and C each have an average of exactly 4.
 In terms of application response times, these sets of data have extremely different meanings.
Given a response time goal of 5 seconds, looking at only the average of these sets, all three seem to meet the goal.
Looking at the data, however, shows that none of the data sets is composed only of data that meets the goal, and that Data Set B probably demonstrates some kind of performance anomaly.
Use caution when using averages to discuss response times, and, if at all possible, avoid using averages as your only reported statistic.
Percentiles
It is a straightforward concept easier to demonstrate than define. Consider the 95th percentile as an example. If you have 100 measurements ordered from greatest to least, and you count down the five largest measurements, the next largest measurement represents the 95th percentile of those measurements. For the purposes of response times, this statistic is read “Ninety-five percent of the simulated users experienced a response time of this value or less under the same conditions as the test execution.”
The 95th percentile of data set B above is 16 seconds. Obviously this does not give the impression of achieving our five-second response-time goal. Interestingly, this can be misleading as well: If we were to look at the 80th percentile on the same data set, it would be one second. Despite this possibility, percentiles remain the statistic that I find to be the most effective most often. That said, percentile statistics can stand alone only when used to represent data that’s uniformly or normally distributed and has an acceptable number of outliers.
Uniform Distributions
Uniform distribution is a term that represents a collection of data roughly equivalent to a set of random numbers that are evenly distributed between the upper and lower bounds of the data set. The key is that every number in the data set is represented approximately the same number of times. Uniform distributions are frequently used when modeling user delays, but aren’t particularly common results in actual response-time data. I’d go so far as to say that uniformly distributed results in response-time data are a pretty good indicator that someone should probably double-check the test or take a hard look at the application.
Normal Distributions
Also called a bell curve, a data set whose member data are weighted toward the center (or median value) is a normal distribution. When graphed, the shape of the “bell” of normally distributed data can vary from tall and narrow to short and squat, depending on the standard deviation of the data set; the smaller the standard deviation, the taller and more narrow the bell. Quantifiable human activities often result in normally distributed data. Normally distributed data is also common for response time data.
Standard Deviations
By definition, one standard deviation is the amount of variance within a set of measurements that encompasses approximately the top 68 percent of all measurements in the set; what that means in English is that knowing the standard deviation of your data set tells you how densely the data points are clustered around the mean. Simply put, the smaller the standard deviation, the more consistent the data. To illustrate, the standard deviation of data set A is approximately .7, while the standard deviation of data set B is approximately 6.
Another rule of thumb is this: Data with a standard deviation greater than half of its mean should be treated as suspect.
Statistical Significance
Mathematically calculating statistical significance, also known as reliability. Whenever possible, ensure that you collect at least 100 measurements from at least two independent tests.
While there’s no hard-and-fast rule about how to decide which results are statistically similar without complex equations that call for volumes of data, try comparing results from at least five test executions and apply these rules to help you determine whether or not test results are similar enough to be considered reliable if you’re not sure after your first two tests:
1.    If more than 20 percent (or one out of five) of the test execution results appear not to be similar to the rest, something is generally wrong with either the test environment, the application or the test itself.
2.    If a 95th percentile value for any test execution is greater than the maximum or less than the minimum value for any of the other test executions, it’s probably not statistically similar.
3.    If measurement from a test is noticeably higher or lower, when charted side-by-side, than the results of the rest of the test executions, it’s probably not statistically similar.
4.    If a single measurement category (for example, the response time for a specific object) in a test is noticeably higher or lower, when charted side-by-side with all the rest of the test execution results, but the results for all the rest of the measurements in that test are not, the test itself is probably statistically similar.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.