Chapter 5 Summary statistics
One of the most common uses of statistics is to provide a summary of some set of data. We can summarize datasets by describing specific properties. Let’s consider some of the basic summary statistics.
First let’s consider some data. The following table is a frequency table of GPAs among 23 students.
| GPA | Frequency | 
|---|---|
| 3.3 | 4 | 
| 3.4 | 5 | 
| 3.5 | 10 | 
| 3.6 | 0 | 
| 3.7 | 0 | 
| 3.8 | 1 | 
| 3.9 | 2 | 
| 4.0 | 1 | 
This table summarizes the 23 values in the sample. However, it is not a very useful format for performing calculations in R. Instead, we will consider all the data by putting it in the following format:
gpa.sample <- c(rep(3.3,4), rep(3.4,5), rep(3.5, 10), rep(3.6, 0), rep(3.7, 0), rep(3.8, 1), rep(3.9, 2), rep(4.0, 1))
gpa.sample##  [1] 3.3 3.3 3.3 3.3 3.4 3.4 3.4 3.4 3.4 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5
## [18] 3.5 3.5 3.8 3.9 3.9 4.05.1 Central tendency
There are three ways of defining the central tendency of the data. They are:
- mean
- median
- mode
5.1.1 Mean
The mean is the sum of all values divided by the number of individuals.
\(\overline {x}=\sum _{i=1}^{n}\dfrac {x_{i}} {n}\)
All this equation means is that we add up all the values and divide by the total number of values. In this case, that’s:
## [1] 80.8## [1] 3.5130435.1.2 Median
The median is the middle value when all values are arranged in increasing order. For \(n\) values, the median value is:
- if \(n\) is odd, the median value is the \((n+1)/2\) position
- if \(n\) is even, the median value is the average of the \(n/2\) and \((n+1)/2\) positions
In our sample of 23 individuals (an odd number) the median value is the value at position \((23+1)/2 = 12\), which makes it 3.5.
Luckily, we can easily calculate that in R.
## [1] 3.5###Mode
The mode is the most common value in set of data. There is no formula for calculating that. We can easily see from the frequency table that the most frequent value, or the mode, is: \(3.5\)
R doesn’t have a function for determining the mode. In practice, the mode is not used very often as a summary statistic.
5.2 Variability
In addition to sumamrizing the central tendency of the data, we often want to descrive the variability in the data, or the spread. Different data can have the same central tendency, but very different amounts of variability.
5.2.1 Variance
A useful measure of the variability in the data is the variance. In order to measure the variance in a sample we:
- measure the distance of each value from the mean
- determine the average distance of all values from the mean
The problem is that some values are greater than the mean, so this difference will be positive, and some values are less than the mean, so this difference will be negative. The solution is to take the square of the difference. As the square of both positive and negative numbers is positive these measures of difference from the mean will all be positive.
The formula for the variance is:
\(s ^{2}=\dfrac {\sum _{i=1}^{n}\left( x_{i}-\overline {x}\right) ^{2}} {n-1}\)
The numerator of this equation, \(\sum _{i=1}^{n}\left( x_{i}-\overline {x}\right) ^{2}\), is called the sum of squares
The denominator of this equation, \({n-1}\), is called the degrees of freedom (more about that later).
So, the variance can be considered the average of the sum of squares.
For our data, the variance is:
##  [1] 0.0453875236 0.0453875236 0.0453875236 0.0453875236 0.0127788280
##  [6] 0.0127788280 0.0127788280 0.0127788280 0.0127788280 0.0001701323
## [11] 0.0001701323 0.0001701323 0.0001701323 0.0001701323 0.0001701323
## [16] 0.0001701323 0.0001701323 0.0001701323 0.0001701323 0.0823440454
## [21] 0.1497353497 0.1497353497 0.2371266541## [1] 0.866087## [1] 0.03936759which, thankfully we can compute in R using the function var():
## [1] 0.039367595.2.2 Standard deviation
The problem with using variance as the measure of variability in the data is that the units of are different from the data as it is the average of the squared differences. The simple solution is to take the square root of the variance, which we define as the standard deviation.
The formula for the standard deviation is:
\(S =\sqrt {\dfrac {\sum _{i=1}^{n}\left( x_{i}-\overline {x}\right) ^{2}} {n-1}}\)
We can determine the standard deviation for our data by taking the square root of the variance using the R function sqrt().
## [1] 0.1984127or, more easily, we can use the R function sd().
## [1] 0.19841275.3 Degrees of freedom
The number of degrees of freedom is literally the number of ways that our data is free. When we estimate a parameter from the data, like the mean, our data has become constrained and has lost a degree of freedom.
To illustrate this, consider our 23 values, which we have estimated to have a mean of 3.5130435.