Sample Mean, Median, Variance, and Standard Deviation

Sample Mean

The (arithmetic) sample mean of $n$ samples $x_i$ is defined as

$$ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

This must be distinguished from the true mean $\mu$ of the probability distribution from which the samples were drawn. The sample mean $\overline{x}$ serves as an estimate of $\mu$ and generally becomes more accurate as the number of samples $n$ grows, but it remains an estimate and is not the same quantity.
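To illustrate the difference between $\overline{x}$ and $\mu$, here is a minimal sketch using only the Python standard library. The scenario (airspeed readings scattered around a true mean of 250 knots with a spread of 5 knots), the helper name `sample_mean`, and the seed are assumptions made purely for illustration.

```python
import random


def sample_mean(samples):
    """Arithmetic sample mean: the sum of the samples divided by their count n."""
    return sum(samples) / len(samples)


# Hypothetical setup: readings drawn from a normal distribution with a known true mean.
TRUE_MEAN = 250.0   # assumed true mean (mu) of the distribution
TRUE_SIGMA = 5.0    # assumed true standard deviation
rng = random.Random(42)  # fixed seed so that the run is repeatable

for n in (10, 1_000, 100_000):
    samples = [rng.gauss(TRUE_MEAN, TRUE_SIGMA) for _ in range(n)]
    estimate = sample_mean(samples)
    print(f"n = {n:>6}: sample mean = {estimate:.3f}, error = {estimate - TRUE_MEAN:+.3f}")
```

The error of the estimate tends to shrink as $n$ grows, although for any single run a smaller sample can happen to land closer to $\mu$ than a larger one.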

Median

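If we have $n$ samples of a quantity, the median is the value of the sample $x_i$ for which as many samples have a higher value as have a lower value. If $n$ is even, no single sample satisfies this exactly, and the median is conventionally taken as the average of the two middle values of the sorted samples.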
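The definition can be sketched in a few lines of plain Python. The function below is an illustrative helper, not part of any library; it sorts the samples and picks the middle one, averaging the two middle values when $n$ is even, which is also the convention followed by `statistics.median` in the standard library.

```python
import statistics


def median(samples):
    """Middle value of the sorted samples; average of the two middle values if n is even."""
    ordered = sorted(samples)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [3, 1, 7, 5, 9]   # sorted: 1 3 5 7 9 -> two values below 5, two above
even = [3, 1, 7, 5]     # sorted: 1 3 5 7   -> average of 3 and 5

print(median(odd), statistics.median(odd))
print(median(even), statistics.median(even))
```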

Variance

The variance can be defined in two ways. If the true mean $\mu$ of the probability distribution from which the samples are drawn is known, then the variance is defined as

$$ Var = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2 $$

However, if the mean has to be estimated from the samples, i.e. if we replace the true $\mu$ above with the sample mean $\overline{x}$, then the variance is defined as (notice the change in the denominator)

$$ Var = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})^2 $$
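The two definitions can be compared side by side in a small sketch. The function names and the data are illustrative assumptions; the supposedly known mean $\mu = 5$ was chosen to coincide with the sample mean of the data, so that the only difference between the two results is the denominator.

```python
def variance_known_mean(samples, mu):
    """Variance with denominator n, for the case where the true mean mu is known."""
    return sum((x - mu) ** 2 for x in samples) / len(samples)


def variance_estimated_mean(samples):
    """Variance with denominator n-1, for the case where the mean is estimated from the samples."""
    n = len(samples)
    x_bar = sum(samples) / n
    return sum((x - x_bar) ** 2 for x in samples) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # sample mean is 5.0

# The sum of squared deviations from 5.0 is 32 in both cases.
print(variance_known_mean(data, mu=5.0))   # 32 / 8
print(variance_estimated_mean(data))       # 32 / 7
```

For large $n$ the two denominators differ very little; the distinction matters mostly for small samples.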

The variance can also be denoted by $V$ or $\sigma^2$.

Standard Deviation

The standard deviation $\sigma$ is defined as the square root of the variance, $\sigma = \sqrt{Var}$.
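As a quick check of this relation, the following sketch uses the standard library's `statistics.variance` and `statistics.stdev`, both of which use the $n-1$ denominator. The data values are made up for illustration.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values

var = statistics.variance(data)  # sample variance (n-1 in the denominator)
sigma = math.sqrt(var)           # standard deviation as the square root of the variance

print(var, sigma, statistics.stdev(data))
```

Because of the square root, the standard deviation has the same units as the data itself, whereas the variance has those units squared.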

Covariance

If we have two different random variables, from which we draw samples $x_i$ and $y_i$, then the covariance is defined as

$$ Cov = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y) $$

if the means $\mu_x$ and $\mu_y$ of the two populations are known. If they are instead estimated by the sample means $\overline{x}$ and $\overline{y}$, then the denominator $n$ is again replaced by $n-1$, just as in the definition of the variance.

The covariance becomes equal to the variance if $y_i = x_i$ for all $i$ (so that both sets of samples come from the same distribution). This follows directly from comparing the two corresponding formulas.
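A minimal sketch of the sample covariance (with the $n-1$ denominator, i.e. with both means estimated from the data) is given below. The function name and the data are illustrative assumptions. The last line checks that the covariance of a data set with itself reproduces its variance.

```python
def covariance(xs, ys):
    """Sample covariance with n-1 in the denominator; both means are estimated from the data."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must contain the same number of samples")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]  # ys rises together with xs, so the covariance is positive

print(covariance(xs, ys))  # sum of products is 20, divided by n-1 = 4
print(covariance(xs, xs))  # equals the sample variance of xs: 10 / 4
```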

Code

The DataStatistics class below implements the calculation of the mean, variance, standard deviation, and median.

import numpy as np


class DataStatistics:
    
    def mean(self, data):
        """ Computes the estimate of the mean from the data. """
        # Could alternatively use the Numpy built-in function:
        # mean = np.mean(data)
        total = 0.0
        for value in data:
            total = total + value
        return total / len(data) # this is the denominator n.
    
    def variance(self, data, mean=None):
        """ Implements the sample variance, which has n-1 in the denominator, assuming the mean is being estimated
        from the data, too; does not implement the population variance, which has n in the denominator
        and assumes the mean is known, not also estimated. 
        
        If no mean is supplied, then the mean is computed from the data. However,
        for certain applications a different mean can be supplied,
        e.g. zero for the computation of the CEP (circular error probable) around a target.
        Note that the denominator remains n-1 even when a mean is supplied. """
        
        if mean is None:
            mean = self.mean(data)
        # the above allows the caller to supply a mean other than the one computed from the data.
            
        variance = 0.0
        for i in range(0, len(data)):
            variance = variance + (data[i]-mean)**2
        variance = variance / (len(data)-1) # this is the n-1 in denominator.
        return variance
    
    def standard_deviation(self, data, mean=None):
        """ Implements the sample standard deviation (with n-1 in the denominator, see above), assuming the mean
        has been estimated from the data, too (as opposed to being known). """
        variance = self.variance(data, mean)
        stddev = np.sqrt(variance)
        return stddev
    
    def median(self, data):
        """ Computes the median of the data, using a Numpy built-in function. """
        median = np.median(data) # middle value of the sorted data; for even n, the average of the two middle values.
        return median