Data Science and Computing with Python for Pilots and Flight Test Engineers
Probability of Sample Mean and Confidence Intervals
Introduction
Probability of Sample Mean from a Population
Suppose we have a data sample of \(n\) individual measurements (the sample size) drawn from a population, and let \(\overline{x}\) denote the mean computed from this sample. In this lesson we compute the probability of obtaining such a sample mean \(\overline{x}\), given statistical quantities that are already known about the population.
- If the actual mean \(\mu\) and the standard deviation \(\sigma\) (or variance \(\sigma^2\) ) of the population both are known, use the normal (Gaussian) distribution method below.
- If the actual mean \(\mu\) of the population is known and the standard deviation \(\sigma\) is unknown, but the sample size is large (\(n\ge30\)), estimate the population standard deviation \(\sigma\) from the sample standard deviation \(s\) and then use the normal (Gaussian) distribution method below.
- If the actual mean \(\mu\) of the population is known, the standard deviation \(\sigma\) is unknown, and the sample size is small (\(n<30\)), use the Student’s t-distribution method below. In this method, you obtain a probability estimate for \(\overline{x}\) directly from the calculated \(t\) value (which also takes into account the standard deviation \(s\) of the sample) by looking up the probability of the \(t\) value in a table.
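The three cases above can be sketched as a small decision helper. This is only an illustration of the rules as stated in the list; the function name `choose_method` and its return strings are hypothetical and not part of any course library.

```python
def choose_method(sigma_known: bool, n: int, large_sample: int = 30) -> str:
    """Pick the distribution suggested by the three rules in the list above.

    sigma_known  -- True if the population standard deviation is known
    n            -- sample size
    large_sample -- threshold for a "large" sample (30 in this lesson)
    """
    if n < 1:
        raise ValueError("sample size must be at least 1")
    if sigma_known:
        return "normal"      # sigma known: normal (Gaussian) method
    if n >= large_sample:
        return "normal"      # sigma unknown, large sample: use s in place of sigma
    return "student_t"       # sigma unknown, small sample: Student's t


print(choose_method(sigma_known=True, n=12))    # normal
print(choose_method(sigma_known=False, n=45))   # normal
print(choose_method(sigma_known=False, n=12))   # student_t
```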
Confidence Interval
Above we assumed that we know at least the actual mean of the population and maybe the actual standard deviation. Here the situation is different. If the actual mean is not known, but the actual population variance is known, then we can infer the mean of the population \(\mu\) from our sample and establish a confidence interval around that inference, in which we believe the actual (yet unknown) mean of the population to lie with a certain confidence level. For this case, we again use the normal (Gaussian) distribution, regardless of sample size. If the standard deviation is also unknown, we can estimate it from the sample and use the normal (Gaussian) method again, provided the sample is large (\(n\ge30\)). If the sample is small and the standard deviation is unknown, we must use Student’s t-distribution method again.
The goal in these cases is to compute the boundaries of the confidence interval, within which we believe the actual mean of the population to lie, given the sample we took.
Notation Convention
We call a normal (Gaussian) distribution with zero mean, \(\mu=0\), and standard deviation equal to 1, \(\sigma=1\), a normalized normal (Gaussian) distribution or a \(z\)-distribution (it is also widely known as the standard normal distribution). Wherever the term \(z\)-distribution appears, it refers to a normalized normal (Gaussian) distribution with \(\mu=0\) and \(\sigma=1\). This applies in particular to the normal (Gaussian) distribution look-up table, which tabulates only the \(z\)-distribution: users need to convert their normal (Gaussian) distribution to the normalized one by subtracting the mean and dividing by the standard deviation before applying the table.
Normal (Gaussian) Distribution Method
Probability of Sample Mean given known Population Mean and Standard Deviation (or unknown Standard Deviation and a Large Sample Size, \(n\ge30\)).
Transformation from a general normal (Gaussian) distribution with mean \(\mu\) and standard deviation \(\sigma\) (using variable \(x\)) to the normalized normal distribution with mean \(\mu_0=0\) and standard deviation \(\sigma_0=1.0\) (using variable \(z\)) is accomplished with
$$ z = \frac{x-\mu}{\sigma} $$
$$ dz = \frac{1}{\sigma}\,dx $$
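As a minimal sketch of this transformation, the snippet below converts a value \(x\) to \(z\) and checks that the cumulative probability is the same in both descriptions. It uses `statistics.NormalDist` from the Python standard library in place of a printed table; the numbers for `mu`, `sigma` and `x` are made up for illustration.

```python
from statistics import NormalDist


def to_z(x: float, mu: float, sigma: float) -> float:
    """Shift by the mean and scale by the standard deviation."""
    return (x - mu) / sigma


# Hypothetical values, chosen only to illustrate the transformation.
mu, sigma = 120.0, 4.0
x = 126.0

z = to_z(x, mu, sigma)                      # (126 - 120) / 4 = 1.5

# Probability of a value <= x in the general distribution ...
p_general = NormalDist(mu, sigma).cdf(x)
# ... equals the probability of a value <= z in the z-distribution.
p_standard = NormalDist().cdf(z)

print(z, round(p_general, 4), round(p_standard, 4))
```

Both probabilities agree, which is why a single \(z\) table is enough for every normal distribution.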
Given the known population mean \(\mu\) and standard deviation \(\sigma\), the mean of \(n\) samples is itself normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), so a sample mean value \(\overline{x}\) given by
$$ \overline x = \mu + z \frac{\sigma}{\sqrt{n}} $$
occurs with the probability of the corresponding \(z\) from the normalized \(z\)-distribution probability table. (The tabulated probability depends on whether it is taken one-tailed or two-tailed; it can then be compared with a chosen level \(\alpha\), i.e. 1 minus the confidence level. Unlike Student’s t-distribution further below, the \(z\)-distribution does not depend on the number of degrees of freedom.)
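The calculation can be sketched as follows: solve the relation above for \(z\), then read off the tail probability. The function name and the example numbers are assumptions for illustration, and `statistics.NormalDist` stands in for the printed \(z\) table.

```python
from math import sqrt
from statistics import NormalDist


def sample_mean_probability(x_bar, mu, sigma, n, two_tailed=True):
    """Return (z, p) for an observed sample mean.

    z is the sample mean expressed in units of sigma / sqrt(n).
    p is the probability of a sample mean at least this far from mu
    (in both directions if two_tailed, otherwise only in the observed one).
    """
    z = (x_bar - mu) / (sigma / sqrt(n))
    tail = 1.0 - NormalDist().cdf(abs(z))   # area beyond |z| on one side
    p = 2.0 * tail if two_tailed else tail
    return z, p


# Hypothetical population and sample values.
z, p_two = sample_mean_probability(x_bar=105.0, mu=100.0, sigma=15.0, n=36)
_, p_one = sample_mean_probability(105.0, 100.0, 15.0, 36, two_tailed=False)

print(f"z = {z:.2f}")                  # z = 2.00
print(f"two-tailed p = {p_two:.4f}")   # about 0.0455
print(f"one-tailed p = {p_one:.4f}")   # about 0.0228
```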
In the above, the population standard deviation (or variance) was assumed to be known. If it is not known, it can be estimated from the sample, but care must be taken. The population standard deviation \(\sigma\) can be estimated from the sample standard deviation \(s\) if \(n\ge30\). If \(n<30\) (and the variance/standard deviation of the population is unknown!), use Student’s t-distribution instead (see further below).
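The text does not spell out how \(s\) is computed; the sketch below assumes the usual definition with \(n-1\) in the denominator, \(s=\sqrt{\frac{1}{n-1}\sum_i (x_i-\overline{x})^2}\), which is also what `statistics.stdev` returns. The readings are invented and deliberately few, so they only illustrate the formula; with \(n=8\) the normal method would not apply.

```python
from math import sqrt
from statistics import stdev


def sample_std(values):
    """Sample standard deviation with the n - 1 denominator (assumed definition)."""
    n = len(values)
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


# Hypothetical measurements, only to illustrate the formula.
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s = sample_std(readings)
print(round(s, 4), round(stdev(readings), 4))   # both values agree

# Rule from the text: s may stand in for sigma only for a large sample.
can_use_normal_method = len(readings) >= 30
print(can_use_normal_method)                    # False for these 8 readings
```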
Confidence Intervals
The two-sided confidence interval on the inferred population mean \(\mu\) from an obtained sample mean \(\overline{x}\) and a known population standard deviation \(\sigma\) (or, for \(n\ge30\), one estimated by setting \(\sigma\) equal to the sample standard deviation \(s\), i.e. \(\sigma=s\)) is:
$$ \mu = \overline x \pm z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} $$
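Here \(z_{1-\frac{\alpha}{2}}\) is the \(z\) value below which a fraction \(1-\frac{\alpha}{2}\) of the \(z\)-distribution lies, with \(\alpha\) equal to 1 minus the confidence level (for 95% confidence, \(z_{0.975}\approx1.96\)). Analogously for one-sided confidence intervals: use only the \(+\) or the \(-\), and look up \(z_{1-\alpha}\) instead of \(z_{1-\frac{\alpha}{2}}\) in the table.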
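A minimal sketch of both interval types, assuming a 95% confidence level and invented sample values. `NormalDist().inv_cdf` from the standard library plays the role of the reverse table look-up; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided interval: x_bar -/+ z_(1 - alpha/2) * sigma / sqrt(n)."""
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width


def z_upper_bound(x_bar, sigma, n, confidence=0.95):
    """One-sided upper bound: x_bar + z_(1 - alpha) * sigma / sqrt(n)."""
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return x_bar + z_crit * sigma / sqrt(n)


# Hypothetical sample: mean 105 from 36 measurements, sigma = 15.
low, high = z_interval(105.0, 15.0, 36)
upper = z_upper_bound(105.0, 15.0, 36)

print(f"two-sided 95%: [{low:.2f}, {high:.2f}]")   # about [100.10, 109.90]
print(f"one-sided 95%: mu <= {upper:.2f}")         # about 109.11
```

The one-sided bound is tighter than the upper end of the two-sided interval because all of \(\alpha\) is placed in a single tail.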
Student’s t-Distribution Method
Probability of Sample Mean given known Population Mean and unknown Standard Deviation and a Small Sample Size, \(n<30\)
Given a known population mean \(\mu\), and a sample mean \(\overline{x}\) and sample standard deviation \(s\) computed from our sample, we can compute the \(t\)-value using the following formula.
$$ t = \frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}} $$
Once you have the value of \(t\), look up its probability in the Student’s t-distribution table (or calculate it with the corresponding method from our ProbabilityTables class). This is the probability with which such a sample mean would occur in such a population. It depends on the number of degrees of freedom \(\nu=n-1\) (where \(n\) is the sample size) and on whether the probability is taken one-tailed or two-tailed; it can then be compared with the chosen level \(\alpha\), i.e. 1 minus the confidence level.
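The sketch below computes \(t\) and its tail probability without a printed table. It does not use the course's ProbabilityTables class, whose interface is not shown here; instead it integrates the Student's t probability density numerically with Simpson's rule, using only the standard library. All function names and sample values are assumptions for illustration.

```python
from math import exp, lgamma, pi, sqrt


def t_value(x_bar, mu, s, n):
    """t = (x_bar - mu) / (s / sqrt(n))."""
    return (x_bar - mu) / (s / sqrt(n))


def t_pdf(t, nu):
    """Probability density of Student's t with nu degrees of freedom."""
    c = exp(lgamma((nu + 1) / 2) - lgamma(nu / 2)) / sqrt(nu * pi)
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)


def t_cdf(t, nu, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|] plus symmetry about zero."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3                # area under the density from 0 to |t|
    return 0.5 + area if t > 0 else 0.5 - area


# Hypothetical small sample: n = 9, mean 104, s = 6, population mean 100.
n = 9
t = t_value(x_bar=104.0, mu=100.0, s=6.0, n=n)
nu = n - 1
p_one = 1.0 - t_cdf(abs(t), nu)         # one-tailed
p_two = 2.0 * p_one                     # two-tailed

print(f"t = {t:.2f}, nu = {nu}")        # t = 2.00, nu = 8
print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

With these numbers the two-tailed probability is roughly 0.08, so the sample mean would not be judged unusual at \(\alpha=0.05\).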
This procedure is to be applied if the sample size is small, i.e. the number of samples is \(n<30\). If the number of samples is \(n\ge30\), you can use the normal (Gaussian) distribution method above instead, and estimate the population variance (and standard deviation) from the sample. (The sample size is then large enough that the estimate of the variance from the sample is assumed to be sufficiently accurate.)
In all cases it is assumed that the actual population is Gaussian distributed. Student’s t-distribution method simply takes into account the larger uncertainty in the estimate of the population variance from the sample, if the sample size is small. If the population variance (or standard deviation) is known, use the Gaussian probability distribution method, regardless of sample size.
Confidence Intervals
The two-sided confidence interval on the inferred population mean \(\mu\) from an obtained (measured and computed) sample mean \(\overline{x}\) and sample standard deviation \(s\) is:
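$$ \mu = \overline x \pm t_{1-\frac{\alpha}{2},\,\nu} \frac{s}{\sqrt{n}}, $$
where \(t_{1-\frac{\alpha}{2},\,\nu}\) is the \(t\) value, for \(\nu=n-1\) degrees of freedom, below which a fraction \(1-\frac{\alpha}{2}\) of the t-distribution lies, and \(\alpha\) is 1 minus the confidence level. Analogously for the one-sided case: use only the \(+\) or the \(-\), and look up \(t_{1-\alpha,\,\nu}\) in the table.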
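As a sketch of the table-based procedure, the snippet below stores the standard two-sided 95% critical values of Student's t for \(\nu=1\) to \(29\) (rounded to three decimals, as in common printed tables) and builds the interval from them. The dictionary and function names are hypothetical, and only the 95% level is covered.

```python
from math import sqrt

# Two-sided 95% critical values t_(0.975, nu) for nu = 1 .. 29,
# rounded to three decimals as in common printed t tables.
T_975 = dict(zip(range(1, 30), [
    12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
    2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
    2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045,
]))


def t_interval_95(x_bar, s, n):
    """Two-sided 95% interval: x_bar -/+ t_(0.975, n-1) * s / sqrt(n)."""
    nu = n - 1                          # degrees of freedom
    if nu not in T_975:
        raise ValueError("table covers sample sizes 2 to 30 only")
    half_width = T_975[nu] * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical small sample: n = 9, mean 104, s = 6.
low, high = t_interval_95(x_bar=104.0, s=6.0, n=9)
print(f"95% interval: [{low:.3f}, {high:.3f}]")   # [99.388, 108.612]
```

For comparison, the corresponding \(z\) value is about 1.96, so for the same numbers the t-based interval is wider, reflecting the extra uncertainty in \(s\) for a small sample.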