All Machine Learning Researchers should know this
Most machine learning and mathematical problems involve extrapolating from a subset of data to infer properties of a global population. As an example, we may only get 100 replies to a survey about our new website, whereas our target market is 10 million customers. It’s infeasible to ask all 10 million customers what they think, so we have to infer from the feedback of those 100.
Probability distributions tell us the likelihood of different outcomes. Conceptually, they can tell us “Event A is probably not going to happen, but Event B is much more likely”, or “The likelihood of Z > 5 is quite low”. We generally think of distributions as applying to data; however, when you delve deeper into the theoretical side of statistics, you find that statistics themselves have their own distributions.
As practitioners of machine learning and data science, we use the sample mean day in and day out. Therefore, we need to understand the dynamics and limitations of our most-used tool.
With the above example, say we have a new feature that we want to incorporate into our business. We can’t ask all 10 million customers what they think of the new feature before we integrate it, so instead we find a small group (our sample) of customers and calculate the mean result for that group. But would the results change if we measured this on another group of customers? What if we found 100 groups of 20 customers: would each group give the same result?
In this example, we are ‘sampling’ a small subset (through groups) of our population (of 10 million customers) to try to approximate what the population as a whole thinks. We use the sample mean to approximate the population mean, and because it is an approximation, it has a distribution of its own.
Now, as many distributions are characterised by their mean and variance, we can take a big step towards defining the distribution of the sample mean by first deriving its mean and variance.
From this, we can achieve our goal of saying: “From our experiment, the average customer is X% likely to approve the new feature, and we are confident in that percentage to within +/- Y%.”
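As a sketch of how such a statement might be computed, the snippet below simulates a survey of 100 yes/no replies; the 60% true approval rate, the seed, and the normal-approximation confidence interval are illustrative assumptions, not from the article:

```python
import numpy as np

# Hypothetical survey: 100 replies, 1 = approves the feature, 0 = does not.
# The 60% true approval rate is an assumption for the simulation.
rng = np.random.default_rng(0)
replies = rng.binomial(1, 0.6, size=100)

p_hat = replies.mean()                              # sample proportion (our X%)
se = np.sqrt(p_hat * (1 - p_hat) / len(replies))    # standard error of the proportion
margin = 1.96 * se                                  # ~95% normal-approximation margin (our Y%)

print(f"Approval: {p_hat:.0%} +/- {margin:.0%}")
```

The 1.96 multiplier comes from the normal distribution (95% of mass lies within 1.96 standard deviations), which is justified by the Central Limit Theorem discussed later.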
Mean of the Sample Mean
To calculate the expectation of any statistic, you can simply wrap an expectation around the functional form of the statistic:
The derivation continues by calculating the expectation of this formula. We take out the constant (1/n) and are left with a sum of expectations of the variables X (which are all independent). These are all the same (μ), so we have (nμ)/n = μ, and thus the expectation of the sample mean is exactly the population mean.
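In symbols, for independent samples X₁, …, Xₙ each with expectation μ, the derivation reads:

```latex
\mathbb{E}\left[\bar{X}\right]
  = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
  = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]
  = \frac{1}{n}\, n\mu
  = \mu
```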
This result is huge because it proves that the sample mean is an unbiased approximation of the population mean. So, for example, if we have a group of 100 customers sampled independently from a population of 10 million customers, then the mean of this small sample is a very good estimate of the population mean (within a certain range). We can now make more powerful predictions with significantly less data.
Now that we’ve derived the first moment of our distribution, let’s move onto the second moment.
Variance of the Sample Mean
The proof for the variance of the sample mean is equally simple.
Again we see that the variance function can feed through the sample mean statistic to extract the constant, which comes out squared as 1/n². We are left with our X variables which, because they are independent, have variances that simply add together. From here, it’s a simple substitution and rearrangement to arrive at the final equation.
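In symbols, with independent samples X₁, …, Xₙ each of variance σ², the derivation reads:

```latex
\operatorname{Var}\left(\bar{X}\right)
  = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{1}{n^2}\, n\sigma^2
  = \frac{\sigma^2}{n}
```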
The end result tells us that the variance of the sample mean is the variance of the underlying data divided by the number of data points in each sample. This is another huge result, as it tells us by how much the variance of our mean estimate decreases as N increases. We can almost fully approximate how the distribution of our sample mean should look. For example: with normally distributed data (variance = 1), when N = 100, the standard error of the sample mean is 0.1 (as 1 divided by the square root of 100 = 0.1), so a sample mean would typically fall within about +/- 0.1 of the population mean.
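This scaling is easy to check numerically. The following sketch (the population size, seed, and repetition count are arbitrary choices for illustration) compares the empirical standard deviation of sample means against σ/√n:

```python
import numpy as np

# Illustrative check of the sigma/sqrt(n) result; sizes and seed are arbitrary.
rng = np.random.default_rng(1)
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # sigma = 1

n = 100
sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

empirical_se = np.std(sample_means)    # spread of the sample means
theoretical_se = 1.0 / np.sqrt(n)      # sigma / sqrt(n) = 0.1
print(empirical_se, theoretical_se)
```

The two numbers agree closely, confirming that averaging 100 points shrinks the standard deviation of the estimate by a factor of 10.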
Distribution of the Sample Mean: Central Limit Theorem
The expressions derived above for the mean and variance of the sampling distribution are neither difficult nor new. What really stands out, however, is the Central Limit Theorem: regardless of the shape of the parent population (be it from any distribution), the sampling distribution of the sample mean approaches a normal distribution.
The Central Limit Theorem (CLT) proves (under certain conditions) that when independent random variables are added together, their sum, in the limit (asymptotically), converges towards a normal distribution, even if the original variables themselves are not normally distributed. This approximation is generally considered reasonable when N > 25 (proof and demonstration).
Note: The proof of the CLT is not a short proof so I’ve left it out from this article.
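The theorem is easy to see empirically, though. As an illustrative sketch (the exponential parent distribution, seed, and sample sizes are assumptions chosen for demonstration), we can compare the skewness of a heavily skewed parent population against the skewness of its sample means:

```python
import numpy as np

rng = np.random.default_rng(2)

# Heavily skewed parent population (exponential), nothing like a bell curve.
parent = rng.exponential(scale=1.0, size=100_000)

def skewness(x):
    # Third standardised moment: 0 for a symmetric (e.g. normal) distribution.
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

n = 50  # above the rough N > 25 threshold mentioned above
means = np.array([rng.choice(parent, size=n).mean() for _ in range(5_000)])

print(skewness(parent))  # roughly 2: strongly skewed
print(skewness(means))   # close to 0: the sample means look near-normal
```

Even though the parent is far from normal, the distribution of the sample means loses almost all of its skew, exactly as the CLT predicts.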
The mean and variance derived above characterise the shape of the distribution, and now that we also know the asymptotic distribution, we can infer even more with even less data. The characteristics of the normal distribution are extremely well covered, so we can use this knowledge to better understand the dynamics of our sample mean estimate.
Now, given that we have shown that our sample mean statistic has a mean of μ and a variance of σ²/n, let us demonstrate in practice what happens when we repeatedly calculate the sample mean, and see whether its distribution looks normal or not.
Bootstrap methods use Monte Carlo simulation to approximate a distribution. In the below example (and as per the code at the end), we generate a population of 10,000 points from a normally distributed random number generator. From this population, we draw a sample of 100 points, calculate its mean, record it, and repeat (1,000 times).
Now it’s awesome to see that the mean of the sample means is quite close to the mean of the underlying normal distribution (0), which we expected given that the expectation of a sample mean approximates the mean of the population. Moreover, the standard deviation of the sample means is about 0.1, which is also correct, as the standard deviation = root(σ²/N) = 1/root(100) = 0.1. Further, the shape of the distribution looks like a bell curve, and if we increase N (from 100 to, say, 10,000), the distribution becomes even tighter around the population mean:
As another example, if we change the underlying population data to be uniformly distributed (say, between 0 and 100) and again repeatedly calculate the sample mean, where each sample contains 100 points, then, as predicted by the central limit theorem, we again converge to a normal distribution:
In the above article, I derived the mean and variance of the sample mean, then used bootstrap techniques to illustrate the central limit theorem: all of which we can use to aid our understanding of the sample mean, which in turn helps us approximate the dynamics of the underlying population.
As a reader, I hope you now see that the distributional properties of such a simple statistic allow for very powerful inferences. Knowing the limitations of your estimates becomes even more important in times of stress. From a business perspective, there is no point building new features or changing strategy if you know the goal lies outside the limits your current estimates support. As such, you can use these methods to guide your inference and make sensible decisions along the way.
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate Population Data
df = pd.DataFrame(np.random.randn(10000, 1))  # normal
# df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 1)))  # uniform
pop_mu = df[0].mean()  # population mean
pop_st = df[0].std()   # population standard deviation

# Generate Sample Means (1,000 samples, each of size N = 100)
s_mu = [df[0].sample(100).mean() for i in range(1000)]

# Plot Sample Means
sns.histplot(s_mu, kde=True)
plt.title('Sampling Distribution of Sample Mean (1,000 samples where N = 100)')
plt.axvline(x=np.mean(s_mu), label='Mean of Sample Means')
plt.axvline(x=np.mean(s_mu) + np.std(s_mu), label='+/- 1 Std of Sample Means', color='r')
plt.axvline(x=np.mean(s_mu) - np.std(s_mu), color='r')
plt.legend()
plt.show()