Infinity (and beyond…)
The study of asymptotic distributions looks to understand how the distribution of a phenomenon changes as the number of samples taken into account goes from n → ∞. Say we’re trying to make a binary guess on where the stock market is going to close tomorrow (like a Bernoulli trial): how does the sampling distribution change if we ask 10, 20, 50 or even 1 billion experts?
The understanding of asymptotic distributions has enhanced several fields, so its importance cannot be overstated. Everything from statistical physics to the insurance industry has benefitted from results like the Central Limit Theorem (which we cover a bit later).
However, something that is not well covered is that the CLT assumes independent data: what if your data isn’t independent? People’s views are often not independent of each other, so what then?
Let’s first cover how we should think about asymptotic analysis in a single function.
At first glance, looking towards the limit means seeing what happens to our function or process when we push a variable to its highest value: ∞.
As an example, assume that we’re trying to understand the limits of the function f(n) = n² + 3n. The function f(n) is said to be “asymptotically equivalent to n²” because as n → ∞, n² dominates 3n; at the extreme, the function is pulled far more strongly by the n² term than by the 3n term. Therefore, we say “f(n) is asymptotic to n²”, often written symbolically as f(n) ~ n².
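A quick numerical sketch makes this concrete: the ratio f(n)/n² tends to 1 as n grows, which is exactly what “asymptotic to n²” means.

```python
# Numerical check that f(n) = n**2 + 3*n is asymptotic to n**2:
# the ratio f(n) / n**2 tends to 1 as n grows.
def f(n):
    return n**2 + 3 * n

for n in [10, 1_000, 1_000_000]:
    print(n, f(n) / n**2)
# 10 -> 1.3, 1000 -> 1.003, 1000000 -> 1.000003
```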
Conceptually, this is quite simple, so let’s make it a bit more difficult. Let’s say we have a group of observations that are all kind of similar: each one a draw from a distribution we’re unsure of, e.g. the height of a bouncing ball. How does it behave? What’s the average height of 1 million bounced balls? Let’s see how the sampling distribution changes as n → ∞.
An asymptotic distribution is the limiting distribution of a sequence of distributions.
Imagine you plot a histogram of 100,000 numbers generated from a random number generator: that’s probably quite close to the parent distribution which characterises the random number generator. This intuition is behind the Law of Large Numbers, but it doesn’t really say much about what the sampling distribution converges to at infinity (it just approximates the parent).
For that, the Central Limit Theorem comes into play. In a previous blog (here) I explain a bit behind the concept. This theorem states that the (suitably standardised) sum of a series of independent random variables converges to a normal distribution: a result that is independent of the parent distribution. So whether the parent distribution is normal, or Bernoulli, or Chi-squared, or anything else: when enough samples are added together, the result is normal.
To most readers who are familiar with the Central Limit Theorem, though: remember that this theorem strongly relies on the data being IID. But what if it’s not? What if the data points are dependent on each other? Stock prices are dependent on each other: does that mean a portfolio of stocks has a normal distribution?
The answer is no.
Ledoit and Crack (2009) assume a stochastic process which is not independent: a Gaussian AR(1) process, Xt = ρXt-1 + εt.
As we can see, the functional form of Xt is the simplest example of a non-IID generating process, given its autoregressive properties. The distribution of the sample mean is then derived in the paper (a very involved calculation) to show that the asymptotic distribution is normal, but only at the limit:
however, for all finite values of N (and for any realistic N you can imagine), the variance of the estimator is biased by the correlation exhibited within the parent population.
“You may then ask your students to perform a Monte-Carlo simulation of the Gaussian AR(1) process with ρ ≠ 0, so that they can demonstrate for themselves that they have statistically significantly underestimated the true standard error.”
This demonstrates that when data is dependent, the variance of our estimator is significantly wider and it becomes much more difficult to approximate the population parameter. The sampling distribution begins to look more like a Student-t distribution than a normal distribution.
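A minimal sketch of the Monte-Carlo exercise suggested in the quote above. The values ρ = 0.5, N = 200 and the number of simulations are illustrative choices of mine, not values from the paper; the point is that the naive IID formula s/√N understates the true standard error of the sample mean when ρ > 0.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, N, n_sims = 0.5, 200, 5_000   # illustrative values, not from the paper

sample_means = np.empty(n_sims)
for s in range(n_sims):
    eps = rng.normal(size=N)
    x = np.empty(N)
    x[0] = eps[0]
    for t in range(1, N):
        x[t] = rho * x[t - 1] + eps[t]   # Gaussian AR(1) step
    sample_means[s] = x.mean()

# True standard error of the sample mean vs the naive IID formula s / sqrt(N)
true_se = sample_means.std()
naive_se = x.std(ddof=1) / np.sqrt(N)    # IID formula applied to one sample
print(f"true SE ~ {true_se:.3f}, naive IID SE ~ {naive_se:.3f}")
```

With positive autocorrelation, the naive formula comes out noticeably too small, exactly the underestimation the paper asks students to demonstrate.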
However, given this, which estimator should we choose in light of the dependency structure within the data? Ideally, we’d want a consistent and efficient estimator:
Asymptotic Consistent Estimators
In terms of probability, an estimator is said to be asymptotically consistent when, as the number of samples increases, the resulting sequence of estimators converges in probability to the true parameter value.
Let’s say that our ‘estimator’ is the average (or sample mean) and we want to calculate the average height of people in the world. Now we’d struggle for everyone to take part but let’s say 100 people agree to be measured.
Now we’ve previously established that the sample variance is dependant on N and as N increases, the variance of the sample estimate decreases, so that the sample estimate converges to the true estimate. So now if we take an average of 1000 people, or 10000 people, our estimate will be closer to the true parameter value as the variance of our sample estimate decreases.
Now a really interesting thing to note is that an estimator can be biased and consistent. For example, take an estimator of the mean with a small bias, e.g. the sample mean plus 1/N. As N → ∞, the 1/N term goes to 0 and the estimator converges to μ, so it is consistent despite being biased.
This is why, in some use cases, even though your metric may not be perfect (and is biased), you can actually get a pretty accurate answer with enough sample data.
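A minimal sketch of a biased-but-consistent estimator, using the sample mean plus a 1/N bias and an assumed true mean of 5 (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(7)
true_mu = 5.0

for N in [10, 1_000, 100_000]:
    x = rng.normal(true_mu, 1.0, N)
    biased_estimate = x.mean() + 1 / N   # bias of 1/N, vanishing as N grows
    print(N, round(biased_estimate, 4))  # drifts towards true_mu = 5
```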
An estimator is said to be efficient if it is unbiased and its variance achieves the Cramér-Rao lower bound (the lower bound on the variance of any unbiased estimator). A weaker condition can also be met if the estimator merely has lower variance than all other unbiased estimators (without attaining the Cramér-Rao bound): it is then called the Minimum Variance Unbiased Estimator (MVUE).
Take the sample mean and the sample median, and assume the population data is IID and normally distributed (μ = 0, σ² = 1). From the central limit theorem, the sample mean is distributed ~N(0, 1/N), and asymptotic theory gives the sample median as ~N(0, π/2N).
Now we can compare the variances side by side. For the sample mean you have 1/N, but for the median you have π/2N = (π/2) × (1/N) ≈ 1.57 × (1/N). So the variance of the sample median is approximately 57% greater than the variance of the sample mean.
At this point, we can say that the sample mean is the MVUE, as its variance is lower than the variance of the sample median. This tells us that if we are trying to estimate the average of a population, our sample mean will converge to the true population parameter more quickly, and therefore we’d require less data to get to a point of saying “I’m 99% sure that the population parameter is around here”.
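This π/2 factor can be checked by simulation; a minimal sketch, with the seed and sample sizes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 1_000, 5_000

samples = rng.normal(0, 1, size=(reps, N))    # reps samples of size N
var_mean = samples.mean(axis=1).var()         # variance of the sample mean
var_median = np.median(samples, axis=1).var() # variance of the sample median
print(round(var_median / var_mean, 2))        # close to pi/2 ~ 1.57
```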
As such, when you look towards the limit, it’s imperative to look at how the second moment of your estimator reacts as your sample size increases, as it can make life easier (or more difficult!) depending on whether you choose correctly.
In a number of ways, the above article has described the process by which the reader should think about asymptotic phenomena. First, you should consider what the underlying data is like and how that would affect the distributional properties of sample estimators as the number of samples grows.
Secondly, you should consider which estimator is best for what you’re trying to measure. In some cases a median is better than a mean (e.g. for data with outliers), but in other cases you would go for the mean (it converges more quickly to the true population mean).
In either case, as Big Data becomes a bigger part of our lives, we need to be cognisant that the wrong estimator can bring about the wrong conclusion. This can cause havoc as the number of samples goes from 100 to 100 million. Therefore, it’s imperative to get this step right.
Message if you have any questions — always happy to help!
Code for Central Limit Theorem with Independent Data
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate population data
df = pd.DataFrame(np.random.randn(10000, 1))  # normal
# df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 1)))  # uniform
pop_mu = df.mean(axis=0)[0]
pop_st = df.std(axis=0)[0]

# Generate sample means: 1,000 samples, each of size N = 100
s_mu = [df.sample(100).mean()[0] for i in range(1000)]

# Plot the sampling distribution of the sample mean
plt.hist(s_mu, bins=30)
plt.title('Sampling Distribution of the Sample Mean (1,000 samples of size N = 100)')
plt.axvline(x=np.mean(s_mu), label='Mean of sample means')
plt.axvline(x=np.mean(s_mu) + np.std(s_mu), label='± 1 std of sample means', color='r')
plt.axvline(x=np.mean(s_mu) - np.std(s_mu), color='r')
plt.legend()
plt.show()