The Student t-Distribution

Probability Density Function for the Student t-Distribution.

For the Sake of Statistics, forget the Normal Distribution.

To be clear: this is targeted at Data Scientists/Machine Learning Researchers and not at Physicists.

Statistical normality is overused. It's not as common as we think, and it only really arises in impractical 'limits' [2][3][4]. To invoke normality you need a substantial, well-behaved, independent dataset (the Central Limit Theorem), but most research projects face small sample sizes and data that are not truly independent. We tend to have messy data that we fudge to look normal, when in fact those 'anomalies' in the extremes are telling us something is up.

Lots of things are ‘approximately’ normal. That’s where the danger is.

The point of this article is not to talk about kurtosis, but rather to discuss why phenomena within society do not follow the Normal Distribution, why extreme events are more likely than we expect, and why relaxing certain constraints makes you realise that the Student t-Distribution is more prevalent than commonly thought. Most importantly, it asks why researchers assume normality when the data are just not normal.

Unlikely events are actually more likely than we expect, not because of kurtosis, but because we're modelling the data wrong to begin with. Say we forecast the weather from a 100-year record: is 100 years of data enough to assume normality? No. The world has been around for millions of years. We underestimate the tails simply because we don't consider the limits of our data: 100 years is just a sample of the entire history, most of which we'll never see.

Moreover, Limpert and Stahel (2011, which we discuss later: [4]) examine this exact phenomenon: how often a symmetric normal distribution is assumed, tests are run and conclusions are drawn, and yet how researchers have clearly misinterpreted their results because of that trust in symmetric normality.

As a statistician, it's important to know what you don't know, and to know the distribution you're basing your inferences on. Remember that we primarily use the normal distribution for inference because it offers tidy closed-form solutions (think OLS regression, Gaussian Processes, etc.), but in reality the harder-to-handle distributions, for all the difficulty of working with them, often make better predictions.

Firstly, let’s talk about the Mathematics of t-Distributions.

Feel free to skip the maths and jump ahead to more discourse.


Where did the Student t-distribution come from?

While working at the Guinness Brewery in Dublin, Ireland, William Sealy Gosset published a paper in 1908 under the pseudonym 'Student' detailing his statistical work on the "frequency distribution of standard deviations of samples drawn from a normal population". There is some debate about who actually came up with the Student t-distribution first, since work by Helmert and Lüroth came a little earlier (1875–1876), but let's focus on the maths.

Assume we have a random variable drawn from a normal distribution with mean μ and variance σ². If we take a sample of size n and compute the sample mean x̄, then the standardised variable

z = (x̄ − μ) / (σ / √n)

still follows a normal distribution, but now with a mean of 0 and a variance of 1. We've normalised the variable, or rather, we've standardised it.

However, imagine we have a relatively small sample (say n < 30) drawn from a larger population and we do not know σ. Our estimate of the mean stays the same, but we must now plug in the sample standard deviation s, which uses a denominator of n − 1 (Bessel's correction), in place of σ.

Because of this, our attempt to standardise the variable no longer yields a standard normal, but instead a variable with a different distribution, namely a Student t-distribution of the form:

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation (so s/√n is the standard error), and n is the number of samples; the statistic has n − 1 degrees of freedom.

This is significant because it tells us that even when the underlying variable is normally distributed, the behaviour of the standardised statistic changes completely in small samples, a change bound up with having to estimate the standard deviation and with Bessel's correction.
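To see this concretely, here is a minimal simulation sketch (assuming numpy and scipy are available; the sample size, trial count and parameter values are illustrative choices of mine, not from the original). It standardises the means of many small normal samples using the sample standard deviation and checks how often the result lands beyond ±2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, trials = 5, 100_000           # small sample size, many repeated experiments
mu, sigma = 10.0, 2.0            # true (but, in practice, unknown) parameters

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)           # sample sd with Bessel's correction (n - 1)
t_stat = (xbar - mu) / (s / np.sqrt(n))   # standardised with the *sample* sd

# How often does the statistic land beyond +/- 2?
print("empirical  P(|T| > 2):", np.mean(np.abs(t_stat) > 2))
print("normal predicts:      ", 2 * stats.norm.sf(2))
print("t (df = n-1) predicts:", 2 * stats.t.sf(2, df=n - 1))
```

With n = 5 the statistic strays beyond ±2 noticeably more often than the roughly 4.6% a standard normal would predict, and the t-distribution with n − 1 degrees of freedom accounts for the difference.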

I find it amazing that a small nuance in the formula for the variance has such far-reaching consequences for the distributional properties of the statistic.


What are Degrees of Freedom?

Degrees of freedom are a combination of how much data you have and how many parameters you need to estimate: they measure how much independent information goes into a parameter estimate. Seen this way, you want a lot of information going into each estimate, to obtain more precise estimates and more powerful hypothesis tests (think of how the sample variance behaves as the number of samples grows). So to make better estimates you want many degrees of freedom, but usually the degrees of freedom are tied to your sample size (more precisely, to the n − 1 of Bessel's correction).

Varying the number of degrees of freedom of a Student t-distribution

Degrees of freedom matter for the Student t-distribution because they characterise the shape of the curve: the more degrees of freedom you have, the more bell-shaped the curve becomes and the closer it gets to a standard normal distribution.
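As a quick illustration (a sketch using scipy.stats, with degrees-of-freedom values I have picked arbitrarily), here is how the probability of landing more than two standard units from the centre shrinks towards the normal value as the degrees of freedom grow:

```python
from scipy import stats

# Two-sided tail mass beyond +/- 2 for increasing degrees of freedom,
# compared with the standard normal limit.
for df in (1, 3, 5, 10, 30, 100):
    print(f"df = {df:>3}: P(|T| > 2) = {2 * stats.t.sf(2, df=df):.4f}")
print(f"normal:    P(|Z| > 2) = {2 * stats.norm.sf(2):.4f}")
```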


Proof of Convergence to Normal Distribution

The probability density function of the t-distribution looks complex, but here it is:

f(t) = Γ((ν + 1)/2) / (√(νπ) · Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2)

where ν is the number of degrees of freedom and Γ is the gamma function.
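As a sanity check, the density can be coded directly from this formula and compared against a library implementation; this is just a sketch assuming scipy is available:

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

def t_pdf(x, nu):
    """Student t density written straight from the formula above."""
    coef = gamma((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma(nu / 2))
    return coef * (1 + x ** 2 / nu) ** (-(nu + 1) / 2)

x = np.linspace(-4, 4, 9)
for nu in (1, 5, 30):
    assert np.allclose(t_pdf(x, nu), stats.t.pdf(x, df=nu))
print("hand-written density matches scipy.stats.t.pdf")
```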

Its properties are quite fascinating: with ν = 1, the Student t-distribution is exactly the Cauchy distribution, and at the other end of the spectrum it approaches a normal distribution as ν grows (for ν > 30 the two are already hard to tell apart). The proof of the normal limit is as follows. If Xₙ is a t-distributed variable with n degrees of freedom (allowing a location μ and a scale σ), it can be rearranged and written as:

Xₙ = μ + σ · Y / √(χ²ₙ / n)

where Y is a standard normal variable and χ²ₙ is a chi-square random variable with n degrees of freedom, independent of Y. Separately, we know that χ²ₙ can be written as a sum of squares of n independent standard normal variables Z₁, …, Zₙ:

χ²ₙ = Z₁² + Z₂² + … + Zₙ²

and when n tends to infinity, the scaled chi-square variable

χ²ₙ / n = (Z₁² + Z₂² + … + Zₙ²) / n

converges in probability to E[Zᵢ²] = 1 by the law of large numbers.

Moreover, as a consequence of Slutsky's theorem, Xₙ converges in distribution to X = μ + σY, which is therefore normal.
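A short simulation sketch of that representation (the values of μ, σ and the degrees of freedom are illustrative choices of mine) makes the convergence visible: the Kolmogorov-Smirnov distance to the limiting normal shrinks as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, draws = 1.0, 2.0, 200_000   # illustrative location and scale

for n in (2, 10, 100, 1000):
    y = rng.standard_normal(draws)             # Y ~ N(0, 1)
    chi2 = rng.chisquare(n, size=draws)        # chi-square with n degrees of freedom
    x = mu + sigma * y / np.sqrt(chi2 / n)     # the representation above
    # Kolmogorov-Smirnov distance from the limiting N(mu, sigma^2)
    ks = stats.kstest(x, "norm", args=(mu, sigma))
    print(f"n = {n:>4}: KS distance from the normal limit = {ks.statistic:.4f}")
```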


Examples of failures of the Normal Distribution

I've explained in the introduction why the normal distribution isn't relevant a lot of the time, and I've shown how the Student t-distribution is intrinsically related to both the Cauchy and the normal distribution. Now let's look at examples where the normal distribution (and its dynamics) is assumed to be fundamental, but where the results are in reality skewed, calling any reliance on normal assumptions into question.

Consider the case of being '95% confident'. Assuming a normal distribution, roughly 95% of your results should fall within 2σ of the mean, symmetrically, so any result outside this range is treated as anomalous. Limpert and Stahel (2011) show that skewness in real data distorts the symmetry postulated by authors across many different fields, and that the resulting miscalculation of likelihoods is an issue plaguing all of them.
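To make the point concrete, here is a hedged sketch using a lognormal sample as a stand-in for the kind of skewed measurements Limpert and Stahel describe (the distribution and its parameters are my illustrative choices): the 'mean ± 2 standard deviations' interval neither covers 95% of the data nor splits the misses symmetrically.

```python
import numpy as np

rng = np.random.default_rng(7)
# A right-skewed stand-in for real measurements (illustrative choice only).
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

m, sd = data.mean(), data.std(ddof=1)
lower, upper = m - 2 * sd, m + 2 * sd

print(f"mean +/- 2 sd interval: [{lower:.2f}, {upper:.2f}]")
print(f"coverage inside the interval: {np.mean((data > lower) & (data < upper)):.3f}")
print(f"share below the lower bound:  {np.mean(data < lower):.3f}")
print(f"share above the upper bound:  {np.mean(data > upper):.3f}")
```

Here the misses are entirely on one side: nothing falls below the lower bound, while a few percent of the data sits above the upper one.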

Fields where this has caused problems are as wide-ranging as you can imagine.

Potvin and Roff (1993), for instance, argue that non-normality is widespread in ecological data and examine alternative non-parametric statistical methods, while Micceri (1989) compares the prevalence of the normal distribution in psychometric measures to that of the unicorn and other improbable creatures.

These are serious accusations, and once you go through the literature with a fine-tooth comb it becomes pretty clear that not everything that seems normal actually is.


I've explained, with some frustration, why statistical normality is overused and why it has caused so many failed out-of-sample experiments. We assume too much and only reluctantly let the data speak. Other times, we let the data speak too much and ignore its limitations.

We need to think more about the practical limitations of our data, but also about the fundamental limitations of any distribution we assume. By making realistic assumptions about the full shape of the data, we would find that more conservative estimates tend to perform significantly better out of sample.
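As a closing sketch (entirely synthetic: the heavy-tailed data and the train/test split are illustrative assumptions of mine, not the author's experiment), fitting both a normal and a Student t to one half of a dataset and scoring them on the held-out half shows the kind of out-of-sample gain a heavier-tailed assumption can give:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic heavy-tailed observations (illustrative only): a t with 4 degrees of freedom.
data = stats.t.rvs(df=4, loc=0.0, scale=1.0, size=4_000, random_state=rng)
train, test = data[:2_000], data[2_000:]

# Fit both families on the training half.
norm_params = stats.norm.fit(train)   # (loc, scale)
t_params = stats.t.fit(train)         # (df, loc, scale)

# Compare the average held-out log-likelihood per observation.
ll_norm = stats.norm.logpdf(test, *norm_params).mean()
ll_t = stats.t.logpdf(test, *t_params).mean()
print(f"held-out log-likelihood, normal fit: {ll_norm:.3f}")
print(f"held-out log-likelihood, t fit:      {ll_t:.3f}")
```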


Thanks for reading! Please message me if you have any more questions!


References

  1. Helmert, F.R. (1875). Über die Berechnung des wahrscheinlichen Fehlers aus einer endlichen Anzahl wahrer Beobachtungsfehler.
  2. Potvin, C. & Roff, D.A. (1993). Distribution-free and robust statistical methods: Viable alternatives to parametric statistics? Ecology 74 (6), 1617–1628.
  3. Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin 105 (1), 156–166.
  4. Limpert, E. & Stahel, W.A. (2011). Problems with using the normal distribution — and ways to improve quality and efficiency of data analysis. PLoS ONE 6 (7), e21403.
