Robust Statistical Methods

Anomalies hidden in plain sight. Chart from Liu and Nielsen (2016)

Methods that Data Scientists Should Love

A robust statistic is a type of estimator used when the distribution of the data set is uncertain, or when egregious anomalies exist. If we’re confident in the distributional properties of our data set, then traditional statistics like the Sample Mean are well positioned. However, if our data has some underlying bias or oddity, is the Sample Mean still the right estimator to use?

Let’s imagine a situation where the data isn’t so friendly.

Let’s take an example that involves the sample mean estimator. We know that the sample mean gives every data point a 1/N weight, which means that if a single data point goes off to infinity, the sample mean goes with it: that one point contributes ∞/N = ∞ to the estimate.

This is at odds with our sample median, which is barely affected by any single value being ±∞. That’s because the sample median depends only on the ordering of the data, not on the magnitude of every data point. In fact, we can say that the sample median is resistant to gross errors whereas the sample mean is not.

A gross error is a data point that is badly wrong or misleading (usually 3σ or more away from the bulk of the data)

In fact, the median will tolerate up to 50% gross errors before it can be made arbitrarily large; we say its breakdown point is 50% whereas that for the sample mean is 0%.

The breakdown point of an estimator is the proportion of gross errors an estimator can withstand before giving an abnormal result.
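
To make the breakdown point concrete, here is a minimal sketch in NumPy (the sample size and the value of the gross error are illustrative assumptions of mine): one wild value drags the sample mean far away, while the sample median barely moves.

```python
import numpy as np

rng = np.random.default_rng(42)
clean = rng.normal(loc=0.0, scale=1.0, size=99)   # well-behaved N(0, 1) data

# Inject a single gross error, far beyond 3 sigma
corrupted = np.append(clean, 10_000.0)

print("mean   clean: %6.3f   corrupted: %10.3f" % (clean.mean(), corrupted.mean()))
print("median clean: %6.3f   corrupted: %10.3f" % (np.median(clean), np.median(corrupted)))
```

With one gross error in 100 points, the mean jumps by roughly 10,000/100 = 100, while the median shifts by at most one order statistic.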

Robust statistics are often favoured over traditional sample estimators because of their higher breakdown points. It’s not unusual for data to contain anomalies when the recording involves some manual effort; even so, the mean and median should normally be quite close. If you suspect that your underlying data contains gross errors, then it’s worthwhile using a robust statistic.

Let’s first look at what outliers mean in terms of relative efficiency.


Relative Efficiency

Relative efficiency compares the variances of two sample estimators. We previously saw that if the data is well behaved, the variance of a sample estimator goes to 0 as n goes to ∞. We also saw that for normally distributed data, the sample median has a lower efficiency than the sample mean (its asymptotic relative efficiency is only about 64%). But what if the data is not normally distributed?

If we have Student t-distributed data with 5 degrees of freedom, the sample median has a much higher relative efficiency and becomes a genuinely competitive estimator of the population mean: its Asymptotic Relative Efficiency (ARE) rises to about 96%, and with even heavier tails it overtakes the sample mean altogether.

Let’s take stock returns as an example: they are roughly Student t-distributed with about 5–7 degrees of freedom, so given the discussion above, the median is a rather good metric here.

For heavy-tailed financial data, the Sample Median is roughly as efficient as the Sample Mean, and far more robust to gross errors
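
As a rough check of those efficiency figures, here is a small Monte Carlo sketch (the sample size, replication count and random seed are arbitrary choices of mine) estimating the relative efficiency of the median, i.e. var(sample mean) / var(sample median), under normal and Student t(5) data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 10_000

def relative_efficiency(draw):
    """var(sample mean) / var(sample median) across many simulated samples."""
    samples = draw(size=(reps, n))
    return samples.mean(axis=1).var() / np.median(samples, axis=1).var()

print("Normal data :", round(relative_efficiency(rng.standard_normal), 2))
print("Student t(5):", round(relative_efficiency(lambda size: rng.standard_t(df=5, size=size)), 2))
```

Under normal data the ratio comes out near 0.64; under t(5) data it climbs towards the roughly 0.96 quoted above.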

If you can smell something fishy in your data, I recommend using estimators with higher breakdown points that still retain a good degree of efficiency. Let’s look at robust regression methods.


M-Estimators in Robust Regression

OLS Regression applies a certain amount of weight to every datapoint:

Closed-form OLS regression coefficient: β̂ = (XᵀX)⁻¹XᵀY, which for a single regressor through the origin reduces to β̂ = Σᵢ xᵢyᵢ / Σᵢ xᵢ² [source]

Say X ~ N(0,1) and Y ~ N(0,1). If a single point has x₁ = 1, its contribution to beta is (x₁·y₁)/(x₁·x₁) = y₁. Since y₁ is also standard normal, and both variables have the same variance (so the regression coefficient equals the correlation), we would expect beta to land somewhere between −1 and +1.

However, suppose y₁ was accidentally stored as 10,000 (you can blame the intern): this single point’s contribution to the estimate jumps from roughly 1 to 10,000! That’s clearly not what we want!
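
Here is a quick sketch of that scenario (the no-intercept fit, the 100-point sample and the 10,000 value are illustrative assumptions), comparing the closed-form OLS slope before and after the corruption:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
x[0] = 1.0                                  # the x1 = 1 point from the text
y = x + rng.normal(scale=0.5, size=100)     # true slope is 1

def ols_slope(x, y):
    # Closed-form no-intercept OLS coefficient: sum(x * y) / sum(x * x)
    return (x * y).sum() / (x * x).sum()

print("slope, clean data    :", round(ols_slope(x, y), 3))

y_bad = y.copy()
y_bad[0] = 10_000.0                         # the intern's typo
print("slope, one bad point :", round(ols_slope(x, y_bad), 3))
```

A single corrupted y value of 10,000 shifts the slope by roughly 10,000 / Σxᵢ² ≈ 100, completely swamping the true coefficient of 1.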

Regressions are thus very sensitive to anomalous data points (the squared loss means a large error has a quadratic, outsized influence), and given the above discussion we would prefer an estimator with a higher breakdown point and a good degree of efficiency. This ensures our estimator doesn’t get thrown around by rogue data points, so if a potential lack of normality in the data is worrying, the researcher should use robust estimation methods:

M-estimators generalise Maximum Likelihood Estimation (MLE). MLE maximises the joint likelihood of the data, whereas an M-estimator minimises the sum of a loss function ρ applied to the residuals:

The M-estimation problem: β̂ = argmin_β Σᵢ ρ(yᵢ − xᵢᵀβ) [source]

The astute reader will quickly see that linear regression is itself a type of M-estimator (ρ(r) = r², i.e. minimise the sum of squared residuals), but it is not robust. Below are four other choices of ρ, and more can be found here:

Different choices of functions for your M-Estimator [source]

As an example, Least Absolute Deviation (LAD) estimates the coefficients that minimise the sum of absolute residuals rather than the sum of squared residuals. This makes LAD resistant to outliers and to departures from the normality assumption, though it is computationally more expensive.
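
For a concrete comparison, here is a hedged sketch using statsmodels; the simulated data, the number of gross errors and the specific choices of Huber’s T and LAD (median) regression are my assumptions, and any of the ρ functions above could be swapped in:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=200)     # true intercept 0, slope 2
df.loc[:4, "y"] = 50.0                                        # a handful of gross errors

X = sm.add_constant(df["x"])

ols = sm.OLS(df["y"], X).fit()                                # classic least squares
huber = sm.RLM(df["y"], X, M=sm.robust.norms.HuberT()).fit()  # Huber M-estimator
lad = smf.quantreg("y ~ x", df).fit(q=0.5)                    # LAD / median regression

for name, params in [("OLS", ols.params), ("Huber", huber.params), ("LAD", lad.params)]:
    print(f"{name:6s} intercept = {params.iloc[0]:7.3f}   slope = {params.iloc[1]:7.3f}")
```

The true intercept and slope are 0 and 2; the OLS fit gets pulled towards the gross errors, while the Huber and LAD fits down-weight or ignore them and land much closer to the truth.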

As a practitioner, I would encourage researchers to try multiple methods because there’s no hard and fast rule. It’s much more convincing to show several estimators giving similar results than to present a single, hard-to-explain set of numbers.

As a final point, remember that M-estimators are only asymptotically normal, so even when samples are large the approximation can still be very poor. It all depends on the type and size of the anomaly!


In the above article, we broadly discussed the field of robust statistics and why a practitioner should approach their data with caution. Normal data may exist in theory, but in practice excess kurtosis plagues reality. Experiments on fatter-tailed (Student t-distributed) data highlight that the sample median is nearly as efficient as the sample mean, and far more robust, but I generally like to put both side by side to see any noticeable differences. Further, robust regression methods offer a higher breakdown point and give more realistic estimates, though they are slower to compute.

Robust statistics are a bit of an art because sometimes you need them and sometimes you don’t. Ultimately every data point carries information, so leaving some out (or down-weighting certain ones) is rarely desirable. Given that limitation, I always encourage researchers to use multiple statistics in the same experiment so that you can compare results and get a better feel for the relationships, because, after all, one ‘good’ result may just be luck.


Thanks for reading! If you have any questions please message — always happy to help!


References

  1. Huber, P. J. (1981). Robust Statistics. New York: Wiley.
  2. Little, T. D. (Ed.). The Oxford Handbook of Quantitative Methods in Psychology. Retrieved October 14, 2019.
  3. Liu, X., & Nielsen, P. S. (2016). Regression-based Online Anomaly Detection for Smart Grid Data. arXiv:1606.05781.
