What Could Go Wrong: Linear Regression

Common mistakes people make when doing regression analysis

Khang Pham
8 min read · Feb 7, 2022

Introduction

A peaceful greeting to you. “What Could Go Wrong” is my new series, where we investigate common mistakes made when applying Machine Learning algorithms to real-world problems. This is the first post of the series, and I hope you will enjoy it!

Linear Regression is the good ol’ friend of statisticians and machine learning engineers. It helps us evaluate the relationship between a variable and an outcome while holding other factors constant. The variables we use to explain the outcome are called predictors, explanatory variables, or independent variables. The outcome is called the dependent variable (I will definitely write a blog about fancy names). Heavily inspired by Chapter 12 of the wonderful book “Naked Statistics” by Charles Wheelan (see reference), in this post I will show you some common mistakes one can make when doing regression analysis.

Nonlinearity

Obviously, the first criterion for applying regression analysis is that there exists a linear relationship between the explanatory variables and the outcome variable. Doing linear regression on people’s height and weight is fine; on a family’s income and the performance of their children, okay; but on your car’s age and its value? Hmm, I doubt that [1].


You can see that there is a relationship between your car’s age and its value, but you cannot simply fit a straight line to it and expect it to generalize well. For the first few years after you buy your car, its value will steadily go down. But as time flies, your car becomes some sort of antique and restores its long-lost honor. You can then make a fortune by selling it to an affluent collector at an auction on the black market (if you are still alive, of course).

The point here is that linear regression is meant to be used when there is a linear correlation between variables. There are many more criteria and assumptions you have to respect for your model to be effective, but those are a bit too technical for the scope of this blog post. I suggest you read a textbook to gain a better understanding of the problem.
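
To see the symptom, here is a minimal sketch with completely made-up numbers for the car example: the U-shaped depreciation curve is invented for illustration, and scikit-learn is just one convenient way to compare a straight line with a model that allows curvature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Invented "car value" data: the value drops for a few decades, then rises
# again as the car becomes a collectible, so the relationship is U-shaped.
age = rng.uniform(0, 60, size=(300, 1))
value = 30_000 - 1_200 * age + 20 * age**2 + rng.normal(0, 2_000, size=(300, 1))

straight_line = LinearRegression().fit(age, value)
with_curvature = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(age, value)

print("R² of a straight line:   ", round(straight_line.score(age, value), 3))
print("R² with a quadratic term:", round(with_curvature.score(age, value), 3))
```

On this toy data, a straight line explains essentially none of the variation, while a single quadratic term captures most of it: a quick hint that the linearity assumption was never satisfied in the first place.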

Highly Correlated Predictors (Multicollinearity)

Another pitfall you may step into when doing regression analysis is including multiple explanatory variables that are highly correlated with each other.


This may not lead to a tragic falsehood like many other mistakes in this list, but our model may not give us many meaningful results. To be more concrete, when predictor variables are correlated, the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model [2].

As you may know, in multivariate linear regression we want to examine the effect of an explanatory variable on the outcome while holding all other predictors constant. Thus, when two explanatory variables are highly correlated, holding one constant while investigating the other does not make sense, since changing one predictor would necessarily change the value of the other.
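
A tiny simulation (made-up data, plain NumPy least squares) shows what this looks like in practice: when one predictor is nearly a copy of another, the fitted coefficients swing around depending on which columns are in the model, even though the predictions barely change.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)       # almost an exact copy of x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)   # only x1 truly drives y

# Ordinary least squares with x1 alone, then with both correlated predictors.
X_single = np.column_stack([np.ones(n), x1])
X_both = np.column_stack([np.ones(n), x1, x2])
coef_single = np.linalg.lstsq(X_single, y, rcond=None)[0]
coef_both = np.linalg.lstsq(X_both, y, rcond=None)[0]

print("coefficient of x1 alone:         ", coef_single[1].round(2))
print("coefficients of x1, x2 together: ", coef_both[1:].round(2))
# The total effect (about 2) gets split arbitrarily between the two
# near-identical columns, so neither coefficient is interpretable on its own.
```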

Correlation ≠ Causation

Guess what: the two issues above are not all the trouble that correlation can cause. No correlation (between predictors and outcome) is a NO-NO. Too much correlation (between explanatory variables) is a NO-NO. A dataset with good linearity and no multicollinearity that is badly interpreted is also a NO-NO (and you still think linear regression is a piece of cake?). When we observe a strong association between two variables, we do not have enough proof to conclude that one variable causes the change in the other.


For example, suppose we are searching for potential drivers of the increase in Skyrim’s sales over the last decade. We have an explanatory variable: the market valuation of OpenAI. We would almost certainly find a positive and statistically significant association between them, simply because both rose over that period. In fact, I doubt that there is any relationship between Todd Howard selling us the same game for 10 years and the growth of one of the giants of the AI industry.
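
The effect is easy to reproduce with fake numbers (the series below are simulated, not real sales figures or valuations): any two series that merely trend upward over the same decade will look strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(7)
years = np.arange(2011, 2022)
t = np.arange(len(years))

# Two simulated, completely unrelated series that both trend upward.
game_sales = 1.0 + 0.8 * t + rng.normal(scale=0.5, size=len(t))
company_valuation = 5.0 + 3.0 * t + rng.normal(scale=2.0, size=len(t))

corr = np.corrcoef(game_sales, company_valuation)[0, 1]
print(f"Correlation between the two unrelated series: {corr:.2f}")
# A shared upward trend alone is enough to push the correlation close to 1.
```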

Omitted Variable Bias

Here comes one of the most severe mistakes we can make when doing regression analysis: omitting important variables. There are certain explanatory variables that we must control for, i.e., they must be included in our regression equation. Leaving them out will make the results inaccurate and misleading. This is also the trick that some publishers use to create clickbait titles and draw nonsensical conclusions. You should be skeptical the next time you see an article saying, “Swearing Will Make Students’ Performance Go Down”. The article points out a strong negative association between how often students say cuss words and their test scores in class. What could be wrong with the analysis that drew that conclusion? The answer is the age group of the students. If “age” is not one of the explanatory variables, then the analysis will miss the fact that students who swear more are, on average, older than the other goodies (you can think of the two groups as a class of happy kids vs. a bunch of grumpy 15-year-olds). And it is also very likely that the test scores of younger students are higher than those of their seniors, since they receive (perhaps) easier questions and more forgiving judgment.
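
Here is a toy version of that story, with numbers I made up so that age drives both swearing and test scores while swearing itself has zero effect. Omitting age produces a scary negative coefficient; controlling for it makes the “effect” all but vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Made-up mechanism: older students swear more AND score lower (harder exams),
# but swearing itself has no effect on scores at all.
age = rng.integers(8, 16, size=n).astype(float)
swearing = 0.5 * age + rng.normal(scale=1.0, size=n)     # swear words per day
score = 100 - 3.0 * age + rng.normal(scale=5.0, size=n)  # test score

def ols(X, y):
    """Ordinary least squares; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("score ~ swearing:      ", ols(swearing, score)[1].round(2))
print("score ~ swearing + age:", ols(np.column_stack([swearing, age]), score)[1].round(2))
```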

As you can see, if we omit important explanatory variables, our regression result will be misleading at best, or it will imply the opposite conclusion at worst. So keep that in mind the next time you consider which line to choose when checking out at the supermarket. You should probably prefer the one with 3–4 kiddos buying snacks for their sleepover over the one with Karen, her son, and an overpacked cart.


Sampling Bias

Let us start with a question: do extra summer classes really help weak students who fail an exam? Suppose our dependent variable is the difference between a student’s test scores before and after the summer classes. As usual, we divide our sample into a control group (students who do not register for summer classes) and a treatment group (students who do). After some fancy lines of code, we find that students in the treatment group improve more than students in the other group. Could we confidently decide to force all weak students to take extra summer classes and inform the press that we have solved one of the most intriguing questions of the educational system?

If you think yes, then maybe you have missed one obvious fact: the students who decide to take the extra classes are themselves different from the students who don’t. Maybe the willingness to improve is what makes students in the treatment group achieve a better result. In other words, we do not have control over other variables between the two groups when we do our analysis. We are seeking patterns that generalize to the whole population; however, our sample is not representative enough. This is called sampling bias. In fact, it is often very hard and expensive to get an unbiased sample from the target population. A good statistician must recognize the potential biases that his/her model may produce.


P.S.: in the above example, our methodology for comparing the treatment group with its counterpart is also not so good. One way to better evaluate our variables of interest is to use the difference-in-differences method [3] to isolate the treatment effect.
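
For completeness, the arithmetic behind a difference-in-differences estimate is simple. The scores below are invented, and the approach still leans on the assumption that both groups would have improved in parallel without the summer classes.

```python
import numpy as np

# Invented before/after scores for the two groups of students.
control_before = np.array([52.0, 48, 55, 50])
control_after  = np.array([57.0, 54, 60, 55])
treated_before = np.array([40.0, 45, 38, 42])
treated_after  = np.array([55.0, 58, 52, 56])

# Everyone improves somewhat over the summer (the control group's change);
# the estimated treatment effect is the extra improvement beyond that trend.
control_change = (control_after - control_before).mean()
treated_change = (treated_after - treated_before).mean()
did_estimate = treated_change - control_change

print(f"Control group change:               {control_change:+.2f}")
print(f"Treatment group change:             {treated_change:+.2f}")
print(f"Difference-in-differences estimate: {did_estimate:+.2f}")
```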

Bonus: Reverse Causality

We have mentioned that one risk of inferring causation from correlation is that there may be no causation at all. The other extreme is that causality does exist between the two variables, but in the opposite direction. Suppose we are trying to find the effect of positive thinking on the annual income of a country’s residents. A strong and positive association between the two variables does not necessarily mean that thinking more positively will earn you more money. It could, but the opposite could also be true: people tend to think about brighter things when they have a decent amount in their bank account.

Identifying causality is sometimes a matter of “common sense” [4]. The rule of thumb is: when doing regression analysis, we should not include predictors that might be affected by the outcome we are trying to explain. We should do the research to show that our explanatory variables affect the dependent variable, not the other way around.


Conclusion

Regression analysis is a simple yet effective statistical tool. However, being simple means that there are some potential risks lurking under the hood. Any mistake, from bad sampling to bad interpretation, might cost you your own life and bring an end to our dumb world. Okay, maybe it is not that disastrous, but it will definitely cost you your time (and effort), as the result will be nothing but a mess. I hope that from now on, whenever you decide to fit a linear regression model to your problem, you are well aware of what could go wrong if you do not use it properly.

Thank you so much for reading this. I am looking forward to seeing you in the next post!

References

[1] Conor O’Sullivan, Finding and Visualising Non-Linear Relationships (2021), Towards Data Science.

[2] Iain Pardoe, 12.3 — Highly Correlated Predictors, STAT 501: Regression Methods, Penn State Department of Statistics.

[3] Charles Wheelan, Naked Statistics: Stripping the Dread from the Data (2012).

[4] Stephanie, Reverse Causality: Definition, Examples (2016), Statistics How To.

Goodbye traveler, may your road lead you to warm sands.


Khang Pham

Language enthusiast 🇫🇷 🇬🇧 🇻🇳 | NLP Researcher | Contact me at: https://vkhangpham.github.io