Explain 'Multi-col-linearity' in data processing

Question

Can anyone please explain the term 'multi-col-linearity' (multicollinearity) from machine learning data processing in simple words? The term is very important in data processing, but the explanations I have found are confusing.

So, please explain it in simple words, as I am new to ML using Python.

This isn't a Python topic, or exclusively a machine learning topic; it's a concept in statistics that is relevant to machine learning and/or Python. I would recommend checking out https://stats.stackexchange.com/questions/1149/is-there-an-intuitive-explanation-why-multicollinearity-is-a-problem-in-linear-r - L Robin, Apr 29 '20 at 17:17

score 1 · Answer 1 · answered Apr 29 '20 at 17:08

Multicollinearity means that two or more of your predictor variables have a strong linear relationship with one another. This can be problematic when training models, because you are basically training on two versions of the same variable, which can skew the results (and any hyper-parameter tuning) if not handled correctly. It is especially problematic for regression-based models, where the individual coefficient estimates become unstable and hard to interpret.

An example could be the number of reviews and the number of downloads for a video game. If we are trying to predict the price, regressing on both the number of reviews and the number of downloads would be somewhat redundant since, generally, the more people play the game, the more reviews there are.
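
To make the video game example concrete, here is a minimal sketch using NumPy. All of the numbers are made up for illustration (the 0.05 reviews-per-download ratio, the noise levels, and the price formula are assumptions, not real data). It checks how strongly the two predictors are correlated, then compares how much the fitted downloads coefficient varies across bootstrap resamples with and without reviews in the model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 200

# Made-up data: reviews are roughly a fixed fraction of downloads, plus noise.
downloads = rng.uniform(1, 100, size=n)            # in thousands (hypothetical)
reviews = 0.05 * downloads + rng.normal(0, 0.2, size=n)
price = 10 + 0.3 * downloads + rng.normal(0, 2, size=n)

# Correlation between the two predictors (not with the price).
r = np.corrcoef(downloads, reviews)[0, 1]
print(f"correlation between downloads and reviews: {r:.3f}")

def downloads_coef_spread(predictors, target, n_resamples=200):
    """Std. dev. of the fitted downloads coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(target), size=len(target))
        X = np.column_stack([np.ones(len(idx))] + [p[idx] for p in predictors])
        beta, *_ = np.linalg.lstsq(X, target[idx], rcond=None)
        coefs.append(beta[1])  # beta[0] is the intercept, beta[1] is downloads
    return float(np.std(coefs))

spread_alone = downloads_coef_spread([downloads], price)
spread_with_reviews = downloads_coef_spread([downloads, reviews], price)
print(f"spread of downloads coefficient, downloads only: {spread_alone:.4f}")
print(f"spread of downloads coefficient, with reviews:   {spread_with_reviews:.4f}")
```

With the near-duplicate predictor included, the downloads coefficient should jump around noticeably more from resample to resample, which is one way the "two versions of the same variable" problem shows up in a regression.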

score 1 · Answer 2 · answered Oct 11 '20 at 09:54

In machine learning, the output (response or predicted) variable of a model depends on the input (predictor or explanatory) variables with some degree of linearity (positive or negative).

But in some datasets or models with multiple input variables (X1, X2, X3, X4 and X5, for example), a linear relationship also exists among the input variables themselves. Say X1 is correlated with X2, and X1 is correlated with X3 as well. Then X1, X2 and X3 are correlated with each other in this case, and we say multicollinearity exists in this model. Note that multicollinearity describes the correlation between one input variable and another input variable (not with the output variable).
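
One simple way to look for this is a correlation matrix of the inputs only. The sketch below uses synthetic data in which X2 and X3 are deliberately built from X1 while X4 and X5 are independent; the coefficients, noise levels, and the 0.8 cut-off are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n = 500

# Hypothetical inputs: X2 and X3 are built from X1, X4 and X5 are independent.
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.3, size=n)    # positive relationship with X1
x3 = -1.5 * x1 + rng.normal(scale=0.3, size=n)   # negative relationship with X1
x4 = rng.normal(size=n)
x5 = rng.normal(size=n)

X = np.column_stack([x1, x2, x3, x4, x5])
corr = np.corrcoef(X, rowvar=False)  # 5x5 matrix of input-vs-input correlations

names = ["X1", "X2", "X3", "X4", "X5"]
# List the pairs whose absolute correlation exceeds the (arbitrary) threshold.
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > 0.8
]
print(high_pairs)
```

Only the pairs among X1, X2 and X3 should be reported. The output variable never appears in this check, which matches the point that multicollinearity is about the inputs alone.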

Let's take a housing price prediction model as a concrete example. Consider the input variables square-feet size, number of bedrooms, and UDS (undivided share in sq. feet) in our dataset, with housing price as the output (predicted) variable.

All 3 input variables are correlated with each other. How?

If the number of bedrooms increases, the size of the house increases as well. If the size of the house increases, the UDS increases too. So multicollinearity exists, and we should resolve it before the model is trained.
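
A common diagnostic for this situation (not mentioned above, so treat it as one possible approach) is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one input on all the other inputs. The sketch below computes it by hand with NumPy on made-up housing data; the variable names `sqft`, `bedrooms` and `uds` and every number used to generate them are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable
n = 300

# Made-up housing data: bedrooms and UDS both grow with the size of the house.
sqft = rng.uniform(500, 3000, size=n)
bedrooms = np.round(sqft / 600 + rng.normal(0, 0.4, size=n))
uds = 0.4 * sqft + rng.normal(0, 40, size=n)

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    r2 = 1 - residuals.var() / y.var()
    return 1.0 / (1.0 - r2)

X = np.column_stack([sqft, bedrooms, uds])
vifs = {name: vif(X, j) for j, name in enumerate(["sqft", "bedrooms", "uds"])}
for name, value in vifs.items():
    print(f"VIF({name}) = {value:.1f}")
```

A VIF of 1 means an input is not linearly explained by the others at all; a commonly quoted rule of thumb treats values above roughly 5 to 10 as a warning sign. One simple way to resolve the issue is to keep only one variable from the correlated group (for example, just the size in square feet) and check the VIFs again.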
