I was going through the documentation to understand the coefficient of determination, and from it I understood that the coefficient of determination is simply R x R (the correlation coefficient squared).

So I took the housing price dataset from kaggle.com to try it out for a better understanding. This is my code.

First, I computed the correlation coefficient:

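import pandas as pd

# Load the Kaggle housing data and keep the two columns of interest
test_data = pd.read_csv(r'\house_price\test.csv')
_d = test_data.loc[:, ['MSSubClass', 'LotFrontage']].copy()
_d = _d.fillna(0)  # replace missing values with 0
_d.corr()  # Pearson correlation matrix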

[Image: output of _d.corr(); the MSSubClass/LotFrontage correlation shown is -0.246031]

Now, computing the coefficient of determination like this:

from sklearn.metrics import r2_score
r2_score(_d['MSSubClass'],_d['LotFrontage'])

for which I got the value -0.9413195412943647.

Ideally, shouldn't it be 0.060531252961, since -0.246031 x -0.246031 = 0.060531252961?

Lijin Durairaj

2 Answers

What you are referring to as the "documentation" is just a blog post describing one of the many variations of R2. I recommend reading the official scikit-learn documentation to understand how it is implemented in r2_score.

In short, a value of 0 means that the model performs no differently from a model that simply predicts the expected value (i.e. the mean) of the target variable. A value of 1, on the other hand, means that the model is perfect, with no errors in its predictions. However, and this is the main difference from what the blog post you provided states, r2_score also allows negative values, because a model can perform arbitrarily worse than simply predicting the expected value of the target variable.

And this is what the r2_score of scikit-learn is telling you in your case: the "model" you evaluated (using LotFrontage directly as the prediction for MSSubClass) is worse, i.e. produces a higher squared error, than just predicting the mean of MSSubClass.
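The three cases described above (0, 1, and a negative score) can be sketched in plain Python. This is only an illustration: the `r2` helper is hypothetical and follows the 1 - SS_res / SS_tot definition from the scikit-learn docs rather than calling the library, and the data is made up.

```python
def r2(y_true, y_pred):
    """Hypothetical helper: coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]      # made-up target values
mean_pred = [2.5] * 4              # always predict the mean of y_true
perfect_pred = list(y_true)        # predict every value exactly
bad_pred = [4.0, 3.0, 2.0, 1.0]    # predictions in reversed order

r2_mean = r2(y_true, mean_pred)        # 0.0
r2_perfect = r2(y_true, perfect_pred)  # 1.0
r2_bad = r2(y_true, bad_pred)          # -3.0
```

Note that `bad_pred` is perfectly (negatively) correlated with `y_true`, so the squared correlation is 1 while R2 is -3. The two quantities only coincide when the predictions come from a least-squares linear fit.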

afsharov

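Following the docs: https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score

the r2_score is defined as:

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

whereas the df.corr method (with Pearson correlation) computes:

$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n \sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n \sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$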

So let's build an example:

x   y
1   1
1   0
0   0
1   1

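Correlation: r = (4*(1+0+0+1) - 3*2) / sqrt((4*3 - 3^2) * (4*2 - 2^2)) = (8 - 6) / sqrt(3 * 4) = 2 / sqrt(12) ≈ 0.577, so r^2 ≈ 0.333.

Whereas R2 (taking x as the true values and y as the predictions) is: 1 - ((0)^2 + (1)^2 + (0)^2 + (0)^2) / ((1 - 0.75)^2 + (1 - 0.75)^2 + (0 - 0.75)^2 + (1 - 0.75)^2) = 1 - 1/0.75 ≈ -0.333.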
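The arithmetic for this small example can be checked with a short plain-Python sketch. The `r2` helper is hypothetical and simply mirrors the 1 - SS_res / SS_tot definition; the last step, fitting a least-squares line before scoring, is an addition for illustration and shows the one situation in which R2 and the squared correlation agree.

```python
from math import sqrt

x = [1, 1, 0, 1]
y = [1, 0, 0, 1]
n = len(x)

# Pearson correlation via the computational formula
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

def r2(y_true, y_pred):
    """Hypothetical helper: coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# R2 with x as the true values and y used directly as the predictions
r2_raw = r2(x, y)

# Least-squares line x_hat = a + b * y, then R2 of those fitted predictions
b = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)
a = (sum_x - b * sum_y) / n
x_hat = [a + b * v for v in y]
r2_fit = r2(x, x_hat)

print(round(r, 3), round(r ** 2, 3), round(r2_raw, 3), round(r2_fit, 3))
# prints: 0.577 0.333 -0.333 0.333
```

Scoring y directly against x gives a negative R2, while scoring the fitted line gives exactly r squared. That matches the question: the "R x R" rule describes a fitted linear regression, not two raw columns passed to r2_score.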

Hope that helps

PV8

  • So the definition of the coefficient of determination according to this document is wrong, is that what you mean? https://blog.uwgb.edu/bansalg/statistics-data-analytics/linear-regression/what-is-the-difference-between-coefficient-of-determination-and-coefficient-of-correlation/ – Lijin Durairaj May 29 '20 at 14:23
  • I think there are two different topics in your link: one for R2 and the other for the coefficient of correlation – PV8 Jun 03 '20 at 08:55