I am rather new to both R and Python and tried to compare calculation results related to the residuals of a regression analysis. I want to know what went "wrong", or how to interpret the differences (i.e. whether or not they are expected). Note that the same data is used for both R and Python.
R Code
# data
y <- c(3.099999905, 3.24000001, 3, 6, 5.300000191,
8.75, 11.25, 5, 3.599999905, 18.18000031)
x <- c(11, 12, 11, 8, 12, 16, 18, 12, 12, 17)
df <- data.frame(wage = y, educ = x)
# OLS
mod <- lm(wage ~ educ, data=df)
summary(mod)
# residuals
u.hat <- resid(mod)
mean(u.hat)
var(u.hat)
sd(u.hat)
cor(df$educ, u.hat)
Python Code
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# data
y = pd.Series([3.099999905, 3.24000001, 3, 6, 5.300000191,
8.75, 11.25, 5, 3.599999905, 18.18000031])
x = pd.Series([11, 12, 11, 8, 12, 16, 18, 12, 12, 17])
df = pd.DataFrame({'wage': y, 'educ': x})
# OLS
mod = smf.ols(formula='wage ~ educ', data=df)
results = mod.fit()
print(results.summary())  # print() needed when run as a script
# residuals
uHat = results.resid.values
np.mean(uHat)
np.var(uHat)
np.std(uHat)
np.corrcoef(df['educ'].values, uHat)[0, 1]
Results
- The regression results are the same for both (not reported here).
- The calculation results for the residuals are summarized in the table below.
Problems?
The values of the mean and the correlation coefficient differ between the two, but both are sufficiently close to zero. Does that mean the differences can be ignored as "expected"?
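To make the scale of "close to zero" concrete, here is a quick numpy-only check on the same data (using np.polyfit as a stand-in for the statsmodels fit): the residual mean is compared against machine epsilon, which is the precision floor for double-precision arithmetic.

```python
import numpy as np

# Same data as above, fit with a plain least-squares line.
y = np.array([3.099999905, 3.24000001, 3, 6, 5.300000191,
              8.75, 11.25, 5, 3.599999905, 18.18000031])
x = np.array([11, 12, 11, 8, 12, 16, 18, 12, 12, 17], dtype=float)
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# OLS guarantees the residual mean is exactly zero in exact arithmetic;
# in floating point it is only zero up to roundoff.
print(np.mean(resid))        # tiny, but generally not exactly 0.0
print(np.finfo(float).eps)   # ~2.22e-16, the relevant comparison scale
```

So a residual mean on the order of 1e-15 or 1e-16 is numerically indistinguishable from zero, and small differences between R and Python at that magnitude are just roundoff.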
The values of the variance and the standard deviation, however, are clearly different. How should I think about these? This does not look like an expected result. What am I missing?
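While preparing this question I noticed that numpy's defaults might be relevant here, though I am not sure it is the whole story: np.var and np.std divide by n by default (ddof=0), whereas R's var() and sd() divide by n-1. A toy check on a made-up vector (not my data):

```python
import numpy as np

# Illustrative vector only: mean is 2.5, squared deviations sum to 5.
u = np.array([1.0, 2.0, 3.0, 4.0])
print(np.var(u))          # 1.25       (divides by n = 4)
print(np.var(u, ddof=1))  # 1.6666...  (divides by n-1 = 3, like R's var())
```

If that is indeed the cause, passing ddof=1 to np.var and np.std should make the Python results match R's.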
Thanks for your help in advance.