I am looking at data from this Kaggle competition. I focus on these two columns:

  • LotArea
  • LotFrontage

LotFrontage has missing values, whereas LotArea has none. The two variables are strongly correlated, so I thought I would fit a linear regression model and use it to impute the missing values of LotFrontage. Here is my attempt (I am an R newbie):

ggplot(OriginalData, aes(x = LotArea, y = LotFrontage)) + geom_point()

fit <- lm(LotFrontage ~ LotArea, OriginalData)
tidy(fit)

Slope <- coef(fit)[term = 'LotArea']
Intercept <- coef(fit)[term = '(Intercept)']

OriginalData$LotFrontage[is.na(OriginalData$LotFrontage)] <- Intercept + (Slope * OriginalData$LotArea)

sum(is.na(OriginalData$LotFrontage))
ggplot(OriginalData, aes(x = LotArea, y = LotFrontage)) + geom_point()

I think there is something not quite right. Just wondering, how could I draw a simple line in the scatter plot using the fitted slope and intercept please? Thanks!

cs0815

1 Answer

First, there is a mistake in how you impute the missing values.

Data$Y[is.na(Data$Y)] <- Intercept + (Slope * Data$X)

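The two sides of the <- symbol have different lengths: the left side selects only the rows where Y is missing, while the right side holds a fitted value for every row. R issues a warning and fills the missing entries with the first fitted values in the data, not with the values that belong to those rows.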

You should revise it to:

Data$Y[is.na(Data$Y)] <- (Intercept + (Slope * Data$X))[is.na(Data$Y)]
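To see why the alignment matters, here is a minimal sketch on made-up data. The data frame toy and its values are hypothetical, not taken from the Kaggle set, and predict() is shown only as an assumed alternative way to get the fitted values for the missing rows.

```r
# Hypothetical toy data: Y equals 2 * X + 1 wherever it is observed
toy <- data.frame(X = c(1, 2, 3, 4, 5, 6),
                  Y = c(3, NA, 7, 9, NA, 13))

fit <- lm(Y ~ X, data = toy)            # lm() drops rows with NA by default
Intercept <- coef(fit)[["(Intercept)"]]
Slope     <- coef(fit)[["X"]]

missing <- is.na(toy$Y)                 # TRUE for rows 2 and 5

# Unaligned: 6 fitted values for 2 slots. R warns and uses the first two
# fitted values (those of rows 1 and 2). The warning is silenced here only
# to keep the output clean.
wrong <- toy$Y
suppressWarnings(wrong[missing] <- Intercept + Slope * toy$X)

# Aligned: subset the fitted values with the same logical index
right <- toy$Y
right[missing] <- (Intercept + Slope * toy$X)[missing]

# Alternative: let predict() compute fitted values for the missing rows only
via_predict <- toy$Y
via_predict[missing] <- predict(fit, newdata = toy[missing, , drop = FALSE])

print(cbind(toy, wrong, right, via_predict))
```

In the unaligned version the missing rows receive the fitted values of the first rows of the data instead of their own, so the result is wrong and not merely less efficient.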

To add a simple regression line, you can use:

  • (1) geom_abline()

+ geom_abline(slope = Slope, intercept = Intercept)

This requires that you already have the slope and intercept.

Also, geom_abline() can only draw a straight line, so it suits simple linear regression.

  • (2) geom_smooth()

+ geom_smooth(method = "lm")

It fits the data with a smoothing method, e.g. lm, glm, gam, loess or MASS::rlm. See the help page (?geom_smooth) for details.
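Both options can be sketched on made-up data as follows. The data frame toy is hypothetical, and the ggplot2 package is assumed to be installed. The formula and se arguments of geom_smooth() are optional and are set here only to make the fit explicit and hide the confidence band.

```r
library(ggplot2)

# Hypothetical toy data with a roughly linear relationship
set.seed(1)
toy <- data.frame(X = 1:50)
toy$Y <- 1 + 2 * toy$X + rnorm(50, sd = 5)

fit <- lm(Y ~ X, data = toy)
Intercept <- coef(fit)[["(Intercept)"]]
Slope     <- coef(fit)[["X"]]

base_plot <- ggplot(toy, aes(x = X, y = Y)) + geom_point()

# (1) draw the line from the coefficients that were already fitted
p_abline <- base_plot + geom_abline(slope = Slope, intercept = Intercept)

# (2) let ggplot2 fit and draw the regression line itself
p_smooth <- base_plot + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p_abline)
print(p_smooth)
```

With method = "lm" and the default formula, both plots should show the same least-squares line, because geom_smooth() refits the same model on the plotted data. The geom_abline() line extends across the whole panel, while the geom_smooth() line covers only the range of the data.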

Darren Tsai
  • Thanks, I found geom_abline as well. Does the calculation of the 'predicted/imputed' value really make a difference? It would just be applied to the missing values anyway. I appreciate that your calculation is more efficient ... – cs0815 Oct 07 '18 at 16:02
  • There are many imputation algorithms for handling missing values, and none is perfect. Regression imputation is one option, and it has its shortcomings too. In R, some packages specialize in missing values, such as `mice`. – Darren Tsai Oct 07 '18 at 18:08