I am looking at data from this Kaggle competition, focusing on these two columns:
- LotArea
- LotFrontage
Here LotFrontage has missing values, whereas LotArea does not. The two variables are strongly correlated, so I thought I would fit a linear regression model and use it to impute the missing values of LotFrontage. Here is my attempt (I am an R newbie):
library(ggplot2)  # ggplot(), geom_point()
library(broom)    # tidy()
ggplot(OriginalData, aes(x = LotArea, y = LotFrontage)) + geom_point()
fit <- lm(LotFrontage ~ LotArea, OriginalData)
tidy(fit)
Slope <- coef(fit)['LotArea']
Intercept <- coef(fit)['(Intercept)']
OriginalData$LotFrontage[is.na(OriginalData$LotFrontage)] <- Intercept + (Slope * OriginalData$LotArea)
sum(is.na(OriginalData$LotFrontage))
ggplot(OriginalData, aes(x = LotArea, y = LotFrontage)) + geom_point()
I think something is not quite right here. Also, how could I draw a straight line on the scatter plot using the fitted slope and intercept? Thanks!
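
For reference, here is a minimal sketch of one possible approach, run on made-up data rather than the Kaggle file. The `toy` data frame, its values, and the row positions set to NA are all invented for illustration. The idea is to compute predictions only for the rows where LotFrontage is missing, so the replacement vector has the same length as the gap it fills, and then to draw the fitted line with `geom_abline()`.

```r
library(ggplot2)

# Made-up data for illustration only (NOT the Kaggle data).
set.seed(1)
toy <- data.frame(LotArea = runif(50, 5000, 15000))
toy$LotFrontage <- 20 + 0.005 * toy$LotArea + rnorm(50, sd = 5)
toy$LotFrontage[c(3, 10, 27)] <- NA  # pretend some values are missing

# lm() drops rows with a missing response by default, so the fit
# only uses the complete rows.
fit <- lm(LotFrontage ~ LotArea, data = toy)
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["LotArea"]]

# Impute only the missing rows, using LotArea from those same rows.
missing_rows <- is.na(toy$LotFrontage)
toy$LotFrontage[missing_rows] <- intercept + slope * toy$LotArea[missing_rows]

# Scatter plot with the fitted line drawn from the slope and intercept.
p <- ggplot(toy, aes(x = LotArea, y = LotFrontage)) +
  geom_point() +
  geom_abline(intercept = intercept, slope = slope, colour = "red")
print(p)
```

In this sketch the imputed points fall exactly on the red line, since they are computed from the same intercept and slope. Using `predict(fit, newdata = toy[missing_rows, ])` in place of the hand-written formula should give the same values.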