I am trying to fit ridge and lasso regressions, as well as a randomForest model, to the total replacement cost from a CSV file.

This is what I did:

data$TOTAL_REPLACEMENT_VALUE=log(data$TOTAL_REPLACEMENT_VALUE) 
n_total=nrow(data) 
n_train=round(n_total*0.7)
training_data=data[1:n_train,]
test_data=data[(n_train+1):n_total,]
X_train_cost_model=model.matrix(TOTAL_REPLACEMENT_VALUE~TYPE,data=training_data) 
X_test_cost_model=model.matrix(TOTAL_REPLACEMENT_VALUE~TYPE,data=test_data) 
Y_train_cost=training_data[,"TOTAL_REPLACEMENT_VALUE"] 
Y_test_cost=test_data[,"TOTAL_REPLACEMENT_VALUE"]
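
One thing worth checking in a split like this is operator precedence: in R, `:` is evaluated before `+`, so `n_train+1:n_total` and `(n_train+1):n_total` select different rows. The following is a minimal sketch on a toy data frame (the `toy` data and its values are made up for illustration, not taken from the CSV in the question):

```r
# Toy data frame standing in for the real CSV (hypothetical values).
toy <- data.frame(id = 1:10, value = seq(10, 100, by = 10))
n_total <- nrow(toy)
n_train <- round(n_total * 0.7)

# `:` binds tighter than `+`, so this is n_train + (1:n_total), i.e. 8:17.
wrong_idx <- n_train + 1:n_total
# Parentheses give the intended range 8:10.
right_idx <- (n_train + 1):n_total

# Row indices past nrow(toy) do not exist and come back as all-NA rows.
wrong_test <- toy[wrong_idx, ]
right_test <- toy[right_idx, ]

nrow(wrong_test)              # 10 rows, 7 of them filled with NA
sum(is.na(wrong_test$value))  # 7
nrow(right_test)              # 3
```

The unparenthesised form pads the test set with NA rows, which is an easy way to end up with missing values that are not in the CSV at all.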

I then ran the ridge and lasso regressions like this:

install.packages("glmnet",dependencies = TRUE)
library(glmnet) 
ridge_replacement_cost_model=cv.glmnet(X_train_cost_model,Y_train_cost,alpha=0,type.measure = "mse")
ridge_pred_replacement_cost=predict(ridge_replacement_cost_model,newx = X_test_cost_model,exact=TRUE,s="lambda.min")  
lasso_replacement_cost_model=cv.glmnet(X_train_cost_model,Y_train_cost,alpha=1,type.measure = "mse")
lasso_pred_replacement_cost=predict(lasso_replacement_cost_model,newx = X_test_cost_model,exact=TRUE,s="lambda.min") 

install.packages("randomForest")
library(randomForest)
rf_total_replacement_cost_model=randomForest(TOTAL_REPLACEMENT_VALUE~TYPE,data=training_data,importance=TRUE)
rf_pred_replacement_cost=predict(rf_total_replacement_cost_model,test_data,type="class") 

However, I encountered these errors:

Error in glmnet(x, y, weights = weights, offset = offset, lambda = lambda,  : 
  number of observations in y (590) not equal to the number of rows of x (589)

Error in na.fail.default(list(TOTAL_REPLACEMENT_VALUE = c(18.126980599175,  : 
  missing values in object

The first error occurred when running the ridge and lasso regressions, while the second occurred when running the randomForest model. I know there is a thread on similar issues, but I do not understand what went wrong. Any help is really appreciated.
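
One plausible reading of the 590 versus 589 mismatch is that `model.matrix()` silently drops incomplete rows while indexing the response column keeps them. This is a minimal sketch of that behaviour, assuming the default `na.action` option (`na.omit`); the `train` data frame and its values are hypothetical:

```r
# Hypothetical training data: one TYPE value is missing.
train <- data.frame(
  TOTAL_REPLACEMENT_VALUE = c(12.1, 13.4, 11.8, 14.2, 12.9),
  TYPE = factor(c("A", "B", NA, "A", "B"))
)

# model.matrix() builds a model frame using the default na.action (na.omit),
# so the row with the missing value is dropped from X ...
X <- model.matrix(TOTAL_REPLACEMENT_VALUE ~ TYPE, data = train)
# ... while plain column indexing keeps every row in y.
y <- train[, "TOTAL_REPLACEMENT_VALUE"]

nrow(X)    # 4
length(y)  # 5

# Keeping only complete rows before building X and y makes them line up.
train_cc <- train[complete.cases(train), ]
X_cc <- model.matrix(TOTAL_REPLACEMENT_VALUE ~ TYPE, data = train_cc)
y_cc <- train_cc[, "TOTAL_REPLACEMENT_VALUE"]

nrow(X_cc) == length(y_cc)  # TRUE
```

Filtering with `complete.cases()` is only one option; imputing the missing cells before the split would also keep the two objects the same length.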

  • You have missing values in your data. This is causing problems. Remove the rows with missing values before running the model. – Gregor Thomas Jul 03 '18 at 16:14
  • Hi Gregor, I actually converted the missing values to the mean replacement cost, so there are no blank cells in my data set. – Justin Messi Jul 03 '18 at 16:26
  • For all the code you do show, you don't show any code that does anything with missing values - replacing with the mean or anything else. And the error message is very clear: `"missing values in object"`. So it really does seem like there are missing values in your data. Btw, in R missing values are usually coded as `NA`, so you may be right that your data doesn't have any "empty blanks", but the error message is telling you that there are missing values. – Gregor Thomas Jul 03 '18 at 16:52
  • The first error message says that y has 590 observations and x has 589, making it seem like the problem is a missing value in `x`. – Gregor Thomas Jul 03 '18 at 16:54
  • Oh, I have checked my CSV file and found that I missed an empty cell that I was supposed to perform an arithmetic operation on. Thanks for your help Gregor! – Justin Messi Jul 04 '18 at 01:12
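
A quick way to catch a stray empty cell like the one described in the comments is to count `NA` values per column before fitting anything, since `read.csv()` reads blank numeric fields as `NA`. The sketch below uses a made-up `data_toy` data frame and shows mean imputation only as an illustration of the approach mentioned in the thread, not as a recommendation:

```r
# Hypothetical data with one missing cost, as an empty CSV cell would produce.
data_toy <- data.frame(
  TOTAL_REPLACEMENT_VALUE = c(100, 250, NA, 400),
  TYPE = c("A", "B", "A", "B")
)

# Count missing values per column; any non-zero entry needs attention.
na_counts <- colSums(is.na(data_toy))
na_counts

# Replace missing costs with the mean of the observed costs.
col_mean <- mean(data_toy$TOTAL_REPLACEMENT_VALUE, na.rm = TRUE)
is_missing <- is.na(data_toy$TOTAL_REPLACEMENT_VALUE)
data_toy$TOTAL_REPLACEMENT_VALUE[is_missing] <- col_mean

# Confirm nothing is missing before building the model matrices.
anyNA(data_toy)  # FALSE
```

Running the `colSums(is.na(...))` check on both the training and test sets, after the split, would also reveal NA rows introduced by the split itself.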
