
I have a data frame that contains date variables. (The test data and code are available here.)

However, when I use `functions = caretFuncs` in `rfeControl`, `rfe` fails with this error message:

    Error in { : task 1 failed - "missing value where TRUE/FALSE needed"
In addition: Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion
2: In FUN(newX[, i], ...) : NAs introduced by coercion
3: In FUN(newX[, i], ...) : NAs introduced by coercion
4: In FUN(newX[, i], ...) : NAs introduced by coercion
5: In FUN(newX[, i], ...) : NAs introduced by coercion
6: In FUN(newX[, i], ...) : NAs introduced by coercion
7: In FUN(newX[, i], ...) : NAs introduced by coercion
8: In FUN(newX[, i], ...) : NAs introduced by coercion
9: In FUN(newX[, i], ...) : NAs introduced by coercion
10: In FUN(newX[, i], ...) : NAs introduced by coercion

What can I do?


Code to reproduce the error:

library(mlbench)
library(caret)
library(maps)
library(rgdal)
library(raster)
library(sp)
library(spdep)
library(GWmodel)
library(e1071)
library(plyr)
library(kernlab)
library(zoo)

mydata <- read.csv("Realestatedata_all_delete_date.csv", header=TRUE)
mydata$estate_TransDate <- as.Date(paste(mydata$estate_TransDate,1,sep="-"),format="%Y-%m-%d")
mydata$estate_HouseDate <- as.Date(mydata$estate_HouseDate,format="%Y-%m-%d")

rfectrl <- rfeControl(functions=caretFuncs, method="cv",number=10,verbose=TRUE,returnResamp = "final")
results <- rfe(mydata[,1:48],mydata[,49],sizes = c(1:48),rfeControl=rfectrl,method = "svmRadial")

print(results)
predictors(results)
plot(results, type=c("g", "o"))
  • When I exclude columns 5 and 14 (`estate_TransDate` and `estate_HouseDate`), rfe takes much longer instead of returning relatively fast with an error message. If you type `warnings()` after running your code, you'll see lots of `In FUN(newX[, i], ...) : NAs introduced by coercion`. I guess it's trying to convert the date objects to numerical values, which then produces NAs. It's probably better to 'normalize' these date fields, e.g. by replacing them with the number of days or years since a reference date, such as 1973-01-01 for `estate_TransDate` and 1900-01-01 for `estate_HouseDate`, or a single reference date for both – Andre Holzner Nov 13 '15 at 14:41
  • Please send the results of `sessionInfo()` – topepo Nov 13 '15 at 16:07
  • > sessionInfo() R version 3.2.0 (2015-04-16) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 8 x64 (build 9200) locale: [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 LC_CTYPE=Chinese (Traditional)_Taiwan.950 [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C [5] LC_TIME=Chinese (Traditional)_Taiwan.950 attached base packages: [1] stats graphics grDevices utils datasets methods base – Chia-Hsien Lee Nov 17 '15 at 08:45

1 Answer


You have NAs (missing values) in mydata in the following input variables (which you feed to the model):

colnames(mydata)[unique(which(is.na(mydata[,1:48]), arr.ind = TRUE)[,2])]

gives:

 [1] "Aport_Distance"       "Univ_Distance"        "ParkR_Distance"
 [4] "TRA_StationDistance"  "THSR_StationDistance" "Schools_Distance"
 [7] "Lib_Distance"         "Sport_Distance"       "ParkS_Distance"
[10] "Hyper_Distance"       "Shop_Distance"        "Post_Distance"
[13] "Hosp_Distance"        "Gas_Distance"         "Incin_Distance"
[16] "Mort_Distance" 
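To see how many values are missing in each of those columns, a per-column count can help. This is only a sketch on a made-up toy data frame (`toy` and its values are not from the question's data); the same two lines should work on `mydata[,1:48]`:

```r
# Toy data frame standing in for mydata (values are made up)
toy <- data.frame(
  price         = c(100, 200, 300, 400),
  Univ_Distance = c(1.2, NA, 0.7, NA),
  Gas_Distance  = c(NA, 3.1, 2.2, 1.0)
)

# Number of missing values per column, keeping only columns that have any
na_counts <- colSums(is.na(toy))
na_counts <- na_counts[na_counts > 0]
print(na_counts)
```

Columns with only a handful of NAs cost few rows when filtered; columns that are mostly NA may be better dropped as predictors.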

In addition, your date variables (transaction date and house date) appear to be converted to NAs inside `rfe()`.

The SVM regressor does not seem to be able to handle NAs as is.
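I have not traced where exactly this happens inside caret, so take the following as one plausible mechanism rather than a confirmed diagnosis. The warning `In FUN(newX[, i], ...)` has the form that `apply` produces, and applying `as.numeric` column by column to a data frame with a Date column does reproduce it. The toy data frame below is made up for illustration:

```r
# Toy data frame with one numeric and one Date column (made-up values)
toy <- data.frame(
  area      = c(50, 80),
  trans_day = as.Date(c("2015-01-01", "2015-06-01"))
)

# as.matrix() on mixed numeric/Date columns yields a character matrix
m <- as.matrix(toy)
print(typeof(m))

# apply() works on that character matrix, so the date strings cannot be
# parsed as numbers and become NA ("NAs introduced by coercion")
num <- suppressWarnings(apply(toy, 2, as.numeric))
print(num)
```

The numeric column survives the round trip, while every value in the date column turns into NA.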

I would convert the dates to something like 'years since a given reference':

mydata$estate_TransAge <- as.numeric(as.Date("2015-11-01") - mydata$estate_TransDate) / 365.25
mydata$estate_HouseAge <- as.numeric(as.Date("2015-11-01") - mydata$estate_HouseDate) / 365.25

# define the set of input variables, excluding the raw dates and the outcome
inputVars <- setdiff(colnames(mydata),
                     c("estate_TransDate", "estate_HouseDate", "estate_TotalPrice"))
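As a small check of the date conversion, here is the same idea on a few made-up dates, with the units of the difference stated explicitly. The reference date is the one used above; `dates` is hypothetical:

```r
# Reference date as in the conversion above; the dates themselves are made up
ref   <- as.Date("2015-11-01")
dates <- as.Date(c("2014-11-01", "2005-11-01", NA))

# Difference in days, then expressed in (approximate) years
age_years <- as.numeric(difftime(ref, dates, units = "days")) / 365.25
print(round(age_years, 2))
```

A missing date stays NA after the conversion, so rows with unknown dates still need to be filtered out or filled in before training.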

Also remove all rows that have an NA in any of the columns you use as input to the regressor:

traindata <- mydata[complete.cases(mydata[,inputVars]),]
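To illustrate what this filter does and does not do, here is a sketch on a made-up data frame (column names are borrowed from the question, the values are invented). Note that only the input columns are checked, so a row with a missing outcome would still pass:

```r
# Made-up data: row 2 has a missing input, row 4 has a missing outcome
toy <- data.frame(
  estate_TotalPrice = c(100, 200, 300, NA),
  Univ_Distance     = c(1.2, NA, 0.7, 0.4),
  Gas_Distance      = c(2.0, 3.1, 2.2, 1.0)
)
inputVars <- c("Univ_Distance", "Gas_Distance")

# TRUE for rows where none of the input columns is NA
keep <- complete.cases(toy[, inputVars])
traindata <- toy[keep, ]
cat("kept", nrow(traindata), "of", nrow(toy), "rows\n")
```

If the outcome column can also contain NAs, it would have to be included in the `complete.cases()` check as well.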

Then run `rfe` with:

rfectrl <- rfeControl(functions=caretFuncs, method="cv",number=10,verbose=TRUE,returnResamp = "final")
results <- rfe(
               traindata[,inputVars], 
               traindata[,"estate_TotalPrice"],
               rfeControl=rfectrl,
               method = "svmRadial"
              )

In my case, this would have taken a long time to complete, so I tested it on only one percent of the data, using `sample_frac` from the dplyr package:

traindata <- dplyr::sample_frac(traindata, 0.01)
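If dplyr is not installed, a one percent subsample can also be drawn with base R. This is a sketch on a made-up data frame; the seed value is arbitrary and only there to make the subsample repeatable:

```r
set.seed(42)  # arbitrary seed, only so the subsample is repeatable

# Made-up data standing in for traindata
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# Draw 1% of the rows (at least one) without replacement
n_keep    <- max(1, round(0.01 * nrow(toy)))
subsample <- toy[sample(nrow(toy), n_keep), ]
print(nrow(subsample))
```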

The question remains what to do if you are given data to predict the price for, where some of the input variables are NA.
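One common option, which I have not tried on this data set, is to fill missing inputs with a value learned from the training data, for example the column median. The sketch below uses made-up data, and `impute_with` is a hypothetical helper written for this illustration, not a caret function:

```r
# Made-up training data and new data to predict on
train   <- data.frame(Univ_Distance = c(1, 2, 3, NA),
                      Gas_Distance  = c(4, 6, 8, 10))
newdata <- data.frame(Univ_Distance = c(NA, 5),
                      Gas_Distance  = c(7, NA))

# Medians are computed from the training data only
train_medians <- vapply(train, median, numeric(1), na.rm = TRUE)

# Hypothetical helper: replace NAs in each column with the given fill value
impute_with <- function(df, fill) {
  for (col in names(fill)) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- fill[[col]]
  }
  df
}

imputed <- impute_with(newdata, train_medians)
print(imputed)
```

caret's `preProcess` also offers imputation methods such as `"medianImpute"`, which may be more convenient. Whether imputation is appropriate for these distance variables depends on why the values are missing.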

Andre Holzner