
I'm fairly new to data mining algorithms in R, and I need to develop a script that helps me predict an event. I've chosen a decision tree model for this task.

My dataset has this structure:

    ATTR1 | ATTR2 | ATTR3 | CLASS
    ------+-------+-------+------
      Y   |   N   |   N   |   N

and this is the script that I've created:

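    library(party)
    myFormula <- CLASS ~ ATTR1 + ATTR2 + ATTR3

    # Randomly assign each row to the training set (about 70%) or the test set (about 30%)
    ind <- sample(2, nrow(myData), replace = TRUE, prob = c(0.7, 0.3))
    trainData <- myData[ind == 1, ]
    testData <- myData[ind == 2, ]

    # Fit the conditional inference tree and predict on the held-out rows
    energy_ctree <- ctree(myFormula, data = trainData)
    testpred <- predict(energy_ctree, newdata = testData)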

All of these commands work fine. My question is about predicting new rows of data.

I called predict(energy_ctree, newdata=newdataSet) on a new dataset that excludes the CLASS column (the value I want to obtain from the decision tree model's prediction).

This is the error message I get:

"Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data"

So, what are the steps to predict the CLASS column of my newDataSet based on the decision tree model that I created before?

Thanks in advance.

Carlos Lima

cmnlima
  • It means one of your variables is a factor, and when you split your data in two, one of the levels of that factor did not appear at all in one of the two sets (by chance). You'll need to split your data more carefully, to ensure that all levels appear at least once in both sets. – joran Dec 04 '13 at 20:06

4 Answers


If you have categorical data and some column values are present in your testing set (the new data) but not in the training set, R will complain. For example, if the attribute Attr1 in your training data contains only the levels "No" and "Yes", as shown below, the decision tree cannot be applied to a new data set in which Attr1 contains, say, "Maybe".

    Attr1 ......... ( training set)
     "No"
     "No"
     "No"
     "Yes"
     "Yes"


    Attr1: .......(testing set)
    "Yes"
    "No"
     .
     .
    "Maybe"   // R will complain about this value (it never saw it during training)

One possible solution is to specify the levels in advance. For the previous example, you can specify the levels of Attr1 before doing your training as follows:

    trainData$Attr1 <- factor(trainData$Attr1, levels = c("No", "Yes", "Maybe"))

By doing so, your training set does not have to contain the value "Maybe" for the attribute Attr1.
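The same idea can be applied in the other direction, so that the new data carries exactly the levels the model was trained on. The following is a minimal sketch using base R only; the two small data frames are made-up stand-ins for the question's `trainData` and `newDataSet`, and it assumes the class column is named `CLASS`.

```r
# Hypothetical stand-ins for the training data and the new, unlabeled data.
trainData <- data.frame(
  ATTR1 = factor(c("Y", "N", "Y", "N")),
  ATTR2 = factor(c("N", "N", "Y", "Y")),
  CLASS = factor(c("N", "N", "Y", "Y"))
)
newDataSet <- data.frame(
  ATTR1 = factor(c("Y", "Y")),   # only one level observed here
  ATTR2 = factor(c("N", "Y"))
)

# Re-declare each predictor factor in the new data with the training levels.
# Values that never occurred in training would become NA.
predictors <- setdiff(names(trainData), "CLASS")
for (col in predictors) {
  if (is.factor(trainData[[col]])) {
    newDataSet[[col]] <- factor(newDataSet[[col]],
                                levels = levels(trainData[[col]]))
  }
}

# Check that every predictor now has identical levels in both data frames.
levels_match <- all(vapply(
  predictors,
  function(col) identical(levels(newDataSet[[col]]), levels(trainData[[col]])),
  logical(1)
))
print(levels_match)
```

Once the levels agree, a call such as `predict(energy_ctree, newdata = newDataSet)` from the question should no longer fail on the level check, although that is not verified here.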

John
  • Thanks for the response. In my case, the training and testing sets have the same levels, and my new data (for which the class value is to be predicted) also has the same structure. I really don't know how to predict new cases or how I need to transform the dataset. – cmnlima Dec 05 '13 at 10:08

I encountered the same problem. What I did was write the final preprocessed data to a CSV file, read it back into a data frame, and then apply the model to that test data. It worked perfectly.

The reason: a few categorical values in the test data frame were still present in the factor's list of levels, with 0 rows, even after their rows had been removed (and those values don't occur in the training dataset).
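A possible alternative to the CSV round trip is base R's `droplevels()`, which removes unused factor levels directly. This is a small sketch on a made-up data frame, not the original poster's data:

```r
# Hypothetical test data with a value ("Maybe") that will be filtered out.
testData <- data.frame(ATTR1 = factor(c("Y", "N", "Maybe", "Y")))

# Removing the rows does not remove the level from the factor.
filtered <- testData[testData$ATTR1 != "Maybe", , drop = FALSE]
print(levels(filtered$ATTR1))   # "Maybe" is still listed, with 0 rows
print(table(filtered$ATTR1))

# droplevels() discards levels that no longer occur in the data.
cleaned <- droplevels(filtered)
print(levels(cleaned$ATTR1))
```

Writing to CSV and reading the file back has a similar effect because the factor is rebuilt from the values that actually remain.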

Shalini Baranwal

I just had this problem and here's how I solved it:

1 - Verify the factor variables; they must have the same levels.
2 - Check for variables that are numeric in one table and integer in the other.

After clearing up these issues, my script began to run smoothly.
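The second check can be sketched as follows in base R. The data frames and the `COUNT` column are invented for illustration; the idea is only to list shared columns whose class differs and then coerce the integer column to numeric.

```r
# Hypothetical data: COUNT is numeric (double) in one table, integer in the other.
trainData  <- data.frame(ATTR1 = factor(c("Y", "N")), COUNT = c(1.5, 2.0))
newDataSet <- data.frame(ATTR1 = factor(c("Y", "N")), COUNT = c(1L, 2L))

# Compare the first class of every column the two data frames share.
shared        <- intersect(names(trainData), names(newDataSet))
train_classes <- vapply(trainData[shared], function(x) class(x)[1], character(1))
new_classes   <- vapply(newDataSet[shared], function(x) class(x)[1], character(1))
mismatched    <- shared[train_classes != new_classes]
print(mismatched)

# Coerce integer columns to numeric where the training data uses numeric.
for (col in mismatched) {
  if (is.double(trainData[[col]]) && is.integer(newDataSet[[col]])) {
    newDataSet[[col]] <- as.numeric(newDataSet[[col]])
  }
}
```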

Diego Rodrigues

If you've changed the class of variables in the training data (for example, converting character columns to factors), you need to make the same changes in the test dataset. I made these changes and it ran smoothly thereafter.
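As a rough illustration, assuming both data frames start out with character columns, the conversion could be done once on the training data and then mirrored on the new data using the training levels. The data here is made up:

```r
# Hypothetical raw data in which the categorical columns are still character.
trainData  <- data.frame(ATTR1 = c("Y", "N", "Y"), CLASS = c("N", "N", "Y"),
                         stringsAsFactors = FALSE)
newDataSet <- data.frame(ATTR1 = c("N", "N"), stringsAsFactors = FALSE)

# Convert the character columns of the training data to factors.
char_cols <- names(trainData)[vapply(trainData, is.character, logical(1))]
trainData[char_cols] <- lapply(trainData[char_cols], factor)

# Mirror the conversion on the new data, reusing the training levels.
for (col in intersect(char_cols, names(newDataSet))) {
  newDataSet[[col]] <- factor(newDataSet[[col]],
                              levels = levels(trainData[[col]]))
}
print(str(newDataSet))
```

Converting the new data on its own with a plain `factor()` call would derive the levels from the new values alone, which is one way the level mismatch can arise.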

Praveen Kumar