
I found this code in a tutorial about multilabel classification with package mlr.

library("mlr")

# yeast example data shipped with mlr; the first 14 columns are the labels
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)

# binary relevance: one rpart classifier is fitted per label
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)

# train on the first 1500 observations
mod = train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))

# predict on a subset of the task and on new rows passed as newdata
pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])

I understand the structure of the yeast dataset, but I do not understand how to use this code when I have new data that I want to classify, because then I wouldn't have any TRUE or FALSE values for the labels. Actually, I would have some training data with the same structure as yeast, but in my new data the label columns 1:14 would be missing. Am I misunderstanding something? If not: how can I use the code correctly?
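What I imagine might work (a sketch, assuming that prediction on newdata only needs the feature columns, so the 14 label columns could simply be left out) is something like:

feature.cols = getTaskFeatureNames(yeast.task)   # all columns except the 14 labels
new.obs = yeast[1501:1600, feature.cols]         # pretend these rows are unlabelled
pred.new = predict(mod, newdata = new.obs)
getPredictionProbabilities(pred.new)             # per-label probabilities

but I am not sure whether that is the intended way.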

Edit:

Here's a sample of how I would use the code:

library("tm")

train.data = data.frame(
  "id" = c(1, 1, 2, 3, 4, 4),
  "text" = c("Monday is nice weather.", "Monday is nice weather.", "Dogs are cute.",
             "It is very rainy.", "My teacher is angry.", "My teacher is angry."),
  "label" = c("label1", "label2", "label3", "label1", "label4", "label5"))
test.data = data.frame(
  "id" = c(5, 6),
  "text" = c("Next Monday I will meet my teacher.", "Dogs do not like rain."))

train.data$text = as.character(train.data$text)
train.data$id = as.character(train.data$id)
train.data$label = as.character(train.data$label)
test.data$text = as.character(test.data$text)
test.data$id = as.character(test.data$id)

### Bring training data into structure
train.data$label = make.names(train.data$label)
labels = unique(train.data$label)

# DocumentTermMatrix for all texts
texts = unique(c(train.data$text, test.data$text))
docs <- Corpus(VectorSource(texts))
terms <- DocumentTermMatrix(docs)
m <- as.data.frame(as.matrix(terms))

# Logical columns for labels
test = data.frame("id" = train.data$id, "topic"=train.data$label)
test2 = as.data.frame(unclass(table(test)))
test2[,c(1:ncol(test2))] = as.logical(unlist(test2[,c(1:ncol(test2))]))
rownames(test2) = unique(test$id)

# Bind columns from dtm
termsDf = cbind(test2, m[1:nrow(test2),])
names(termsDf) = make.names(names(termsDf))

### Create Multilabel Task
classify.task = makeMultilabelTask(id = "multi", data = termsDf, target = labels)

### Now the model
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, classify.task)

### How can I predict for test.data?

So the problem is that I do not have any labels for test.data, because those are exactly what I would like to compute.
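One way I could imagine doing it (a sketch, assuming the rows of m are in the same order as texts and that predict only needs the feature columns):

names(m) = make.names(names(m))
test.rows = m[match(test.data$text, texts), ]    # dtm rows belonging to the test texts
pred.test = predict(mod, newdata = test.rows)
getPredictionProbabilities(pred.test)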

Edit2:

When I simply use

names(m) = make.names(names(m))
pred = predict(mod, newdata = m[(nrow(termsDf)+1):(nrow(termsDf)+nrow(test.data)),])

the result is the same for both texts, and really not what I would expect.
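If it helps with troubleshooting: as far as I know, the individual binary-relevance models can be unwrapped with getLearnerModel. With only four training documents, each rpart tree is probably just a root node, which would explain identical predictions for both texts:

per.label.models = getLearnerModel(mod, more.unwrap = TRUE)   # one rpart fit per label (assuming more.unwrap drops the wrapper layer)
per.label.models                                              # root-only trees return the same class frequencies for every row
getPredictionProbabilities(pred)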

WinterMensch
  • Could you provide a sample of the data set you would like to train on, since I am not sure I follow your problem? It will definitely help with troubleshooting, since the problem is not related to the built-in data set. – missuse Feb 12 '18 at 10:32
  • Of course, I edited my post. – WinterMensch Feb 12 '18 at 11:07
  • Hi, do you know what `makeMultilabelBinaryRelevanceWrapper` actually does? If not, maybe this paper helps: https://journal.r-project.org/archive/2017/RJ-2017-012/index.html – Giuseppe Feb 12 '18 at 12:04
  • Not really, I just took the code from the tutorial. But still, the idea of multilabel text classification is to match a given text to one or more predefined classes/labels (which are described through the model made of train.data)? – WinterMensch Feb 12 '18 at 12:53
  • @WinterMensch One problem I can observe here is that in the example you are training on 4 rows only, which is very low and you are not performing any tuning. The resulting model performs poorly by predicting the probability of the classes to be the same as the frequency of the classes. What happens when you provide 100s of rows for training? – missuse Feb 12 '18 at 13:36
  • In your case `makeMultilabelBinaryRelevanceWrapper` just fits a classification tree for each label separately. So you can simply check if you get the same results if you train a classification tree on each label separately. – Giuseppe Feb 13 '18 at 15:05
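A sketch of the check suggested in the last comment: fit a plain rpart tree on a single label with the same features (column names are assumed to have been cleaned with make.names, as in Edit2):

library(rpart)

one.label = labels[1]
feature.cols = setdiff(names(termsDf), labels)   # the dtm-derived columns
single.df = termsDf[, feature.cols]
single.df$y = factor(termsDf[[one.label]])       # the one label to predict

fit = rpart(y ~ ., data = single.df, method = "class")
predict(fit, newdata = m[match(test.data$text, texts), ], type = "prob")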

0 Answers