
I am trying to do multilabel text classification, following the tutorial available here: https://mlr-org.github.io/Multilabel-Classification-with-mlr/

I am getting this error: Error in checkLearnerBeforeTrain(task, learner, weights) : Task 'cottonseed.Class' is a one-class-problem, but learner 'classif.rpart' does not support that!

where cottonseed.Class is one of my class labels. I have 117 class labels in total, so I am not sure why I am getting a "one-class-problem" error.

My features/words (columns) and documents (rows) are derived from the Document-Term Matrix. The class labels are columns at the end of my data.frame with TRUE/FALSE values for each row (document).

Here is the code:

library(tm)
library(proxy)
library(RTextTools)
library(fpc)
library(wordcloud)
library(cluster)
library(stringi)
library(dplyr)
library(magrittr) 

#install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)

data(Reuters21578)
reuters = Reuters21578

# remove all documents that do not have topic category for classification (remaining 11367)
reuters = tm_filter(reuters, FUN = function(x) !identical(meta(x)[["topics_cat"]] , character(0)))

# some documents appear to be empty -> remove all empty docs (remaining 11305)
reuters = tm_filter(reuters, FUN = function(x) !identical(meta(x)[["heading"]] , character(0)))

# get the trainset and the testset
reuters_lewissplit = tm_filter(reuters, FUN = function(x) meta(x)[["lewissplit"]] == "TRAIN" || meta(x)[["lewissplit"]] == "TEST")

# extract all topics/categories from the train and test sets
allTopics_lewissplit <- sapply(reuters_lewissplit, function(x){x$meta$topics_cat})
classes = unique(unlist(sapply(reuters_lewissplit, function(x){x$meta$topics_cat}), recursive = FALSE, use.names = FALSE))
classes[order(classes)]

# remove dashes because package mlr complains

library(stringr)
classes <- str_replace(classes, "-", ".")

# data frame with logical representation of classes
classesDF = data.frame(matrix(FALSE, ncol = length(classes)+1, nrow = length(allTopics_lewissplit)))
# I am adding the .Class to each class name because mlr complains if the class name is the same as a feature name
classes = paste0(classes, ".Class") 
colnames(classesDF) <- c(classes, c("TRAIN"))


for (i in 1:length(allTopics_lewissplit)) {
  topics =   unique(allTopics_lewissplit[[i]])
  topics <- str_replace(topics, "-", ".")
  topics = paste0(topics, ".Class")
  classesDF[i,topics] = TRUE
  if (meta(reuters_lewissplit[[i]])[["lewissplit"]] == "TRAIN") {
    classesDF[i,"TRAIN"] = TRUE
  }
}

# remove numbers
reuters_lewissplit <- tm_map(reuters_lewissplit, removeNumbers)

# eliminate extra white spaces
reuters_lewissplit <- tm_map(reuters_lewissplit, stripWhitespace)

# convert to lower case
reuters_lewissplit <- tm_map(reuters_lewissplit, content_transformer(tolower))

# remove stop words
reuters_lewissplit <- tm_map(reuters_lewissplit, removeWords, stopwords("english"))

# length(stopwords("english"))
# stopwords("english")

# remove punctuation
reuters_lewissplit <- tm_map(reuters_lewissplit, removePunctuation)


# create Document Term Matrix (DTM)
ndocs <- length(reuters_lewissplit)
# ignore extremely rare words, i.e. terms that appear in less than 1% of the documents
minTermFreq <- ndocs * 0.01
# ignore overly common words i.e. terms that appear in more than 50% of the documents
maxTermFreq <- ndocs * .5


dtm = DocumentTermMatrix(reuters_lewissplit,
                         control = list(
                           wordLengths=c(4, 15),
                           bounds = list(global = c(minTermFreq, maxTermFreq)), 
                           weighting = weightTfIdf
                         ))


dtm.matrix = as.matrix(dtm)

####################################################################################################################
# Multilabel classification 
####################################################################################################################

library(mlr)

# join the dtm with the class labels
# select the label columns by name (this leaves out the TRAIN marker column)
tmp = cbind(data.frame(dtm.matrix), classesDF[, classes])

target = classes
target

reuters.task = makeMultilabelTask(data = tmp, target = target)

# We set a seed, because the classifier chain wrapper uses a random chain order. Next, we train a learner. 
# I chose the classifier chain approach together with a decision tree for the binary classification problems.

binary.learner = makeLearner("classif.rpart")
lrncc = makeMultilabelClassifierChainsWrapper(binary.learner)


# Now let’s train and predict on our dataset:

n = getTaskSize(reuters.task)
train.set = seq(1, 7733, by = 1)
test.set = seq(7734, 10741, by = 1)

set.seed(1729)
reuters.mod.cc = train(lrncc, reuters.task, subset = train.set)
reuters.pred.cc = predict(reuters.mod.cc, task = reuters.task, subset = test.set)

# common multilabel performance measures
listMeasures("multilabel")

##  [1] "multilabel.f1"       "multilabel.subset01" "multilabel.tpr"
##  [4] "multilabel.ppv"      "multilabel.acc"      "timeboth"
##  [7] "timepredict"         "multilabel.hamloss"  "featperc"
## [10] "timetrain"

# classifier chains method performance

performance(reuters.pred.cc, measures = list(multilabel.hamloss, multilabel.subset01, multilabel.f1, multilabel.acc))

It fails at the line: reuters.mod.cc = train(lrncc, reuters.task, subset = train.set)

Any insights would be greatly appreciated!

Thank you, Laura

laura7s
  • It sounds like when you partitioned the data you ended up with a partition that has only one label in it. You may want to try stratifying the partitions. Are your classes very unbalanced? – Lars Kotthoff Jul 24 '18 at 12:17
  • Hi Lars, do you mean that only one document is TRUE for this class in the partition, or that all documents that are TRUE for this class are not TRUE for any other class? I thought the second case might be a problem, so I added a dummy class where all documents were TRUE as a test, but this gave me the same error. So I guess you are referring to the first scenario? – laura7s Jul 24 '18 at 14:58
  • The error would have been caused by the second. Not sure what's going on then, could you post a complete reproducible example please? – Lars Kotthoff Jul 24 '18 at 20:08
  • Dear Lars, thank you, I have added the code to the original comment. – laura7s Jul 24 '18 at 21:37
  • I can't reproduce the error; the package "tm.corpus.Reuters21578" is not available. – Lars Kotthoff Jul 25 '18 at 08:39
  • @Lars, You can install from here: install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at") library(tm.corpus.Reuters21578) data(Reuters21578) – laura7s Jul 25 '18 at 12:46
  • Ok, thanks. It looks like the issue is that one of the classes has no positive examples in one of the splits (probably one of the small classes). The easiest way to fix this is to use a classifier that supports oneclass. – Lars Kotthoff Jul 25 '18 at 14:58
  • Ok, I understand the problem now. Thank you! – laura7s Jul 29 '18 at 16:06
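
The diagnosis in the comments (a label with no positive examples in one of the splits) can be checked directly. The following is a minimal sketch in base R, not code from the thread: the toy data frame, its column names, and the 4-row training split are made up for illustration, and dropping the affected labels is only one possible workaround alongside the one-class-capable learner suggested above.

```r
# Toy stand-in for the real data: two feature columns plus three logical
# label columns. All names and values here are hypothetical.
toy <- data.frame(
  oil              = c(0.1, 0.0, 0.3, 0.0, 0.2, 0.0),
  wheat            = c(0.0, 0.4, 0.0, 0.1, 0.0, 0.5),
  crude.Class      = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  grain.Class      = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  cottonseed.Class = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)
labels    <- c("crude.Class", "grain.Class", "cottonseed.Class")
train.set <- 1:4  # rows used for training (hypothetical split)

# A label is a one-class problem in the training rows if it takes only one
# distinct value there (all FALSE or all TRUE).
n.values <- vapply(
  toy[train.set, labels, drop = FALSE],
  function(col) length(unique(col)),
  integer(1)
)
one.class <- names(n.values)[n.values < 2]
print(one.class)

# One possible workaround: drop those labels before building the task.
keep.labels <- setdiff(labels, one.class)
toy.clean   <- toy[, setdiff(names(toy), one.class), drop = FALSE]
```

With the objects from the question, the same check would presumably be run on tmp[train.set, target], and the surviving label names passed as the target to makeMultilabelTask. Note that this removes rare labels from the problem entirely, which changes what the performance measures report.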
