I just ran into this problem. I was using a data frame with several thousand columns created from words and word splits. One of my columns ended up with the name "in" and another with the name "if". When I try to do something like data$in, R stops with an "unexpected 'in'" error. See the example:
require(tm)
text <- data.frame(colText = c("namein", "Inmortal"))
corpus <- Corpus(DataframeSource(text))
corpus[[1]]
<<PlainTextDocument (metadata: 7)>>
namein
ctrl <- list(tokenize = strsplit_character_tokenizer, wordLengths = c(1, Inf))
dtm <- DocumentTermMatrix(corpus, control = ctrl)
str(dtm)
dtm$dimnames$Terms
[1] "a" "al" "e" "ein" "i" "in" "inm" "inmo" "l" "m" "me" "mo" "n" "na" "nam" "name" "o" "ort"
[19] "r" "rt" "rtal" "t"
dtmF <- as.data.frame(inspect(dtm))
dtmF$inm
[1] 0 1
dtmF$in
Error: unexpected 'in' in "dtmF$in"
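The same error can be reproduced without tm. This is only a minimal sketch: df is a hypothetical toy data frame, not my real data, and the parse call is wrapped in tryCatch so the snippet runs to the end.

```r
# Hypothetical toy data frame; check.names = FALSE keeps "in" as a column name
df <- data.frame("in" = c(0, 1), "inm" = c(0, 1), check.names = FALSE)
names(df)

# df$inm is fine
df$inm

# df$in is rejected while the code is being parsed, before any column lookup,
# so the error message is captured here instead of stopping the script
res <- tryCatch(parse(text = "df$in"), error = function(e) conditionMessage(e))
print(res)
```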
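The tokenizer used above (it has to be defined before running the example) is:
strsplit_character_tokenizer <- function(x) {
  # Split x into chunks of 1, 2, 3 and 4 alphanumeric characters
  r <- list()
  max_len <- 4
  for (i in 1:max_len) {
    # Insert a space after every run of i alphanumeric characters, then split on spaces
    reg <- paste("([[:alnum:]]{", i, "})", sep = "")
    tmp <- unlist(strsplit(gsub(reg, "\\1 ", x), " "))
    r <- c(r, tmp)
  }
  return(unlist(r))
}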
As a result, when I train an SVM for classification, it crashes. How can I overcome this issue? I could rename some of those columns, but I would like a more generic solution. Thanks.