
I have a problem moving from a tm object to a koRpus object. I have to normalize a corpus with tm tools, lemmatize the result with koRpus, and then return to tm to categorize the output. To do this I currently transform the tm object into an R data frame, export that to an Excel file, resave it as a UTF-8 CSV, and finally read it into a koRpus object. This is the code:

#from VCORPUS to DATAFRAME 
dataframeD610P <- data.frame(text = unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors = FALSE)

#from DATAFRAME to XLSX 
library(xlsx)
write.xlsx(dataframeD610P, ".\\mycorpus.xlsx")   #write.xlsx() is documented to take a data frame

#open with excel 
#save in csv (UTF-8)
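
In hindsight, the Excel round trip could probably be skipped by writing a UTF-8 CSV directly from R, something like this (untested):

#from DATAFRAME straight to CSV (UTF-8), without going through Excel
write.csv(dataframeD610P, ".\\mycorpus.csv", row.names = FALSE, fileEncoding = "UTF-8")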

#import in KORPUS and lemmatization with KORPUS/TREETAGGER 

tagged.results <- treetag(".\\mycorpus.csv", treetagger = "manual", lang = "it", sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options = list(path = "C:/TreeTagger", preset = "it-utf8", no.unknown = TRUE))

Then I need to do it all backwards to get back to tm. This is the code:

#from KORPUS to TXT 
write.table(tagged.results@TT.res$lemma, ".\\mycorpusLEMMATIZED.txt")

#open with a text editor and reformat the text by hand

#from TXT to R
Lemma1.POS<- readLines(".\\mycorpusLEMMATIZEDfrasi.txt", encoding = "UTF-8")

#from R object to DATAFRAME (as.data.frame() has no encoding argument; readLines already handled the encoding)
Lemma2.POS <- as.data.frame(Lemma1.POS, stringsAsFactors = FALSE)

#from DATAFRAME to CORPUS
CorpusPOSlemmaFINAL <- Corpus(VectorSource(Lemma2.POS$Lemma1.POS))
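
One direction I've been considering, but have not tested, is to lemmatize each document separately and rebuild the VCorpus entirely in memory, so the document boundaries survive and the manual file editing goes away. A rough sketch (the temp-file trick is just my own workaround to feed treetag one document at a time):

library(tm)
library(koRpus)

#untested sketch: tag each document on its own and rebuild the corpus in memory
lemmatized <- sapply(Corpus.TotPOS, function(doc) {
  tmp <- tempfile(fileext = ".txt")
  con <- file(tmp, open = "w", encoding = "UTF-8")
  writeLines(content(doc), con)
  close(con)
  tagged <- treetag(tmp, treetagger = "manual", lang = "it",
                    sentc.end = c(".", "!", "?", ";", ":"),
                    TT.options = list(path = "C:/TreeTagger", preset = "it-utf8",
                                      no.unknown = TRUE))
  paste(tagged@TT.res$lemma, collapse = " ")
})
CorpusPOSlemmaFINAL <- Corpus(VectorSource(lemmatized))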

Is there a more elegant way to do all of this without leaving R and without the manual file editing? I'd really appreciate any help or feedback.

BTW, does anyone know how to ask tm which document inside a VCorpus contains a specific token? I usually transform the corpus into a data frame to identify the document. Is there a way to do this directly in tm?
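
At the moment my workaround boils down to the snippet below ("parola" is just a placeholder token); I've seen tm_filter()/tm_index() mentioned, but I haven't tried them yet:

#current workaround: check each document's content for the token
which(sapply(Corpus.TotPOS, function(doc) any(grepl("parola", content(doc), fixed = TRUE))))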

Giorjet

1 Answer


Thanks to unDocUMeantIt, it's possible to find some answers here: https://github.com/unDocUMeantIt/koRpus/issues/6
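
If I read that issue correctly, recent koRpus versions can tag a character object directly (see the format argument of treetag()), which removes the file round trip entirely. An untested sketch, assuming such a version is installed:

#untested sketch: feed the corpus text to treetag() as an object instead of a file
testi <- unlist(sapply(Corpus.TotPOS, `[`, "content"))
tagged.results <- treetag(testi, format = "obj", treetagger = "manual", lang = "it",
                          sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options = list(path = "C:/TreeTagger", preset = "it-utf8", no.unknown = TRUE))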

Giorjet