How can I use TermDocumentMatrix for Persian text in R?

Question

I want to view term frequencies in documents that contain Persian text. I used R as follows:

keycorpus <- Corpus(DirSource("E:\\Sample\\farsi texts"))
tm.matrix <- TermDocumentMatrix(keycorpus)
View(as.matrix(tm.matrix))

Although this code works for English texts, it does not work on Persian texts. How can I do this?

Please add the error and, if you don't mind, a portion of the Farsi text. — amonk, Jun 14 '17 at 09:22
The encoding is UTF-8. There is no error, but the output of TermDocumentMatrix in this case contains only numbers and punctuation; the Persian terms are ignored. — M.Rabiei, Jun 18 '17 at 06:25

saeed_ans · Answer 1 · 2018-01-14T08:01:57.493

Suppose that you have a text file named 1.txt. Then:

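 library(tm)
 # Switch to a Persian locale ("Persian" is the locale name used here on Windows)
 Sys.setlocale(locale = "Persian", category = "LC_ALL")
 setwd("E:\\Sample\\farsi_texts")
 # Read the file line by line; adjust the encoding to match the file (e.g. "UTF-8")
 text <- readLines("1.txt", encoding = "windows-1256")
 # Each line of the file becomes one document in the corpus
 keycorpus <- Corpus(VectorSource(text))
 tm.matrix <- TermDocumentMatrix(keycorpus)
 View(as.matrix(tm.matrix))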

It shows how many times each word occurs in each line. You can use this code to aggregate the counts across lines:

tm.iteration <- as.data.frame(apply(tm.matrix, 1, sum))
View(as.matrix(tm.iteration))
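
Since the comments mention that the files are UTF-8, it may help to check, independently of tm, that the Persian tokens survive reading. What follows is a minimal base-R sketch, not the method from the answer above. The sample strings, the temporary file, and the names sample_lines, tokens, terms, counts and totals are all illustrative assumptions, and splitting on single spaces is a simplification of real tokenization.

```r
# Hypothetical sample: two short Persian lines written as Unicode escapes,
# so that this script itself stays ASCII-only.
sample_lines <- c("\u0633\u0644\u0627\u0645 \u062f\u0646\u06cc\u0627",
                  "\u0633\u0644\u0627\u0645 \u0633\u0644\u0627\u0645")

# Write the bytes unchanged to a temporary UTF-8 file (a stand-in for 1.txt).
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(sample_lines), path, useBytes = TRUE)

# Read the file back and mark the strings as UTF-8.
text <- readLines(path, encoding = "UTF-8")

# Split each line on spaces and drop empty tokens.
tokens <- lapply(strsplit(text, " ", fixed = TRUE),
                 function(tok) tok[nzchar(tok)])

# Distinct terms, in order of first appearance.
terms <- unique(unlist(tokens))

# Term-by-line count matrix: one row per term, one column per line.
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = terms))),
                 integer(length(terms)))
counts <- matrix(counts, nrow = length(terms),
                 dimnames = list(terms, seq_along(text)))

# Total occurrences of each term across all lines.
totals <- rowSums(counts)
print(counts)
print(totals)
```

If the Persian terms show up here but not in the TermDocumentMatrix, the cause is more likely the locale or the tokenization step than the file itself.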
