
I'm reading a Korean text file and trying to remove the most frequent terms (stopwords) and the least frequent terms from a Term Document Matrix generated in R. With the code below I can build the TDM, but it contains weights for every term in the documents. Is there a way to remove such terms so that the TDM keeps only the terms that carry meaning? Thanks

library(ktm)     # provides tokenizer(); not on CRAN (see comments below)
library(readr)   # read_csv(), locale()
library(udpipe)  # document_term_frequencies(), document_term_matrix()
library(tm)      # as.DocumentTermMatrix(), weightTfIdf

old <- read_csv(file = "Past-Korean1.csv",
                locale = locale(date_names = "ko", encoding = "UTF-8"))
q <- tokenizer(old$Description, token = "tag")
y_ko <- document_term_frequencies(q[, c("text_id", "word")])
tdm_ko <- document_term_matrix(y_ko)
tdm_ko <- as.DocumentTermMatrix(tdm_ko, weighting = weightTfIdf)
train1_ko <- as.matrix(tdm_ko)
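If the matrix comes from udpipe's document_term_matrix() (the functions above match that package's API), udpipe itself ships helpers for exactly this kind of pruning. A minimal sketch, assuming tdm_ko is the sparse matrix from document_term_matrix() before the tm conversion, and with an arbitrary example frequency threshold:

```r
library(udpipe)     # dtm_remove_terms(), dtm_remove_lowfreq()
library(stopwords)  # multilingual stopword lists, incl. stopwords-iso

# Most frequent "noise" terms: remove a Korean stopword list
korean_stops <- stopwords::stopwords("ko", source = "stopwords-iso")
tdm_ko <- dtm_remove_terms(tdm_ko, terms = korean_stops)

# Least frequent terms: drop anything occurring fewer than 5 times
# (the threshold 5 is illustrative; tune it for your corpus)
tdm_ko <- dtm_remove_lowfreq(tdm_ko, minfreq = 5)
```

After this pruning, the remaining matrix can be converted and weighted with as.DocumentTermMatrix(..., weighting = weightTfIdf) as in the question.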
Kailash Sharma
  • Where does the package ktm come from? It is not available on CRAN. But normally there should be a stopwords function to remove common words. – phiver Jul 31 '18 at 12:56
  • @phiver I tried removing the stopwords, but if I'm not wrong I'll need a corpus first to do that. Also, I'm not sure whether the line below will work for Korean stopwords if I replace "en" with "ko": corpus <- tm_map(corpus, removeWords, stopwords("en")) – Kailash Sharma Aug 01 '18 at 04:58
  • The line will not work, as there are no stopwords for "ko" in Snowball. You need to get the stopwords from the stopwords package like this: `korean_stops <- stopwords::stopwords("ko", source = "stopwords-iso")`. Then you can remove the words. And yes, you need a corpus first. – phiver Aug 01 '18 at 10:42
  • Okay, thank you for the help @phiver. – Kailash Sharma Aug 02 '18 at 05:43
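Putting phiver's suggestion together, a minimal tm-based sketch could look like the following. It assumes the raw Korean text lives in old$Description; the sparsity cutoff passed to removeSparseTerms() is an illustrative value, not a recommendation:

```r
library(tm)
library(stopwords)

# Korean stopword list from the stopwords-iso source, as suggested above
korean_stops <- stopwords::stopwords("ko", source = "stopwords-iso")

# Build a corpus first, then strip the stopwords
corpus <- VCorpus(VectorSource(old$Description))
corpus <- tm_map(corpus, removeWords, korean_stops)

# TF-IDF weighted document-term matrix
tdm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Drop the rarest terms: keep terms absent from at most 98% of documents
# (0.98 is an arbitrary example cutoff)
tdm <- removeSparseTerms(tdm, 0.98)
```

Note that tm's built-in tokenizer is not tag-aware, so this route skips the tokenizer(..., token = "tag") step from the question; for morphology-aware Korean tokenization you would still need a dedicated tokenizer.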

0 Answers