
I'm currently preprocessing a Korean corpus in R using KoNLP.

library(stringr)
library(tm)
library(KoNLP)
library(dplyr)
library(rJava)
useNIADic()

myfunc_extract <- function(doc){
  doc <- as.character(doc)
  doc2 <- paste(SimplePos22(doc))
  doc3.nc <- str_match(doc2, '([가-힣]{2,}+)/[N][C]')
  doc4.nc <- doc3.nc[,2]
  doc3.pv <- str_match(doc2, '([가-힣]{2,}+)/[P][V]')
  doc3.pv <- doc3.pv[,1]
  doc4.pv <- gsub("/PV", "다", doc3.pv)
  doc3.pa <- str_match(doc2, '([가-힣]{2,}+)/[P][A]')
  doc3.pa <- doc3.pa[,1]
  doc4.pa <- gsub("/PA", "다", doc3.pa)
  doc5 <- rbind(doc4.nc, doc4.pv, doc4.pa)
  doc5[!is.na(doc5)]
}
stop <- read.csv("stopwords.c(U).csv")
stop <- as.character(stop)
text <- c("화질 좋고 시야각이 넓어서 좋아요  스마트폰 연동 만족")
text_cor <- VCorpus(VectorSource(text))
text_cor_tdm <- TermDocumentMatrix(text_cor,
                                   control = list(tokenize = myfunc_extract,
                                                  wordLength = c(2, Inf),
                                                  removePunctuation = TRUE,
                                                  removeNumbers = FALSE,
                                                  stopwords = stop,
                                                  weighting = weightBin))
tdm_stem <- as.matrix(text_cor_tdm)
View(tdm_stem)

"myfunc_extract" is a function that extracts Korean nouns, verbs, and adjectives.

I intended to extract "화질" (noun), "시야각" (noun), and "스마트폰" (noun).

The other expressions were supposed to be filtered out by the stopwords.

However, tdm_stem (the TDM as a matrix) contains only "시야각" and "스마트폰"; "화질" is missing.

I did not add "화질" to the stopwords.

So something must be going wrong inside TermDocumentMatrix, but I don't know what.

What do you think is the problem here?
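
One possibility worth ruling out, offered as an assumption rather than a confirmed diagnosis: tm documents this control option as wordLengths (plural), with a default of c(3, Inf). If the wordLength entry in the control list above is silently ignored, the default minimum of three characters would apply and two-character terms would be dropped. The base-R sketch below only imitates such a length filter on a hypothetical token vector; it does not call tm or KoNLP, and filter_by_length is an illustrative helper, not a tm function.

```r
# Hypothetical tokens, standing in for what the tokenizer might return.
tokens <- c("화질", "시야각", "스마트폰")

# Illustrative helper (not part of tm): keep only tokens whose
# character count lies within [min_len, max_len].
filter_by_length <- function(x, min_len = 3, max_len = Inf) {
  n <- nchar(x, type = "chars")
  x[n >= min_len & n <= max_len]
}

# Assumed default behaviour: minimum length 3.
kept_default <- filter_by_length(tokens)

# Intended behaviour: minimum length 2.
kept_two <- filter_by_length(tokens, min_len = 2)

print(kept_default)
print(kept_two)
```

If this is indeed the cause, spelling the option as wordLengths=c(2,Inf) in the control list would be the first thing to try.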

K.K.SAN