I'm currently preprocessing a Korean corpus in R using KoNLP.
library(stringr)
library(tm)
library(KoNLP)
library(dplyr)
library(rJava)
useNIADic()
myfunc_extract <- function(doc){
  doc <- as.character(doc)
  doc2 <- paste(SimplePos22(doc))
  # nouns (NC): keep only the captured Hangul part
  doc3.nc <- str_match(doc2, '([가-힣]{2,}+)/[N][C]')
  doc4.nc <- doc3.nc[,2]
  # verbs (PV): keep the full match and replace the tag with "다"
  doc3.pv <- str_match(doc2, '([가-힣]{2,}+)/[P][V]')
  doc3.pv <- doc3.pv[,1]
  doc4.pv <- gsub("/PV", "다", doc3.pv)
  # adjectives (PA): same treatment as verbs
  doc3.pa <- str_match(doc2, '([가-힣]{2,}+)/[P][A]')
  doc3.pa <- doc3.pa[,1]
  doc4.pa <- gsub("/PA", "다", doc3.pa)
  # combine everything and drop the non-matches
  doc5 <- rbind(doc4.nc, doc4.pv, doc4.pa)
  doc5[!is.na(doc5)]
}
stop=read.csv("stopwords.c(U).csv")
stop=as.character(stop)
text = c("화질 좋고 시야각이 넓어서 좋아요 스마트폰 연동 만족")
text_cor = VCorpus(VectorSource(text))
text_cor_tdm = TermDocumentMatrix(text_cor,
                                  control=list(tokenize=myfunc_extract,
                                               wordLength=c(2,Inf),
                                               removePunctuation=T,
                                               removeNumbers=F,
                                               stopwords=stop,
                                               weighting=weightBin))
tdm_stem=as.matrix(text_cor_tdm)
View(tdm_stem)
"myfunc_extract" is a function that extracts nouns, verbs, and adjectives from Korean text, based on the part-of-speech tags returned by SimplePos22.
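To make the extraction step concrete, here is a minimal sketch of the noun pattern applied outside of KoNLP. The `tagged` vector is hypothetical: it is written by hand and only assumes that SimplePos22 returns strings of the form morpheme/TAG joined by "+". It is not actual KoNLP output.

```r
library(stringr)

# Hypothetical tagged tokens in the assumed "morpheme/TAG" form
# (hand-written for illustration, not produced by SimplePos22 here).
tagged <- c("화질/NC", "좋/PA+고/EC", "넓/PA+어서/EC", "만족/NC")

# Same noun pattern as in myfunc_extract:
# two or more Hangul syllables directly before the /NC tag.
nouns <- str_match(tagged, "([가-힣]{2,}+)/[N][C]")[, 2]

# Drop the tokens that did not match the noun pattern.
nouns <- nouns[!is.na(nouns)]
print(nouns)
```

Running the tokenizer on its own like this, separately from TermDocumentMatrix, may help show whether a term is lost during extraction or later in the pipeline.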
My intention was to extract "화질" (noun), "시야각" (noun), and "스마트폰" (noun).
The other expressions should be filtered out by the stopwords.
However, tdm_stem (the term-document matrix) contains only "시야각" and "스마트폰"; "화질" is missing.
I did not add "화질" to the stopwords.
So something must be going wrong inside TermDocumentMatrix, but I don't know what.
What do you think the problem is?