
I hope the text mining gurus out there, including non-Koreans, can help me with my very specific question.

I'm currently trying to create a Document Term Matrix (DTM) from a free-text variable that contains a mix of English and Korean words.

First, I used the cld3::detect_language function to remove observations with non-Korean text from the data.
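
Roughly what that step looked like (a minimal sketch; df and its text column stand in for my actual data):

    library(cld3)

    lang  <- detect_language(df$text)          # returns ISO codes such as "ko", or NA
    df_ko <- df[!is.na(lang) & lang == "ko", ] # keep rows detected as Korean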

Second, I used the KoNLP package to extract only the nouns from the filtered (Korean-only) data.
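
The extraction itself, sketched with the same assumed names (useNIADic() loads a dictionary first; useSejongDic() works too):

    library(KoNLP)
    useNIADic()

    nouns <- lapply(df_ko$text, extractNoun)  # one character vector of nouns per document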

Third, I know that the tm package makes it fairly easy to create a DTM.
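
For example, the straightforward tm route on the filtered text:

    library(tm)

    corpus <- VCorpus(VectorSource(df_ko$text))
    dtm    <- DocumentTermMatrix(corpus)  # tokenises on whitespace, so inflected forms stay separate terms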

The issue is that when I use the tm package to create the DTM, there is no way to restrict it to the nouns I extracted. This is not a problem with English text, but Korean is a different story. For example, with KoNLP I can extract the noun stem "훌륭" from the inflected forms "훌륭히", "훌륭한", "훌륭하게", "훌륭하고", "훌륭했던", etc., whereas the tm package treats all of these as separate terms when creating the DTM.

Is there any way I can create a DTM based on the nouns extracted with the KoNLP package?

I realize that if you're not Korean, you may have difficulty understanding my question. I'm hoping someone can point me in the right direction.

Much appreciated in advance.

Brian
  • you could have a look at udpipe, which can handle Korean. You can use it after removing the observations with non-Korean text. After annotating the text, you select all the nouns and put those into tm (or quanteda / tidytext) to get a DTM; see the sketch after these comments. – phiver May 30 '22 at 10:59
  • Thanks for your comment. What do you mean by "annotating the text"? I'm looking into udpipe at the moment and it looks rather promising. – Brian May 30 '22 at 12:57
  • All sorted! Thank you so much for your help! This udpipe package is quite impressive! – Brian May 30 '22 at 16:22
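
A sketch of the workflow phiver suggests, using the same df_ko as above; the choice of the korean-gsd model is an assumption, not something confirmed in the comments:

    library(udpipe)

    # download and load a Korean model ("korean-kaist" is another option)
    m     <- udpipe_download_model(language = "korean-gsd")
    ud_ko <- udpipe_load_model(m$file_model)

    # "annotating" = tokenising, lemmatising and POS-tagging each document
    anno <- as.data.frame(udpipe_annotate(ud_ko, x = df_ko$text))

    # keep the nouns only, then build the DTM with udpipe's own helpers
    noun_rows <- subset(anno, upos %in% c("NOUN", "PROPN"))
    dtf <- document_term_frequencies(noun_rows, document = "doc_id", term = "lemma")
    dtm <- document_term_matrix(dtf)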
