
I am trying to produce a wordcloud and obtain word frequencies for a Chinese speech using R, jiebaR, and corpus, but cannot make a corpus. Here is my code:

library(jiebaR)
library(stringr)
library(corpus)

cutter <- worker()

v36 <- readLines('v36.txt', encoding = 'UTF-8')

seg_x <- function(x) {str_c(cutter[x], collapse = '')}

x.out <- sapply(v36, seg_x, USE.NAMES = FALSE)

v36.seg <- x.out
v36.seg

library(quanteda)

corpus <- corpus(v36.seg)  #Error begins here.
summary(corpus, showmeta = TRUE, 1)
texts(corpus)[1]

tokens(corpus, what = 'fasterword')[1]

tokens <- tokens(v36.seg, what = 'fasterword')
dfm <- dfm(tokens)
dfm


My text file comprises the following paragraphs:

為了盡量減低疫情在社區擴散的風險,政府發言人今日(一月二十八日)宣布,政府部門將會於農曆新年假期後(即一月二十九日起)實施特別上班安排。除了提供緊急和必須公共服務的人員外,政府僱員在假期後無需返回寫字樓辦公,而是留在家中工作。

有關安排暫定實行至二月二日,政府屆時會再檢視情況。

政府發言人呼籲私人機構在可行情況下,作類似安排。

個別政府部門會就受影響的公共服務作出公布。

Error begins when I create a corpus. R returns:

Error in corpus.default(v36.seg) : 
  corpus() only works on character, corpus, Corpus, data.frame, kwic objects.

I don't understand why the text is problematic. Grateful if you can help me solve the problem. Thank you.

  • Check whether x.out is actually a character vector or data.frame. Also, why don't you read the text directly into quanteda? And please don't paste a picture of your text in the question; just put the text in here so we can copy it into our R sessions. – phiver Jan 28 '20 at 09:52
  • @phiver 1. typeof (x.out) returns list, not character nor df. 2. Because I'll import a longer text into R in the future. The above Chinese text is a test case. 3. I'm afraid some stackoverflow-ers cannot view Chinese characters properly here. Here is the original text: 為了盡量減低疫情在社區擴散的風險,政府發言人今日(一月二十八日)宣布,政府部門將會於農曆新年假期後(即一月二十九日起)實施特別上班安排。除了提供緊急和必須公共服務的人員外,政府僱員在假期後無需返回寫字樓辦公,而是留在家中工作。   有關安排暫定實行至二月二日,政府屆時會再檢視情況。   政府發言人呼籲私人機構在可行情況下,作類似安排。   個別政府部門會就受影響的公共服務作出公布。 – ronzenith Jan 28 '20 at 14:17

2 Answers


Given the text example in your comments, I put those paragraphs in a text file. Next, following Ken's instructions, you will see that the text is nicely available in quanteda. From there you can do all the NLP you need. Do check out the Chinese example on the quanteda reference pages.

Disclaimer: I can't seem to paste the Chinese example text from your comment into this answer as the system thinks I'm putting in spam :-(

library(quanteda)
library(readtext)

v36 <- readtext::readtext("v36.txt", encoding = "UTF-8")

my_dfm <- v36 %>%
  corpus() %>%
  tokens(what = "word") %>%
  dfm()

# show frequency to check if words are available.
dplyr::as_tibble(textstat_frequency(my_dfm))

# A tibble: 79 x 5
   feature frequency  rank docfreq group
   <chr>       <dbl> <int>   <dbl> <chr>
 1 ,              6     1       1 all  
 2 政府            6     1       1 all  
 3 。              5     3       1 all  
 4 在              3     4       1 all  
 5 的              3     4       1 all  
 6 安排            3     4       1 all  
 7 發言人          2     7       1 all  
 8 (              2     7       1 all  
 9 一月            2     7       1 all  
10 )              2     7       1 all  
# ... with 69 more rows
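
Since the original goal was a wordcloud, here is a minimal sketch building on the my_dfm object above. It assumes quanteda's textplot_wordcloud() is available (in quanteda >= 3.0 it lives in the separate quanteda.textplots package); the punctuation patterns dropped below are just the ones visible in the frequency table.

library(quanteda)

# Punctuation dominates the frequency table, so drop it before plotting
my_dfm_clean <- dfm_remove(my_dfm, pattern = c(",", "。", "(", ")"))

set.seed(100)  # fix the random layout so the cloud is reproducible
textplot_wordcloud(my_dfm_clean, min_count = 1, color = "darkblue")
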
phiver
  • Your link to the Chinese example seems very useful. Will read it in a moment. One thing: I see the comment "to set the font correctly for macOS". What if I use Windows and Ubuntu? – ronzenith Jan 28 '20 at 14:59
  • @ronzenith, Ubuntu shouldn't be an issue; Linux tends to be better at handling non-Latin character sets. Windows is always an issue, which is why I specify UTF-8 encoding when reading in the text. – phiver Jan 28 '20 at 15:04
  • Yes, I use Windows, and need to set R to read Chinese at the beginning: Sys.setlocale(locale = "cht") – ronzenith Jan 28 '20 at 15:12
0


Impossible to tell without a reproducible example, but I can suggest two things that will probably solve this. First, simplify reading your text file by using the readtext package. Second, you definitely want the "word" tokenizer, not "fasterword", which simply splits on whitespace (and Chinese does not use whitespace between words). "word" knows the Chinese word boundaries.

Try this:

library("quanteda")

readtext::readtext("v36.txt") %>%
    corpus() %>%
    tokens(what = "word") %>%
    dfm()
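
To see the difference between the two tokenizers, here is a quick self-contained sketch; the sentence is adapted from the question's text.

library(quanteda)

txt <- "政府發言人今日宣布特別上班安排。"

# "fasterword" only splits on whitespace, so this whole sentence
# comes back as a single token:
tokens(txt, what = "fasterword")

# "word" uses ICU word-boundary rules, which know where Chinese
# words begin and end, so the sentence is segmented into words:
tokens(txt, what = "word")
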
Ken Benoit