
I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using quanteda and have written several R functions to run on each group of articles. My work steps are:

  1. Group the articles by quarter (a sketch of one way to label the documents by quarter follows this list).
  2. Produce a feature co-occurrence matrix (FCM) for the articles in each quarter (Function 1).
  3. Take the column from this matrix for the 'term' I am interested in and convert this to a data.frame (Function 2).
  4. Merge the data.frames for each quarter together, then produce a large csv file with a column for each quarter and a row for each co-occurring term.
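
For step 1, a minimal sketch of how the quarter labels could be assigned, assuming the corpus has a "date" docvar holding each article's date (the object names here are only illustrative):

library("quanteda")
corp <- corpus(data_all)                    # data_all: hypothetical object holding all articles
dates <- as.Date(docvars(corp, "date"))
docvars(corp, "quarter") <- paste0(format(dates, "%y"), tolower(quarters(dates)))  # e.g. "14q4"
data_14q4 <- corpus_subset(corp, quarter == "14q4")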

This seems to work okay. But I wondered if anybody more skilled in R might be able to check whether what I am doing is correct, or might suggest a more efficient way of doing it?

Thanks for any help!

#Function 1 to produce the FCM

get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")          # Chinese stopword list
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop)
  fcm(toks, context = "window", window = 1, tri = FALSE)
}

fcm_14q4 <- get_fcm(data_14q4)
fcm_15q1 <- get_fcm(data_15q1)

#Function 2 to select the column for the 'term' of interest (such as China 中国) and make a data.frame

convert2df <- function(fcmat, term) {
  mat_term <- fcmat[, term]                  # keep only the column for the term of interest
  df <- convert(mat_term, to = "data.frame")
  colnames(df)[1] <- "Term"
  colnames(df)[2] <- "Freq"
  df[order(-df$Freq), ]                      # sort by descending co-occurrence count
}

CH14q4 <- convert2df(fcm_14q4, "中国")
CH15q1 <- convert2df(fcm_15q1, "中国")

#Merging the data.frames

df <- merge(x=CH14q4, y=CH15q1, by="Term", all.x=TRUE, all.y=TRUE)
df <- merge(x=df, y=CH15q2, by="Term", all.x=TRUE, all.y=TRUE) #etc for all the dataframes... 
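
If there are many quarters, the repeated merge() calls could also be collapsed into a single step with Reduce() (a sketch; df_list below is a hypothetical list holding the quarterly data.frames):

df_list <- list(CH14q4, CH15q1)   # add the remaining quarterly data.frames here
df_all <- Reduce(function(x, y) merge(x, y, by = "Term", all = TRUE), df_list)
write.csv(df_all, "cooccurrence_by_quarter.csv", row.names = FALSE)   # illustrative file name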

UPDATE: Following Ken's advice in the comments below, I have tried doing it a different way, using the window argument of tokens_select() and then a document-feature matrix (dfm). After labelling the corpus documents according to their quarter, the following R function should take the tokenized corpus toks and then produce a data.frame of the number of times words co-occur within a specified window of a term.

COOCdfm <- function(toks, term, window) {
  ch_stop <- stopwords("zh", source = "misc")
  cooc_toks <- tokens_select(toks, term, window = window)   # keep only tokens within the window around the term
  cooc_toks2 <- tokens(cooc_toks, remove_punct = TRUE)
  cooc_toks3 <- tokens_remove(cooc_toks2, ch_stop)
  dfmat <- dfm(cooc_toks3)
  dfmat_grouped <- dfm_group(dfmat, groups = "quarter")      # aggregate counts by the "quarter" docvar
  counts <- convert(t(dfmat_grouped), to = "data.frame")
  colnames(counts)[1] <- "Feature"
  return(counts)
}
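
For example, assuming the tokens object carries a "quarter" docvar (e.g. "14q4", "15q1", ...), a call might look like:

counts_by_quarter <- COOCdfm(toks, "中国", window = 5)
head(counts_by_quarter)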
Nick Olczak

1 Answer


If you are interested in counting co-occurrences within a window for specific target terms, a better way is to use the window argument of tokens_select(), and then to count occurrences from a dfm on the window-selected tokens.

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

toks <- tokens(data_corpus_inaugural)

dfmat <- toks %>%
  tokens_select("nuclear", window = 5) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()

topfeatures(dfmat)[-1]
##     weapons      threat        work       earth elimination         day 
##           6           3           2           2           2           1 
##         one        free       world 
##           1           1           1

Here I've first done a "conservative" tokenisation to keep everything, then performed the context selection. I then processed that further to remove punctuation and stopwords before tabulating the results in a dfm. This dfm will be large and very sparse, but you can summarise the top co-occurring words using topfeatures() or quanteda.textstats::textstat_frequency().
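
For instance, a quick tabulation of the most frequent co-occurring features (a sketch; it assumes the quanteda.textstats package is installed):

library("quanteda.textstats")
textstat_frequency(dfmat, n = 10)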

Ken Benoit
  • Thanks. I tried the two methods side by side and they gave identical results for the co-occurrence frequency within all the documents in each month (with dfm_group() it is easy to aggregate the frequency counts by month). I would rather count co-occurrence totals per quarter though, if there's any way to do that? The whole corpus is large, so the tokenization of the whole thing takes a while. – Nick Olczak Aug 13 '21 at 12:34
  • Why not use `tokens_group()` or `dfm_group()` by a quarterly variable instead of monthly? – Ken Benoit Aug 13 '21 at 19:23
  • Ah, yes, perfect! Not sure why I didn't think of that before. I created a new docvar for quarter, then grouped the documents with that using dfm_group() and it worked perfectly. There's a slight discrepancy of ±1 in the counts compared with the fcm method before, which I don't think matters, but I'm curious about the reason for it. Anyway, thanks again for your help. – Nick Olczak Aug 14 '21 at 08:46