I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using quanteda and have written several R functions to run on each group of articles. My workflow is:
- Group the articles by quarter.
- Produce a feature co-occurrence matrix (FCM) for the articles in each quarter (Function 1).
- Take the column of this matrix for the 'term' I am interested in and convert it to a data.frame (Function 2).
- Merge the data.frames for each quarter together, then produce a large CSV file with a column for each quarter and a row for each co-occurring term.
This seems to work, but I wondered whether anybody more skilled in R could check that what I am doing is correct, or suggest a more efficient way of doing it?
Thanks for any help!
# Function 1: produce the FCM for one quarter's articles
library(quanteda)  # quanteda also re-exports the %>% pipe

get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>% tokens_remove(ch_stop)
  fcm(toks, context = "window", window = 1, tri = FALSE)
}
fcm_14q4 <- get_fcm(data_14q4)
fcm_15q1 <- get_fcm(data_15q1)
# Function 2: select the column for the 'term' of interest (such as China, 中国)
# and turn it into a data.frame sorted by co-occurrence frequency
convert2df <- function(fcmat, term) {
  df <- convert(fcmat[, term], to = "data.frame")
  colnames(df) <- c("Term", "Freq")
  df[order(-df$Freq), ]
}
CH14q4 <- convert2df(fcm_14q4, "中国")
CH15q1 <- convert2df(fcm_15q1, "中国")
# Merging the quarterly data.frames into one wide table
df <- merge(CH14q4, CH15q1, by = "Term", all = TRUE)
df <- merge(df, CH15q2, by = "Term", all = TRUE)  # etc. for all the data.frames...
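With many quarters, the repeated merge() calls can be collapsed with Reduce(). A minimal sketch, assuming the quarterly data.frames are collected in a named list (the list and its names here are hypothetical); renaming each Freq column to its quarter first avoids merge() producing duplicate Freq.x/Freq.y columns:

# Hypothetical list of the quarterly data.frames, named by quarter
quarterly <- list("14q4" = CH14q4, "15q1" = CH15q1)  # etc. for all quarters
# rename each Freq column to its quarter label
quarterly <- Map(function(df, q) setNames(df, c("Term", q)),
                 quarterly, names(quarterly))
# full outer join across all quarters, then write the CSV
df_all <- Reduce(function(x, y) merge(x, y, by = "Term", all = TRUE), quarterly)
write.csv(df_all, "cooccurrence_by_quarter.csv", row.names = FALSE)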
UPDATE: Following Ken's advice in the comments below, I have tried doing it a different way, using the window argument of tokens_select() and then a document-feature matrix. I first label the corpus documents according to their quarter (a sketch of this is below), and the function COOCdfm then takes the tokenized corpus toks and produces a data.frame of the number of times words co-occur within a specified window of a term.
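The labelling might look like this; a minimal sketch, assuming each article's publication date is stored in a docvar (the docvar name "date" is an assumption):

# Label each document with its quarter, assuming a docvar "date"
# (hypothetical name) holding the article's publication date
dates <- as.Date(docvars(corp, "date"))
docvars(corp, "quarter") <- paste0(format(dates, "%y"), "q",
                                   (as.integer(format(dates, "%m")) - 1) %/% 3 + 1)
toks <- tokens(corp)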
COOCdfm <- function(toks, term, window) {
  ch_stop <- stopwords("zh", source = "misc")
  # keep only the term and the tokens within 'window' positions of it
  cooc_toks <- tokens_select(toks, term, window = window)
  cooc_toks <- tokens(cooc_toks, remove_punct = TRUE)
  cooc_toks <- tokens_remove(cooc_toks, ch_stop)
  dfmat <- dfm(cooc_toks)
  # sum the counts across all documents from the same quarter
  # (passing the docvar vector also works in newer quanteda versions,
  # where groups = "quarter" as a string is no longer accepted)
  dfmat_grouped <- dfm_group(dfmat, groups = docvars(dfmat, "quarter"))
  # transpose so features are rows and quarters are columns
  counts <- convert(t(dfmat_grouped), to = "data.frame")
  colnames(counts)[1] <- "Feature"
  counts
}
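Called on the labelled, tokenized corpus, for example (the window size of 5 is just an illustration):

counts_CH <- COOCdfm(toks, "中国", window = 5)
head(counts_CH)
write.csv(counts_CH, "china_cooccurrence_by_quarter.csv", row.names = FALSE)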