How to create a quanteda corpus from a data.frame with multiple columns for text?

Question

Let's say I have the following:

x10 = data.frame(id = c(1,2,3),vars =c('top','down','top'), 
     text1=c('this is text','so is this','and this is too.'),
     text2=c('we have more text here','and here too','and look at this, more text.'))

I want to create a dfm/corpus in quanteda using the following:

x1 = corpus(x10,docid_field='id',text_field=c(3:4),tolower=T)

Obviously this errors out because text_field only takes a single column. Is there a better way to handle this than building two corpuses? Can I build two and then merge them on id? Is that a thing?

We've opened [an issue](https://github.com/quanteda/quanteda/issues/1214) for this but which behaviour would be more natural: concatenate the texts, or repeat the id variables to stack the text columns (as in the answer below)? — Ken Benoit, Feb 14 '18 at 21:23
Yeah, I see your point. I think the more straightforward option would be to repeat the ID variables, as concatenating could be troublesome. For instance, we have employee surveys with 3 open-ended questions (positive exp, neg exp, anything else?) and combining those would be really weird. — Ted Mosby, Feb 14 '18 at 22:33
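
The two behaviours weighed in these comments can be sketched in base R on the toy data from the question. This is only an illustration of the difference, not how quanteda implements either one; the names `concatenated`, `stacked`, `text_cols`, and the `source` column are made up for the example.

```r
# Toy data from the question; stringsAsFactors = FALSE keeps text as character
x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# Option A: concatenate the text columns, giving one document per id
concatenated <- data.frame(x10[, c("id", "vars")],
                           text = paste(x10$text1, x10$text2),
                           stringsAsFactors = FALSE)

# Option B: stack the text columns, repeating id and vars for each column.
# "source" (a hypothetical name) records which column each row came from.
text_cols <- c("text1", "text2")
stacked <- do.call(rbind, lapply(text_cols, function(col) {
  data.frame(x10[, c("id", "vars")], source = col, text = x10[[col]],
             stringsAsFactors = FALSE)
}))

nrow(concatenated)  # one row per respondent
nrow(stacked)       # one row per respondent per text column
```

With survey-style data like the example in the comment above, Option B keeps each answer as its own document, so answers to different questions are never mixed.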

Ken Benoit · Accepted Answer · 2018-02-06T22:02:04.837

First, let's recreate your data.frame without factoring the character values:

x10 = data.frame(id = c(1,2,3), vars = c('top','down','top'), 
                 text1 = c('this is text', 'so is this', 'and this is too.'),
                 text2 = c('we have more text here', 'and here too', 'and look at this, more text.'),
                 stringsAsFactors = FALSE)

Then we have two options.

Method 1: Reshape to "long" format and create a single corpus

"Melt" the data first so there is a single text column, and then import it as a corpus. (An alternative is tidyr::gather().)

x10b <- reshape2::melt(x10, id.vars = c("id", "vars"), 
                       measure.vars = c("text1", "text2"),
                       variable.name = "doc_id", value.name = "text")

# because corpus() takes document names from row names, by default 
row.names(x10b) <- paste(x10b$doc_id, x10b$id, sep = "_")

x10b
#         id vars doc_id                         text
# text1_1  1  top  text1                 this is text
# text1_2  2 down  text1                   so is this
# text1_3  3  top  text1             and this is too.
# text2_1  1  top  text2       we have more text here
# text2_2  2 down  text2                 and here too
# text2_3  3  top  text2 and look at this, more text.

x10_corpus <- corpus(x10b)
summary(x10_corpus)
# Corpus consisting of 6 documents:
#     
#    Text Types Tokens Sentences id vars doc_id
# text1_1     3      3         1  1  top  text1
# text1_2     3      3         1  2 down  text1
# text1_3     5      5         1  3  top  text1
# text2_1     5      5         1  1  top  text2
# text2_2     3      3         1  2 down  text2
# text2_3     8      8         1  3  top  text2
# 
# Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/lse-my459/assignment-2/* on x86_64 by kbenoit
# Created: Tue Feb  6 19:06:07 2018
# Notes:
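
For the tidyr alternative mentioned above, a minimal sketch might look like the following. It assumes the tidyr and quanteda packages are installed. The column names `source` and `doc_id` and the object names `x10_long` and `x10_corpus_alt` are chosen for this example only. Instead of relying on row names, it builds the document names in a column and passes that column via `docid_field`, the same argument the question uses.

```r
library(quanteda)
library(tidyr)

x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# gather() stacks text1 and text2 into a single "text" column; the
# hypothetical "source" column records which column each row came from
x10_long <- gather(x10, key = "source", value = "text", text1, text2)

# Build unique document names explicitly, e.g. "text1_1", "text2_3"
x10_long$doc_id <- paste(x10_long$source, x10_long$id, sep = "_")

x10_corpus_alt <- corpus(x10_long, docid_field = "doc_id", text_field = "text")
docnames(x10_corpus_alt)
```

The remaining columns (`id`, `vars`, `source`) should be carried along as document variables, much as in the melted version.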

Method 2: Make two corpus objects and combine

Here, we create two corpus objects separately and combine them using the + operator.

x10_corpus2 <- 
    corpus(x10[, -which(names(x10)=="text2")], text_field = "text1") +
    corpus(x10[, -which(names(x10)=="text1")], text_field = "text2")
summary(x10_corpus2)
# Corpus consisting of 6 documents:
#     
#   Text Types Tokens Sentences id vars
#  text1     3      3         1  1  top
#  text2     3      3         1  2 down
#  text3     5      5         1  3  top
# text11     5      5         1  1  top
# text21     3      3         1  2 down
# text31     8      8         1  3  top
# 
# Source:  Combination of corpuses corpus(x10[, -which(names(x10) == "text2")], text_field = "text1") and corpus(x10[, -which(names(x10) == "text1")], text_field = "text2")
# Created: Tue Feb  6 19:14:14 2018
# Notes:

You could also at this stage use docnames(x10_corpus2) <- to reassign the docnames to be more like the first method.
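
As a sketch of that renaming step, the version below sets the document names on each corpus before combining them rather than afterwards. This is a variation on the approach above, not the original code. It rests on the assumption that some quanteda versions may refuse to combine corpora whose document names collide, in which case naming first avoids the problem. The names `c1` and `c2` are illustrative.

```r
library(quanteda)

x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                  text1 = c("this is text", "so is this", "and this is too."),
                  text2 = c("we have more text here", "and here too",
                            "and look at this, more text."),
                  stringsAsFactors = FALSE)

# One corpus per text column, keeping id and vars as document variables
c1 <- corpus(x10[, c("id", "vars", "text1")], text_field = "text1")
c2 <- corpus(x10[, c("id", "vars", "text2")], text_field = "text2")

# Give each document a unique name that mirrors Method 1
docnames(c1) <- paste("text1", x10$id, sep = "_")
docnames(c2) <- paste("text2", x10$id, sep = "_")

x10_corpus2 <- c1 + c2
docnames(x10_corpus2)
```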

Ah! I thought about building two corpuses and combining. Is there a benefit to either method? Let's say I had to scale this to 20k+ comments, is one method more scalable? — Ted Mosby, Feb 06 '18 at 21:40
I'd probably use Method 1. 20k+ will work without any problems. — Ken Benoit, Feb 06 '18 at 21:43
Great! and one last question in Method 1, the rownames(x) <- doesn't really have a purpose other than to just label the row right? I'm not missing anything there right? — Ted Mosby, Feb 06 '18 at 21:51
I just added a comment in the code to explain why. It's because the `corpus()` call will use those automatically for the docnames. — Ken Benoit, Feb 06 '18 at 22:02
