I have a list of character vectors that hold tokens for documents.
list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too"))
I would like to transform this list into a quanteda tokens (or dfm) object in order to make use of some of quanteda's functionality.
What's the best way to do this?
I realize I could do something like the following for each document:
tokens(paste0(c("I", "like", "apples"), collapse = " "), what = "fastestword")
Which gives:
Tokens consisting of 1 document.
text1 :
[1] "I" "like" "apples"
But this feels like a hack, and it is also unreliable because some of my tokens contain whitespace. Is there a smoother way to convert these data structures?