Is there a way to concatenate two dfm matrices that differ in both their number of rows and their number of columns? It can be done with some additional coding, so I am not looking for ad hoc code but for a general and elegant solution, if one exists.

An example:

dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)

gives an error.

The 'tm' package can concatenate its own matrices out of the box, but it is too slow for my purposes.

Also recall that 'dfm' from 'quanteda' is an S4 class.

mv_

1 Answer

This should work "out of the box" if you are using the latest version:

packageVersion("quanteda")
## [1] ‘0.9.6.9’

dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)

rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
##      is one sample surprise text this
## doc1  1   1      2        0    1    1
## doc2  1   1      2        1    1    1

See also ?selectFeatures where features is a dfm object (there are examples in the help file).
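
To illustrate what aligning two matrices on a common feature set involves, here is a minimal base-R sketch using ordinary dense matrices. The helper name `rbind_aligned` is hypothetical and is not part of quanteda; it only mimics the idea of padding each matrix to the union of the column names before binding, and makes no claim about how quanteda implements this internally.

```r
# Hypothetical helper (not a quanteda function): row-bind two count
# matrices after padding each one to the union of their column names.
rbind_aligned <- function(a, b) {
  # Common feature set: every column name seen in either matrix
  feats <- sort(union(colnames(a), colnames(b)))

  # Expand a matrix to the common feature set, filling missing columns with 0
  pad <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m
    out
  }

  rbind(pad(a), pad(b))
}

# Counts typed in by hand to match the two example documents
m1 <- matrix(c(1, 1, 2, 1, 1), nrow = 1,
             dimnames = list("doc1", c("is", "one", "sample", "text", "this")))
m2 <- matrix(c(1, 1, 2, 1, 1, 1), nrow = 1,
             dimnames = list("doc2", c("is", "one", "sample", "surprise", "text", "this")))

rbind_aligned(m1, m2)
```

This dense version is only meant to show the alignment step; for real corpora a sparse representation, as used by dfm objects, would be the more sensible choice.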

Added:

Note that this will correctly align the two texts in a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:

library(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.

This almost gets it, but the repeated feature ends up split across two columns ("sample" and "sample."):

as.matrix(rbind(c(dtm1, dtm2)))
##     Terms
## Docs one sample sample. text this surprise!
##    1   1      1       1    1    1         0
##    1   1      1       1    1    1         1
Ken Benoit
  • thanks! it simplifies the code, a very useful option. – mv_ May 23 '16 at 09:10
  • When using 'tm' you should refer to the dispatching of the function 'c'. This package uses its own code to perform such special binding. The "base R" itself cannot do that (because of 'rbind's limitations). So `c(dtm1, dtm2)` will work correctly. By the way, 'tm' DFM may be created from 'quanteda' DFM directly, but not vice versa, as far as I know. – mv_ May 23 '16 at 12:41