
What is the syntax in text2vec to vectorize texts and obtain a dtm restricted to an indicated list of words?

How do I vectorize texts and produce a document-term matrix containing only the indicated features? If a feature does not appear in a text, its column should stay empty (zero).

I need to produce document-term matrices with exactly the same columns as in the dtm I trained the model on; otherwise I cannot apply a random forest model to new documents.

Jacek Kotowski
  • You can run text2vec directly on an `fcm` created in **quanteda**, and thereby use all of the **quanteda** feature selection tools. If your question is about how to select items from **text2vec** output, you will need to phrase that part of the question more clearly. Generally, good SO questions state the question clearly at the beginning and provide context after, and only if it is necessary for answering the question. A lot of what you ask here distracts from that, since I'm not sure which part you need an answer to. – Ken Benoit Jul 28 '17 at 13:14
  • Sorry. I made it succinct. I hope it is acceptable; also my English is not native. – Jacek Kotowski Jul 28 '17 at 13:26
  • @KenBenoit I find it very interesting that quanteda and text2vec objects can be used interchangeably. On the other hand, I could not find a simple and clear example of how to make the text mining packages conform to standard data mining packages, which requires producing data with features matching exactly the features in the learning sets. – Jacek Kotowski Jul 28 '17 at 14:07

2 Answers


You can create a document-term matrix from a specific set of features only:

library(text2vec)
# `it` is an itoken() iterator over your documents
v = create_vocabulary(c("word1", "word2"))
vectorizer = vocab_vectorizer(v)
dtm_test = create_dtm(it, vectorizer)

However, I don't recommend (1) using random forest on such sparse data - it won't work well - or (2) performing feature selection the way you described - you will likely overfit.
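A fuller, self-contained sketch of the same idea (the texts and words below are made up for illustration):

```r
library(text2vec)

train_txt <- c("word1 word2 word3", "word2 word4")
new_txt   <- c("word1 word5 word5")   # word5 is not in the fixed vocabulary

# fix the feature set up front
v <- create_vocabulary(c("word1", "word2"))
vectorizer <- vocab_vectorizer(v)

# training DTM
it_train  <- itoken(train_txt, tokenizer = word_tokenizer)
dtm_train <- create_dtm(it_train, vectorizer)

# new documents get exactly the same columns; words outside the
# vocabulary are dropped, and absent words stay at zero
it_new  <- itoken(new_txt, tokenizer = word_tokenizer)
dtm_new <- create_dtm(it_new, vectorizer)
```

Because both matrices come from the same `vectorizer`, the columns of `dtm_train` and `dtm_new` are identical, which is what a fitted model needs at prediction time.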

Dmitriy Selivanov
  • After removing stopwords it is not that bad. I am wondering about also removing words that appear only once or twice in the whole dataset (rare occurrences). I am also thinking about how to remove all features of small variance across the documents, so the number of features can be reduced even before RF(?). Then I look at the feature importance from the random forest and take only the top n most important features. Then I rerun RF only with those features. – Jacek Kotowski Jul 28 '17 at 14:14
  • Please check the tutorial http://text2vec.org/vectorization.html . It covers creating another DTM in a given vector space, and also pruning rare words. – Dmitriy Selivanov Jul 28 '17 at 15:29
  • Thanks, I will read it over the weekend. I am also reading that RF is indeed not a good idea for sparse matrices; thanks for your remark on that. For now I will play with RF and then change to a better one. – Jacek Kotowski Jul 28 '17 at 15:31

I need to produce document-term matrices with exactly the same columns as in the dtm I trained the model on; otherwise I cannot apply a random forest model to new documents.

In quanteda you can set the features of a test set identical to those of a training set using dfm_select(). For example, to make dfm1 below have identical features to dfm2:

txts <- c("a b c d", "a a b b", "b c c d e f")

(dfm1 <- dfm(txts[1:2]))
## Document-feature matrix of: 2 documents, 4 features (25% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c d
##   text1 1 1 1 1
##   text2 2 2 0 0
(dfm2 <- dfm(txts[2:3]))
## Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c d e f
##   text1 2 2 0 0 0 0
##   text2 0 1 2 1 1 1

dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE)
## kept 4 features, padded 2 features
## Document-feature matrix of: 2 documents, 6 features (50% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c d e f
##   text1 1 1 1 1 0 0
##   text2 2 2 0 0 0 0

For feature-context matrices (what text2vec needs for input), however, this will not work, because the co-occurrences (at least those computed with a window rather than a document context) are interdependent across features, so you cannot simply add and remove them in the same way.
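To connect this back to the modelling step: after dfm_select(), the test dfm has exactly the training columns in the same order, which is the property predict() needs on new documents. A short sketch using the same quanteda calls as above (the `rf_model` in the final comment is hypothetical, not shown being fitted):

```r
library(quanteda)

txts <- c("a b c d", "a a b b", "b c c d e f")
dfm_train <- dfm(txts[2:3])
dfm_test  <- dfm_select(dfm(txts[1:2]), dfm_train, valuetype = "fixed")

# identical feature sets in identical order after padding
identical(featnames(dfm_train), featnames(dfm_test))

# so something like predict(rf_model, newdata = as.matrix(dfm_test))
# would see the column layout the model was trained on
```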

Ken Benoit
  • Thanks for your answer. I will experiment with both text2vec and quanteda. I am a beginner in text mining: I understand the general terms and am trying to dive into the more complicated parts, so I may have overlooked some obvious answers I have already read and asked the same thing again. Dory fish I am. – Jacek Kotowski Jul 28 '17 at 15:35