
There is documentation on creating a DTM (document-term matrix) with the text2vec package, for example the following, where TF-IDF weighting is applied after building the matrix:

library(text2vec)
data("movie_review")
N <- 1000
# first pass over the documents: build the vocabulary
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)
# second pass: build the document-term matrix
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
dtm <- create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
dtm_tfidf <- transformer_tfidf(dtm)

It is common practice to create a DTM from a training dataset and use it as input to a model. Then, when new data arrives (a test set), one needs to build the same DTM on the new data, meaning one with exactly the terms that were used in the training set. Is there any way in the package to transform a new dataset in this manner? (In scikit-learn there is a transform method for exactly this case.)

Dmitriy Selivanov
B_Miner
  • Any guts to explain why downvoted? – B_Miner Aug 26 '16 at 23:40
  • Doesn't take "guts" to explain. Just hover your pointer over the down-arrow. Many possibilities for taking offense at that question. No code. No example data. No explanation of what sort of model building or scoring might be anticipated. No explanation of what a "DTM" is.... yeah, I know what it (probably) is, but many programmers might not. So I have to agree it is at the very least "unclear" and also not much `research effort` demonstrated, and those are perfectly valid reasons for a downvote. Also seems rather useless unless it is expanded to address its many deficiencies. – IRTFM Aug 27 '16 at 00:16
  • Hovering doesn't show me anything. It's frustrating when someone downvotes without explanation. This type of question is focused on a specific package, where my assumption is that if you have used it and could remotely answer the question, you would know what a document-term matrix is. Code is also a tough one, since what I am wondering isn't "here is some code, what is wrong with it" but instead "can the package do 'x'". I will try to expand it somewhat though, and I thank you for the comments. – B_Miner Aug 27 '16 at 00:48
  • I wasn't the downvoter. When I downvote I usually leave an explanation. I agree it's frustrating not to see an explanation, but it's also frustrating to see _many_ questions by people who obviously have not read the help and introductory pages for SO. – IRTFM Aug 27 '16 at 02:27

1 Answer

Actually, when I started text2vec I had this pipeline in mind from the very beginning. We are now preparing a new release with updated documentation.

For v0.3 the following should work:

library(text2vec)
library(magrittr)  # provides %>%, in case text2vec does not re-export it
data("movie_review")
train_rows = 1:1000
prepr = tolower
tok = word_tokenizer

it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 5)

vectorizer <- vocab_vectorizer(v)
it <- itoken(movie_review$review[train_rows], prepr, tok)
dtm_train <- create_dtm(it, vectorizer)
# get idf scaling from train data
idf = get_idf(dtm_train)
# create tf-idf
dtm_train_tfidf <- transform_tfidf(dtm_train, idf)

test_rows = 1001:2000
# create iterator
it <- itoken(movie_review$review[test_rows], prepr, tok, ids = movie_review$id[test_rows])
# create dtm using same vectorizer, but new iterator
dtm_test_tfidf <- create_dtm(it, vectorizer) %>% 
  # transform  tf-idf using idf from train data
  transform_tfidf(idf)
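
To make the idea behind reusing `vectorizer` and `idf` concrete, here is a minimal base-R sketch of the same fit-on-train, transform-on-test pattern. It does not use text2vec and is not how the package is implemented. The toy documents, the helper names `tokenize` and `make_dtm`, and the plain `log(N / df)` IDF formula are all assumptions made for illustration, and the weighting need not match what text2vec computes.

```r
# Toy corpora (made up for illustration).
train_docs <- c("the movie was great", "the plot was dull")
test_docs  <- c("great movie", "a truly awful sequel")

# Hypothetical helper: lowercase, then split on anything that is not a letter.
tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")

# "Fit": the vocabulary is taken from the training documents only.
vocab <- sort(unique(unlist(tokenize(train_docs))))

# "Transform": count vocabulary terms per document.
# Terms that are not in the vocabulary are silently dropped.
make_dtm <- function(docs, vocab) {
  counts <- t(vapply(
    tokenize(docs),
    function(tokens) as.numeric(table(factor(tokens, levels = vocab))),
    numeric(length(vocab))
  ))
  colnames(counts) <- vocab
  counts
}

dtm_train <- make_dtm(train_docs, vocab)
dtm_test  <- make_dtm(test_docs, vocab)   # same columns as dtm_train

# IDF is estimated on the training matrix only (simple log(N / df) variant).
idf <- log(nrow(dtm_train) / colSums(dtm_train > 0))

# The same training IDF scales both matrices, column by column.
tfidf_train <- sweep(dtm_train, 2, idf, `*`)
tfidf_test  <- sweep(dtm_test,  2, idf, `*`)
```

Both matrices end up with identical columns in the same order, so a model fitted on `tfidf_train` can score `tfidf_test` directly. The second test document contains no training terms, so its row is all zeros.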
Dmitriy Selivanov