
I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (so stock embeddings such as those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?

Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):

d = a list containing the new data; each element is of class character

vecs = the word vectors obtained from text2vec's implementation of GloVe
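
To make the shapes concrete, here is a purely hypothetical stand-in for these objects (made-up terms and dimensions, since I can't share the real data):

  # hypothetical example inputs, for illustration only
  d    <- list("flange gasket torque spec", "gasket crimp tolerance check")
  vecs <- matrix(rnorm(6 * 50), nrow = 6,   # one 50-dimensional vector per term
                 dimnames = list(c("flange", "gasket", "torque",
                                   "spec", "crimp", "tolerance"), NULL))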

  new_vecs <- sapply(d, function(y){             
                    it <- itoken(word_tokenizer(y), progressbar = FALSE) # for each statement, create an iterator
                    voc <- create_vocabulary(it, stopwords = tm::stopwords()) # for each document, create a vocabulary
                    vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>% # subset vecs for the words in the new document, then
                    colMeans # find the average vector for each document
                    }) %>% t # close the function and sapply, then transpose to return a matrix with one row per statement

For my use case, I need to keep the results separate for each document, so anything that involves pasting together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.

Any help will be greatly appreciated.

user2047457

1 Answer


You need to do this in "batch" mode using efficient linear-algebra matrix operations. The idea is to build a document-term matrix for the documents in d; this matrix records how many times each word appears in each document. Then you just need to multiply the dtm by the matrix of embeddings:

library(text2vec)
# we are only interested in words that appear in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>% 
  create_dtm(vectorizer)

# normalize - calculate term frequency, i.e. divide the count of each word 
# in a document by the total number of words in that document. 
# This way we end up with the average of the word vectors (not the sum of word vectors!)
dtm = normalize(dtm, "l1")
# and now we can calculate the vector for each document (the average of its word vectors)
# as the dot product of the dtm and the embeddings matrix
document_vecs = dtm %*% vecs
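
If you then want to compare the new documents to one another (or to previously embedded documents), one option is cosine similarity via sim2() from text2vec; a minimal sketch:

# sketch: cosine similarity between the new document vectors
doc_sim = sim2(as.matrix(document_vecs), method = "cosine", norm = "l2")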
Dmitriy Selivanov
  • Hi, I hope you don't mind my reopening this discussion. A similar solution of getting document embeddings using pre-trained word embeddings was offered by Silge and Hvitfeldt (https://smltar.com/embeddings.html#exploring-cfpb-word-embeddings), but why exactly does it work? For a term-document matrix M factorized as UΣV^t, word embeddings = UΣ, and document embeddings = ΣV^t. However, according to the solution above, document embeddings = M^t (transposed in order to get document-term matrix) * word embeddings (UΣ). But (UΣV^t)^t * UΣ = V * Σ^2 (and not ΣV^t, as expected). – LocusClassicus Jun 05 '23 at 22:18