
I am trying to replicate Arora et al. (2017) (https://github.com/PrincetonML/SIF, https://openreview.net/forum?id=SyK00v5xx) using text2vec. The authors compute sentence embeddings by averaging word embeddings and subtracting the first principal component.

Thanks to the author of text2vec I can compute the GloVe embeddings and average them. The next step is to compute the principal components/SVD and subtract the first component from the embeddings.

I can compute the SVD using the irlba package (which I believe is also used in text2vec). However, I am stuck on how exactly to subtract the principal component from the averaged word embeddings.

The Python code (https://github.com/PrincetonML/SIF/blob/master/src/SIF_embedding.py) from the paper has the following function:

def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc == 1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX
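
For reference, a direct R translation of that function could look roughly like the sketch below, assuming `pc` holds the principal axes as rows (an npc x ncol(X) matrix, as sklearn's `components_` does in the Python version):

# sketch of an R equivalent of remove_pc; pc is assumed to hold the
# principal axes as rows (npc x ncol(X)), mirroring the Python code
remove_pc <- function(X, pc) {
  # (X %*% t(pc)) is n x npc: the projection coefficients of each row of X;
  # multiplying by pc maps them back into the original space
  X - (X %*% t(pc)) %*% pc
}

In R a single matrix expression covers both branches of the Python code, because an n x 1 coefficient matrix times a 1 x d component matrix already yields the n x d projection that the Python version builds with broadcasting.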

My R code is

# get the word vectors (wv_main comes from an earlier glove$fit_transform(tcm) call)
wv_context = glove$components
word_vectors = wv_main + t(wv_context)

# create the document-term matrix
dtm = create_dtm(it, vectorizer)

# keep only the terms that have an embedding
common_terms = intersect(colnames(dtm), rownames(word_vectors))

# l1-normalise the rows so that a matrix product averages the word vectors
dtm_averaged = text2vec::normalize(dtm[, common_terms], "l1")
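
The averaged sentence embeddings (the 1K x 300 matrix mentioned below) then come from a single matrix product, roughly like this (the name `sentence_embeddings` is just illustrative):

# rows of the l1-normalised dtm sum to 1, so this product averages the
# word vectors of the terms occurring in each sentence
sentence_embeddings = dtm_averaged %*% word_vectors[common_terms, ]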

If, for example, I have 1K sentences x 300 variables and run the irlba function, I get three matrices (u, d, v). These have, for example, 4 components x 1K observations.

How do I transform the output from this function (1K by x variables/components) so I can subtract the component from the sentence embeddings (1K x 300 variables)?

Thanks!

1 Answer


The idea is that with truncated SVD/PCA you can reconstruct the original matrix with minimal squared error. So you get the SVD in the form (U, D, V), and the reconstruction of the original matrix is A ~ U * D * t(V). Now we subtract this reconstruction from the original matrix - this is exactly what the authors proposed. Here is an example:

library(text2vec)
library(irlba)
data("movie_review")

it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, hash_vectorizer(2**14))

lsa = LSA$new(n_topics = 64)
doc_emb = lsa$fit_transform(dtm)

# truncated SVD of the document embeddings, keeping only the first component
doc_emb_svd = irlba(doc_emb, nv = 1)

# rank-1 reconstruction (irlba returns d as a vector, so scale explicitly)
doc_emb_pc1 = doc_emb_svd$d[1] * doc_emb_svd$u %*% t(doc_emb_svd$v)
doc_emb_minus_pc1 = doc_emb - doc_emb_pc1
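
As a quick sanity check, the corrected embeddings should have (numerically) no component left along the removed singular vector:

# projection of the corrected embeddings onto the removed direction;
# this should be close to zero up to numerical precision
max(abs(doc_emb_minus_pc1 %*% doc_emb_svd$v))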

If you get a chance to finish your implementation, please consider contributing it to text2vec - here is the ticket for Arora sentence embeddings: https://github.com/dselivanov/text2vec/issues/157.

Dmitriy Selivanov
  • Thanks very much for your comment. I can't reproduce the example. I think a step is missing to compute the actual SVD, correct? I tried to compute the SVD using `doc_emb_svd <- irlba(doc_emb)` and I can obtain the decomposition. I read that the principal components correspond to matrix U, so that would mean I need to subtract the first column of matrix U from doc_emb. Is that correct? doc_emb is 5000 rows x 64 cols and doc_emb_svd$u[,1] is 5000 rows x 1 col. Can I then just do doc_emb - doc_emb_svd$u[,1]? Thanks for your help. My matrix operation skills are clearly subpar. – user2300301 Jan 22 '18 at 09:20