I am trying to replicate Arora et al. 2017 (https://github.com/PrincetonML/SIF / https://openreview.net/forum?id=SyK00v5xx) using text2vec. The authors compute sentence embeddings by averaging word embeddings and then removing the projection on the first principal component.
Thanks to the author of text2vec I can compute the GloVe embeddings and average them. The next step is to compute the principal components via SVD and remove the projection on the first component from the averaged embeddings.
I can compute the SVD using the irlba package (which I believe text2vec also uses); however, I am stuck on how exactly to subtract the principal component from the averaged word embeddings.
The Python code (https://github.com/PrincetonML/SIF/blob/master/src/SIF_embedding.py) from the paper has the following function:
def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc == 1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX
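Here is my attempt at translating this to R, assuming irlba's right singular vectors (the v matrix) play the role of the output of compute_pc (the Python code uses sklearn's TruncatedSVD, which, like irlba, does not centre the matrix). I am not sure this is correct:

library(irlba)

# X: matrix of averaged sentence embeddings, n_sentences x n_dims
remove_pc <- function(X, npc = 1) {
  # right singular vectors of the uncentred matrix = the principal
  # components, matching sklearn's TruncatedSVD in compute_pc
  pc <- irlba(X, nv = npc)$v           # n_dims x npc
  # subtract the projection of each row onto those components
  X - X %*% pc %*% t(pc)
}

If the dimensions work out the way I think, the two branches of the Python code collapse into this single expression: with npc = 1, pc is 300 x 1, X %*% pc is 1K x 1, and (X %*% pc) %*% t(pc) is 1K x 300, so it can be subtracted from X elementwise.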
My R code so far is:
# get the word vectors (wv_main comes from glove$fit_transform(tcm))
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)
# create the document-term matrix
dtm <- create_dtm(it, vectorizer)
# keep only the terms for which we have an embedding
common_terms <- intersect(colnames(dtm), rownames(word_vectors))
# l1-normalise the rows so the matrix product below is an average
dtm_averaged <- text2vec::normalize(dtm[, common_terms], "l1")
# averaged sentence embeddings, n_sentences x n_dims
sentence_embeddings <- dtm_averaged %*% word_vectors[common_terms, ]
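Putting it together, I assume the removal step would then be applied to these averaged embeddings, along these lines (using the remove_pc sketched above, untested):

# coerce to a dense base matrix, n_sentences x n_dims
sentence_embeddings <- as.matrix(sentence_embeddings)
# remove the projection on the first principal component
sentence_embeddings_sif <- remove_pc(sentence_embeddings, npc = 1)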
If, for example, I have 1K sentences x 300 dimensions and run the irlba function on that matrix, I get three matrices (u, d and v); with 4 components, for instance, u is 1K observations x 4 components.
How do I transform the output of this function (1K observations by a few components) so I can subtract the components from the sentence embeddings (1K x 300)?
Thanks!