Problem
I am trying to use GloVe to represent an entire document. However, GloVe was originally designed to produce word embeddings. One way to get a document embedding is to take the average of the embeddings of all words in the document.
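For example (a toy sketch with made-up 300-dimensional vectors, not real GloVe embeddings), the document vector is just the element-wise mean of its word vectors:

    import numpy as np

    # Toy sketch: three made-up 300-d "word embeddings" stand in for real GloVe vectors.
    word_vectors = [np.random.rand(300) for _ in range(3)]

    # The document embedding is the element-wise mean of the word embeddings.
    doc_vector = np.mean(word_vectors, axis=0)  # shape: (300,)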
I am following the solution posted here to load the GloVe look-up table. However, when I try to compute the document embeddings, the runtime is extremely slow (about 1 s per document, for more than 1 million documents).
I am wondering if there is any way I could accelerate this process.
The GloVe look-up table can be downloaded here, and the following is the code I use to compute the document embeddings. The data is stored in a pd.DataFrame with a review column. Note that some words in text_processed_list may not be present in the look-up table, which is why the try...except block is there.
    import csv
    import string

    import numpy as np
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Tokens to discard: English stop words and punctuation.
    remove_list = stopwords.words('english') + list(string.punctuation)

    # data is the pd.DataFrame with the "review" column; X holds one 300-d vector per document.
    dataset_size = len(data)
    X = np.zeros((dataset_size, 300))

    # Load the GloVe look-up table as a DataFrame indexed by word.
    glove_model = pd.read_table("glove.42B.300d.txt", sep=" ", index_col=0,
                                header=None, quoting=csv.QUOTE_NONE)

    for i in range(dataset_size):
        text = data.loc[i, "review"]
        text_processed_list = [word for word in word_tokenize(text.lower())
                               if word not in remove_list]
        # Sum the embeddings of the words found in the look-up table, then average.
        for word in text_processed_list:
            try:
                X[i] += glove_model.loc[word].values
            except KeyError:
                pass  # word not in the GloVe vocabulary
        X[i] /= len(text_processed_list)
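My guess is that the per-word glove_model.loc[word] lookup on the DataFrame is the bottleneck. Would converting the table into a plain Python dict once up front and averaging with np.mean, roughly like the untested sketch below (word_to_vec is just a name I made up), be the right direction, or is there a better way?

    # Rough sketch, untested: build the dict once so each word lookup is a hash lookup
    # instead of a DataFrame .loc call.
    word_to_vec = dict(zip(glove_model.index, glove_model.values))

    for i in range(dataset_size):
        text = data.loc[i, "review"]
        tokens = [w for w in word_tokenize(text.lower()) if w not in remove_list]
        vectors = [word_to_vec[w] for w in tokens if w in word_to_vec]  # skip OOV words
        if vectors:  # guard against reviews with no in-vocabulary words
            X[i] = np.mean(vectors, axis=0)

(Note this averages over only the in-vocabulary words, which is slightly different from dividing by the full token count as my code above does.)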