Problem
I am trying to use GloVe to represent an entire document. However, GloVe was originally designed to produce word embeddings. One way to get a document embedding is to take the average of the embeddings of all words in the document.
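For example (a toy sketch with made-up 300-dimensional vectors, not real GloVe embeddings), the document vector is just the element-wise mean of its word vectors:

    import numpy as np

    # Toy sketch: three made-up 300-d "word embeddings" stand in for real GloVe vectors.
    word_vectors = [np.random.rand(300) for _ in range(3)]

    # The document embedding is the element-wise mean of the word embeddings.
    doc_vector = np.mean(word_vectors, axis=0)  # shape: (300,)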
I am following the solution posted here to load the GloVe look-up table. However, when I try to compute the document embeddings, the runtime is extremely slow (about 1 s per document, for more than 1 million documents).
I am wondering if there is any way I could accelerate this process.
The GloVe look-up table can be downloaded here, and the following is the code I use to compute the document embeddings. The data is stored in a pd.DataFrame with a review column. Note that some words in text_processed_list may not be present in the look-up table, which is why the try...except block is there.
    import csv
    import string

    import numpy as np
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Tokens to discard: English stop words and punctuation.
    remove_list = stopwords.words('english') + list(string.punctuation)

    # data is the pd.DataFrame with the "review" column; X holds one 300-d vector per document.
    dataset_size = len(data)
    X = np.zeros((dataset_size, 300))

    # Load the GloVe look-up table as a DataFrame indexed by word.
    glove_model = pd.read_table("glove.42B.300d.txt", sep=" ", index_col=0,
                                header=None, quoting=csv.QUOTE_NONE)

    for i in range(dataset_size):
        text = data.loc[i, "review"]
        text_processed_list = [word for word in word_tokenize(text.lower())
                               if word not in remove_list]
        # Sum the embeddings of the words found in the look-up table, then average.
        for word in text_processed_list:
            try:
                X[i] += glove_model.loc[word].values
            except KeyError:
                pass  # word not in the GloVe vocabulary
        X[i] /= len(text_processed_list)
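My guess is that the per-word glove_model.loc[word] lookup on the DataFrame is the bottleneck. Would converting the table into a plain Python dict once up front and averaging with np.mean, roughly like the untested sketch below (word_to_vec is just a name I made up), be the right direction, or is there a better way?

    # Rough sketch, untested: build the dict once so each word lookup is a hash lookup
    # instead of a DataFrame .loc call.
    word_to_vec = dict(zip(glove_model.index, glove_model.values))

    for i in range(dataset_size):
        text = data.loc[i, "review"]
        tokens = [w for w in word_tokenize(text.lower()) if w not in remove_list]
        vectors = [word_to_vec[w] for w in tokens if w in word_to_vec]  # skip OOV words
        if vectors:  # guard against reviews with no in-vocabulary words
            X[i] = np.mean(vectors, axis=0)

(Note this averages over only the in-vocabulary words, which is slightly different from dividing by the full token count as my code above does.)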