I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word-vector binary file using gensim, but I don't know how to do it when the file is in text format.
14 Answers
GloVe model files are in a word-vector text format: each line starts with a word followed by its vector components. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:
import numpy as np

def load_glove_model(glove_file):
    print("Loading Glove Model")
    glove_model = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model
You can then access the word vectors from the returned dictionary:

glove_model = load_glove_model("glove.6B.50d.txt")
print(glove_model['hello'])
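Since the loader returns a plain dict of numpy arrays, a quick sanity check is to compare two words with cosine similarity. A minimal sketch, assuming glove_model was loaded as above:

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# related words should score noticeably higher than unrelated ones
print(cosine_similarity(glove_model['king'], glove_model['queen']))
print(cosine_similarity(glove_model['king'], glove_model['potato']))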

- I'm wondering if there is a faster way of doing this. I'm using code similar to the above, but it would take around 27 hours to process the whole 6-billion-token embeddings. Any ideas of how to do this faster? – Edward Burgin Aug 15 '17 at 12:31
- @EdwardBurgin, it is taking me 1 minute to complete the whole file. Please share the "similar code" that you are referring to in your comment. – TheRajVJain Oct 30 '17 at 09:33
- Running `python test_glove.py` prints "Loading Glove Model ... 400000 words loaded!" and then fails with `NameError: name 'model' is not defined` at `print(model['hello'])`. – Mona Jalal Apr 23 '18 at 03:56
- @MonaJalal Do `model = load_glove_model("filename.txt")` first, then the print statement will work fine. – Ritwik Mar 09 '19 at 16:45
- This doesn't work for me on Python 3 using the 2.8B Twitter pretrained GloVe vectors because Python doesn't handle `"\xC2\x85"` properly. – jchook Aug 21 '19 at 20:14
- @jchook Open the file with `open(glove_file, 'r', encoding='utf-8')` and it will work. – Naveen Dennis Jan 31 '20 at 14:01
You can do it much faster with pandas:
import csv
import numpy as np
import pandas as pd

# glove_data_file is the path to your GloVe .txt file
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
Then to get the vector for a word:
def vec(w):
    # .as_matrix() was removed in newer pandas; .to_numpy() is the current equivalent
    return words.loc[w].to_numpy()
And to find the closest word to a vector:
words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name
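As a usage sketch, the classic analogy test can be run with these two helpers (assuming the GloVe file was loaded into words as above):

# the nearest neighbour of king - man + woman is often 'queen'
# (or 'king' itself, since the query word is not excluded here)
print(find_closest_word(vec('king') - vec('man') + vec('woman')))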
- Although the time to load the model reduces by almost half, the access time increases by 1000x (`.loc` versus dict access). Personally, I would prefer lower access time, because that will affect the training time. Since building the model is a one-time effort, it's better to invest the time there and save it once and for all. Do correct me if I'm wrong. – TheRajVJain Oct 30 '17 at 09:31
- You should use a couple more arguments in `read_table`: `na_values=None, keep_default_na=False`. Otherwise it will consider many valid strings (e.g. 'null', 'NA', etc.) as `nan` floating point values. – Eli Korvigo Feb 16 '18 at 23:19
- `read_table` is deprecated. Use `read_csv` with the same parameters instead. – Artur Pschybysz Apr 12 '19 at 11:43
I suggest using gensim to do everything. You can read the file, and you also benefit from the many methods already implemented in this great package.
Suppose you generated GloVe vectors using the C++ program and that your "-save-file" parameter is "vectors". The GloVe executable will generate two files, "vectors.bin" and "vectors.txt".
Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")
Finally, read the word2vec txt to a gensim model using KeyedVectors:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
Now you can use gensim word2vec methods (for example, similarity) as you'd like.
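For instance, with the standard KeyedVectors API:

# similarity between two words, and nearest neighbours of a word
print(glove_model.similarity('woman', 'man'))
print(glove_model.most_similar('frog', topn=5))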

- It looks like glove2word2vec gives the warning `This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function`. I guess the gensim function needs to be updated. – user1700890 Sep 16 '19 at 16:20
- This warning is gone in version 3.8.3 of gensim. `glove2word2vec()` is 1000% the way to go. – dmn Feb 23 '21 at 15:02
I found this approach faster.
import pandas as pd
df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in df.T.items()}
Save the dictionary:
import pickle

with open('glove.840B.300d.pkl', 'wb') as fp:
    pickle.dump(glove, fp)
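Loading the pickled dictionary back is then nearly instant compared to re-parsing the text file:

import pickle

# deserialize the word -> vector dictionary saved above
with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)
print(glove['hello'].shape)  # (300,)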

Here's a one-liner if all you want is the embedding matrix:

np.loadtxt(path, usecols=range(1, dim + 1), comments=None)

where path is the path to your downloaded GloVe file and dim is the dimension of the word embedding.
If you want both the words and the corresponding vectors, you can do

glove = np.loadtxt(path, dtype='str', comments=None)

and separate the words and vectors as follows:

words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
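If a word-to-vector lookup is what you ultimately need, the two arrays zip straight into a dict:

# build a {word: vector} mapping from the two arrays above
glove_dict = dict(zip(words, vectors))
print(glove_dict['hello'])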

Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes a while (147.2s on my machine).
What helps is converting the text file first into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file which contains the embedding vectors as a numpy structure (e.g. embeddings.npy).
Once converted, it takes me only 4.96s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you loaded it from the text file. It is just as efficient in access time and does not require any additional frameworks, but it is a lot faster in loading time.
With this code you convert your embedding text file to the two new files:
import codecs
import numpy as np

def convert_to_binary(embedding_path):
    f = codecs.open(embedding_path + ".txt", 'r', encoding='utf-8')
    wv = []
    with codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])
    np.save(embedding_path + ".npy", np.array(wv))
And with this method you load it efficiently into memory:

def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))
    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]
    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]
    return word_embedding_map
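A usage sketch (both functions append the file suffixes themselves, so pass the base name without .txt):

convert_to_binary('glove.42B.300d')  # one-time conversion; writes .vocab and .npy
embeddings = load_word_emb_binary('glove.42B.300d')  # fast load on every later run
print(embeddings['hello'])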
Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.

Python 3 version which also handles bigrams and trigrams: the word is taken to be everything except the last vector_size tokens on the line, so multi-word keys such as "new york" are joined back together instead of being mistaken for vector components.
import numpy as np

def load_glove_model(glove_file):
    print("Loading Glove Model")
    model = {}
    vector_size = 300
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            # everything before the final vector_size floats is the (possibly multi-word) key
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model
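A usage sketch with a hypothetical embedding file that contains multi-word entries:

model = load_glove_model('entity_vectors.300d.txt')  # hypothetical file name
print(model.get('new york'))  # multi-word keys survive the split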

- Could you add a short description about how it handles the bigrams, please? – Srichakradhar Feb 20 '21 at 00:12
import os
import numpy as np

EMBEDDING_DIM = 100  # set to the dimension of the file you downloaded (50, 100, 200 or 300)

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
# enter the path where you unzipped the glove file
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM)) as f:
    # it is just a space-separated text file in the format:
    # word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))
This code takes some time to store the GloVe embeddings on a shelf, but loading it afterwards is quite fast compared to the other approaches.
import os
import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path, shelf):
    print('Loading Glove')
    with open(os.path.join(glove_file_path)) as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path, shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path, shelf_file_name + '.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here, which is a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name + '.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))

Each corpus needs to start with a line containing the vocab size and the vector size, in that order. Open the .txt file of the GloVe model and insert a new first line with these two numbers: for example, for glove.6B.50d.txt, just add 400000 50 as the first line.
Then use gensim to transform that raw .txt vector file to gensim vector format:
import gensim
word_vectors = gensim.models.KeyedVectors.load_word2vec_format('path/glove.6B.50d.txt', binary=False)
word_vectors.save('path/glove_gensim.txt')
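The saved file is in gensim's native format (despite the .txt extension), so subsequent runs can reload it directly and much faster with KeyedVectors.load:

import gensim

# native-format load skips the text parsing step entirely
word_vectors = gensim.models.KeyedVectors.load('path/glove_gensim.txt')
print(word_vectors.most_similar('king', topn=3))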
Some of the other approaches here required more storage space (e.g. to split files) or were quite slow to run on my personal laptop. I tried the shelf DB, but it seemed to blow up in storage size. Here's an "in-place" approach with a one-time file-read cost and very low additional storage cost. We treat the original text file as a database and just store the position in the file of each word's line. This works really well when you're, e.g., investigating properties of word vectors.
import pickle
import numpy as np
from functools import lru_cache
from tqdm import tqdm

# First create a map from words to position in the file
def get_db_mapping(fname):
    char_ct = 0  # cumulative position in file
    pos_map = dict()
    with open(fname + ".txt", 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            new_len = len(line)  # len of line
            # get the word
            splitlines = line.split()
            word = splitlines[0].strip()
            # store and increment counter
            pos_map[word] = char_ct
            char_ct += new_len
    # write dict
    with open(fname + '.db', 'wb') as handle:
        pickle.dump(pos_map, handle)

class Embedding:
    """Small wrapper so that we can use [] notation to fetch word vectors.
    It would be better to just have the file pointer and the pos_map as part
    of this class, but that's not how I wrote it initially."""

    def __init__(self, emb_fn):
        self.emb_fn = emb_fn

    def __getitem__(self, item):
        return self.emb_fn(item)

def load_db_mapping(fname, cache_size=1000) -> Embedding:
    """Creates a function closure that wraps access to the db mapping
    and the text file that functions as db. Returns them as an
    Embedding object."""
    # get the two state objects: mapping and file pointer
    with open(fname + '.db', 'rb') as handle:
        pos_map = pickle.load(handle)
    f = open(fname + ".txt", 'r', encoding='utf-8')

    @lru_cache(maxsize=cache_size)
    def get_vector(word: str):
        pos = pos_map[word]
        f.seek(pos, 0)
        # special logic needed because of small count errors
        fail_ct = 0
        read_word = ""
        while fail_ct < 5 and read_word != word:
            fail_ct += 1
            l = f.readline()
            try:
                splitlines = l.split()
                read_word = splitlines[0].strip()
            except IndexError:
                continue
        if read_word != word:
            raise ValueError('word not found')
        # actually return
        return np.array([float(val) for val in splitlines[1:]])

    return Embedding(get_vector)
# to run
k_glove_vector_name = 'glove.42B.300d' # omit .txt
get_db_mapping(k_glove_vector_name) # run only once; creates .db
word_embedding = load_db_mapping(k_glove_vector_name)
word_embedding['hello']

A tool with an easy implementation of GloVe is zeugma: https://pypi.org/project/zeugma/

from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')

The implementation really is very easy.
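Per the zeugma README, the transformer then embeds whole texts directly; a short sketch:

# each input text is mapped to a single embedding vector
embeddings = glove.transform(['what is zeugma', 'a figure of speech'])
print(embeddings.shape)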

import numpy as np

def create_embedding_matrix(word_to_index):
    # word_to_index is a dictionary containing "word: token" pairs
    nb_words = len(word_to_index) + 1
    embeddings_index = {}
    with open('C:/Users/jayde/Desktop/IISc/DLNLP/Assignment1/glove.840B.300d/glove.840B.300d.txt', encoding="utf-8", errors='ignore') as f:
        for line in f:
            values = line.split()
            # everything before the final 300 floats is the key
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs
    embedding_matrix = np.zeros((nb_words, 300))
    for word, i in word_to_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_matrix = create_embedding_matrix(vocab_to_int)
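The resulting matrix is typically used to initialize a frozen Keras Embedding layer. A minimal sketch, assuming vocab_to_int and emb_matrix from above:

from tensorflow import keras

# initialize a non-trainable Embedding layer from the pretrained matrix
embedding_layer = keras.layers.Embedding(
    input_dim=len(vocab_to_int) + 1,
    output_dim=300,
    embeddings_initializer=keras.initializers.Constant(emb_matrix),
    trainable=False,
)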

import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))
all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# tokenizer, max_features and embed_size come from your own preprocessing setup
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
# initialize unseen words with random vectors matching the embedding statistics
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

- Please provide a comment with your answer. Why is it better than the already accepted one? – NatNgs Feb 06 '18 at 16:50
- This is coming from Kaggle, and it blows up on some GloVe files, e.g. 840B.300d. – Andrey Vykhodtsev Feb 28 '18 at 20:09