I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word-vector binary file using gensim, but I don't know how to do it when the file is in text format.
14 Answers
GloVe model files are in a word-vector text format: each line starts with a word followed by its vector components. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:
import numpy as np

def load_glove_model(glove_file):
    print("Loading Glove Model")
    glove_model = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model
You can then access the word vectors from the returned dictionary:

glove_model = load_glove_model("glove.6B.50d.txt")
print(glove_model['hello'])
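Since the loader returns a plain dict of numpy arrays, a quick sanity check is to compare two words with cosine similarity. A minimal sketch, assuming glove_model was loaded as above:

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# related words should score noticeably higher than unrelated ones
print(cosine_similarity(glove_model['king'], glove_model['queen']))
print(cosine_similarity(glove_model['king'], glove_model['potato']))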

- I'm wondering if there is a faster way of doing this. I'm using code similar to the above, but it would take around 27 hours to process the whole 6-billion-token embeddings. Any ideas of how to do this faster? – Edward Burgin Aug 15 '17 at 12:31
- @EdwardBurgin, it is taking me 1 minute to complete the whole file. Please share the "similar code" that you are referring to in your comment. – TheRajVJain Oct 30 '17 at 09:33
- Running `python test_glove.py` prints "Loading Glove Model ... 400000 words loaded!" and then fails with `NameError: name 'model' is not defined` at `print(model['hello'])`. – Mona Jalal Apr 23 '18 at 03:56
- @MonaJalal Do `model = load_glove_model("filename.txt")` first, then the print statement will work fine. – Ritwik Mar 09 '19 at 16:45
- This doesn't work for me on Python 3 using the 2.8B Twitter pretrained GloVe vectors because Python doesn't handle `"\xC2\x85"` properly. – jchook Aug 21 '19 at 20:14
- @jchook Open the file with `open(glove_file, 'r', encoding='utf-8')` and it will work. – Naveen Dennis Jan 31 '20 at 14:01
You can do it much faster with pandas:
import csv
import numpy as np
import pandas as pd

# glove_data_file is the path to your GloVe .txt file
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
Then to get the vector for a word:
def vec(w):
    # .as_matrix() was removed in newer pandas; .to_numpy() is the current equivalent
    return words.loc[w].to_numpy()
And to find the closest word to a vector:
words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name
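As a usage sketch, the classic analogy test can be run with these two helpers (assuming the GloVe file was loaded into words as above):

# the nearest neighbour of king - man + woman is often 'queen'
# (or 'king' itself, since the query word is not excluded here)
print(find_closest_word(vec('king') - vec('man') + vec('woman')))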
- Although the time to load the model reduces by almost half, the access time increases by 1000x (`.loc` versus dict access). Personally, I would prefer lower access time, because that will affect the training time. Since building the model is a one-time effort, it's better to invest the time there and save it once and for all. Do correct me if I'm wrong. – TheRajVJain Oct 30 '17 at 09:31
- You should use a couple more arguments in `read_table`: `na_values=None, keep_default_na=False`. Otherwise it will consider many valid strings (e.g. 'null', 'NA', etc.) as `nan` floating point values. – Eli Korvigo Feb 16 '18 at 23:19
- `read_table` is deprecated. Use `read_csv` with the same parameters instead. – Artur Pschybysz Apr 12 '19 at 11:43
I suggest using gensim to do everything. You can read the file, and you also benefit from the many methods already implemented in this great package.
Suppose you generated GloVe vectors using the C++ program and that your "-save-file" parameter is "vectors". The GloVe executable will generate two files, "vectors.bin" and "vectors.txt".
Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")
Finally, read the word2vec txt to a gensim model using KeyedVectors:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
Now you can use gensim word2vec methods (for example, similarity) as you'd like.
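For instance, with the standard KeyedVectors API:

# similarity between two words, and nearest neighbours of a word
print(glove_model.similarity('woman', 'man'))
print(glove_model.most_similar('frog', topn=5))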

- It looks like glove2word2vec gives the warning `This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function`. I guess the gensim function needs to be updated. – user1700890 Sep 16 '19 at 16:20
- This warning is gone in version 3.8.3 of gensim. `glove2word2vec()` is 1000% the way to go. – dmn Feb 23 '21 at 15:02
I found this approach faster.
import pandas as pd
df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in df.T.items()}
Save the dictionary:
import pickle

with open('glove.840B.300d.pkl', 'wb') as fp:
    pickle.dump(glove, fp)
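Loading the pickled dictionary back is then nearly instant compared to re-parsing the text file:

import pickle

# deserialize the word -> vector dictionary saved above
with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)
print(glove['hello'].shape)  # (300,)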

Here's a one-liner if all you want is the embedding matrix:

np.loadtxt(path, usecols=range(1, dim + 1), comments=None)

where path is the path to your downloaded GloVe file and dim is the dimension of the word embedding.
If you want both the words and the corresponding vectors, you can do

glove = np.loadtxt(path, dtype='str', comments=None)

and separate the words and vectors as follows:

words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
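If a word-to-vector lookup is what you ultimately need, the two arrays zip straight into a dict:

# build a {word: vector} mapping from the two arrays above
glove_dict = dict(zip(words, vectors))
print(glove_dict['hello'])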

Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes a while (147.2s on my machine).
What helps is converting the text file first into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file which contains the embedding vectors as a numpy structure (e.g. embeddings.npy).
Once converted, it takes me only 4.96s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you loaded it from the text file. It is just as efficient in access time and does not require any additional frameworks, but it is a lot faster in loading time.
With this code you convert your embedding text file to the two new files:
import codecs
import numpy as np

def convert_to_binary(embedding_path):
    f = codecs.open(embedding_path + ".txt", 'r', encoding='utf-8')
    wv = []
    with codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])
    np.save(embedding_path + ".npy", np.array(wv))
And with this method you load it efficiently into memory:

def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))
    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]
    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]
    return word_embedding_map
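A usage sketch (both functions append the file suffixes themselves, so pass the base name without .txt):

convert_to_binary('glove.42B.300d')  # one-time conversion; writes .vocab and .npy
embeddings = load_word_emb_binary('glove.42B.300d')  # fast load on every later run
print(embeddings['hello'])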
Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.

Python 3 version which also handles bigrams and trigrams: the word is taken to be everything except the last vector_size tokens on the line, so multi-word keys such as "new york" are joined back together instead of being mistaken for vector components.
import numpy as np

def load_glove_model(glove_file):
    print("Loading Glove Model")
    model = {}
    vector_size = 300
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            # everything before the final vector_size floats is the (possibly multi-word) key
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model
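A usage sketch with a hypothetical embedding file that contains multi-word entries:

model = load_glove_model('entity_vectors.300d.txt')  # hypothetical file name
print(model.get('new york'))  # multi-word keys survive the split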

- Could you add a short description about how it handles the bigrams, please? – Srichakradhar Feb 20 '21 at 00:12
import os
import numpy as np

EMBEDDING_DIM = 100  # set to the dimension of the file you downloaded (50, 100, 200 or 300)

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
# enter the path where you unzipped the glove file
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM)) as f:
    # it is just a space-separated text file in the format:
    # word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))
This code takes some time to store the GloVe embeddings on a shelf, but loading it afterwards is quite fast compared to the other approaches.
import os
import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path, shelf):
    print('Loading Glove')
    with open(os.path.join(glove_file_path)) as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path, shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path, shelf_file_name + '.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here, which is a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name + '.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))

Each corpus needs to start with a line containing the vocab size and the vector size, in that order. Open the .txt file of the GloVe model and insert a new first line with these two numbers: for example, for glove.6B.50d.txt, just add 400000 50 as the first line.
Then use gensim to transform that raw .txt vector file to gensim vector format:
import gensim
word_vectors = gensim.models.KeyedVectors.load_word2vec_format('path/glove.6B.50d.txt', binary=False)
word_vectors.save('path/glove_gensim.txt')
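The saved file is in gensim's native format (despite the .txt extension), so subsequent runs can reload it directly and much faster with KeyedVectors.load:

import gensim

# native-format load skips the text parsing step entirely
word_vectors = gensim.models.KeyedVectors.load('path/glove_gensim.txt')
print(word_vectors.most_similar('king', topn=3))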
Some of the other approaches here required more storage space (e.g. to split files) or were quite slow to run on my personal laptop. I tried the shelf DB, but it seemed to blow up in storage size. Here's an "in-place" approach with a one-time file-read cost and very low additional storage cost. We treat the original text file as a database and just store the position in the file of each word's line. This works really well when you're, e.g., investigating properties of word vectors.
import pickle
import numpy as np
from functools import lru_cache
from tqdm import tqdm

# First create a map from words to position in the file
def get_db_mapping(fname):
    char_ct = 0  # cumulative position in file
    pos_map = dict()
    with open(fname + ".txt", 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            new_len = len(line)  # len of line
            # get the word
            splitlines = line.split()
            word = splitlines[0].strip()
            # store and increment counter
            pos_map[word] = char_ct
            char_ct += new_len
    # write dict
    with open(fname + '.db', 'wb') as handle:
        pickle.dump(pos_map, handle)

class Embedding:
    """Small wrapper so that we can use [] notation to fetch word vectors.
    It would be better to just have the file pointer and the pos_map as part
    of this class, but that's not how I wrote it initially."""

    def __init__(self, emb_fn):
        self.emb_fn = emb_fn

    def __getitem__(self, item):
        return self.emb_fn(item)

def load_db_mapping(fname, cache_size=1000) -> Embedding:
    """Creates a function closure that wraps access to the db mapping
    and the text file that functions as db. Returns them as an
    Embedding object."""
    # get the two state objects: mapping and file pointer
    with open(fname + '.db', 'rb') as handle:
        pos_map = pickle.load(handle)
    f = open(fname + ".txt", 'r', encoding='utf-8')

    @lru_cache(maxsize=cache_size)
    def get_vector(word: str):
        pos = pos_map[word]
        f.seek(pos, 0)
        # special logic needed because of small count errors
        fail_ct = 0
        read_word = ""
        while fail_ct < 5 and read_word != word:
            fail_ct += 1
            l = f.readline()
            try:
                splitlines = l.split()
                read_word = splitlines[0].strip()
            except IndexError:
                continue
        if read_word != word:
            raise ValueError('word not found')
        # actually return
        return np.array([float(val) for val in splitlines[1:]])

    return Embedding(get_vector)
# to run
k_glove_vector_name = 'glove.42B.300d' # omit .txt
get_db_mapping(k_glove_vector_name) # run only once; creates .db
word_embedding = load_db_mapping(k_glove_vector_name)
word_embedding['hello']

A tool with an easy implementation of GloVe is zeugma: https://pypi.org/project/zeugma/

from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')

The implementation really is very easy.
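Per the zeugma README, the transformer then embeds whole texts directly; a short sketch:

# each input text is mapped to a single embedding vector
embeddings = glove.transform(['what is zeugma', 'a figure of speech'])
print(embeddings.shape)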

import numpy as np

def create_embedding_matrix(word_to_index):
    # word_to_index is a dictionary containing "word: token" pairs
    nb_words = len(word_to_index) + 1
    embeddings_index = {}
    with open('C:/Users/jayde/Desktop/IISc/DLNLP/Assignment1/glove.840B.300d/glove.840B.300d.txt', encoding="utf-8", errors='ignore') as f:
        for line in f:
            values = line.split()
            # everything before the final 300 floats is the key
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs
    embedding_matrix = np.zeros((nb_words, 300))
    for word, i in word_to_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_matrix = create_embedding_matrix(vocab_to_int)
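The resulting matrix is typically used to initialize a frozen Keras Embedding layer. A minimal sketch, assuming vocab_to_int and emb_matrix from above:

from tensorflow import keras

# initialize a non-trainable Embedding layer from the pretrained matrix
embedding_layer = keras.layers.Embedding(
    input_dim=len(vocab_to_int) + 1,
    output_dim=300,
    embeddings_initializer=keras.initializers.Constant(emb_matrix),
    trainable=False,
)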

import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))
all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# tokenizer, max_features and embed_size come from your own preprocessing setup
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
# initialize unseen words with random vectors matching the embedding statistics
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

- Please provide a comment with your answer. Why is it better than the already accepted one? – NatNgs Feb 06 '18 at 16:50
- This is coming from Kaggle, and it blows up on some GloVe files, e.g. 840B.300d. – Andrey Vykhodtsev Feb 28 '18 at 20:09