
I am trying to read the glove.6B.300d.txt file into a pandas DataFrame. (The file can be downloaded from here: https://github.com/stanfordnlp/GloVe)

Here are the exceptions I am getting:

glove = pd.read_csv(filename, sep = ' ')
ParserError: Error tokenizing data. C error: EOF inside string starting at line 8

glove = pd.read_csv(filename, sep = ' ', engine = 'python')
ParserError: field larger than field limit (131072)
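For reference, the C-engine "EOF inside string" error typically comes from bare quote characters that appear as tokens in the GloVe vocabulary; a minimal sketch of a workaround (using a tiny stand-in string rather than the real file) is to disable quote handling entirely:

```python
# Sketch: reading GloVe-style lines with pandas, assuming the failure is
# caused by a bare `"` token. `sample` is a made-up stand-in for the file.
import csv
import io

import pandas as pd

sample = 'the 0.1 0.2 0.3\n" 0.4 0.5 0.6\n'  # the bare `"` breaks default quoting

glove = pd.read_csv(
    io.StringIO(sample),     # with the real file, pass the path instead
    sep=' ',
    header=None,             # GloVe files have no header row
    index_col=0,             # first column is the word itself
    quoting=csv.QUOTE_NONE,  # treat `"` as an ordinary character
)
print(glove.shape)  # (2, 3)
```

Without `quoting=csv.QUOTE_NONE`, the same sample reproduces the "EOF inside string" error.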
– user8270077

3 Answers


I suggest that you read the GloVe file into a dictionary. It's a more convenient and efficient way to use this pretrained embedding.

import os
import numpy as np

embeddings_index = {}
with open(os.path.join(filename), encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                 # first token is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is the vector
        embeddings_index[word] = coefs

If your task needs the DataFrame version, you can convert the dictionary into a DataFrame by iterating over its key/value pairs.
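A minimal sketch of that conversion, assuming `embeddings_index` maps each word to a NumPy vector as built above (the two entries here are made-up stand-ins):

```python
# Sketch: dict of word -> vector to a DataFrame with one row per word.
import numpy as np
import pandas as pd

embeddings_index = {  # stand-in for the dictionary built above
    'the': np.asarray([0.1, 0.2, 0.3], dtype='float32'),
    'cat': np.asarray([0.4, 0.5, 0.6], dtype='float32'),
}

# orient='index' makes each dict key a row label and each vector a row.
glove_df = pd.DataFrame.from_dict(embeddings_index, orient='index')
print(glove_df.shape)  # (2, 3) -- one row per word, one column per dimension
```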

– thisray

Sample code for loading the GloVe embeddings as a dict:

import numpy as np

def load_glove_index():
    EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word, *arr):
        # The first token is the word; the rest are the vector components.
        return word, np.asarray(arr, dtype='float32')[:300]
    with open(EMBEDDING_FILE, encoding='utf8') as f:
        embeddings_index = dict(get_coefs(*o.split(' ')) for o in f)
    return embeddings_index

glove_embedding_index = load_glove_index()
– nag

It's better to download and unzip the archive from here: https://nlp.stanford.edu/projects/glove

After extracting the archive, you will have the GloVe text file.

– Palash Mondal
  • Welcome to Stack Overflow. Consider posting this as a comment instead of an answer. After you have gained a certain amount of reputation points, you will be able to post comments on other questions/answers. – Rahul Bhobe Jul 31 '20 at 18:00