Can anyone suggest how to resolve this error? I am simply loading the GloVe vectors, and when I try to iterate over the file, it raises the error below.

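from numpy import asarray

embeddings_index = dict()
# Each line is expected to look like "word v1 v2 ...": map word -> vector
with open('/content/drive/My Drive/lstm donor/lstm_glove_vectors', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs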


---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-79-3373015fdc0b> in <module>()
      1 embeddings_index = dict()
      2 with open('/content/drive/My Drive/lstm donor/lstm_glove_vectors','r',encoding='utf-8') as f:
----> 3   for line in f:
      4           values = line.split()
      5           word = values[0]

/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Christopher Manning
Ashu

2 Answers

This is a decoding issue: Python cannot decode the data you are trying to read as UTF-8. If you are trying to read a CSV file, add the .csv extension to your file name.

MacOS

It looks like the error is saying that the first byte of the file (position 0) is 0x80, unless position 0 refers to an offset within the decoding of an individual character. Either way, this isn't a valid UTF-8 file. I don't recognize the name 'lstm_glove_vectors', so someone has either trained their own vectors or done something (at least renaming, maybe more processing) to the originally distributed vectors. Most likely this isn't a plain text file. It might be a gzipped or zip file, or the vectors might be stored as binary-encoded numbers.
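
One way to test the "not plain text" guess is to compare the first bytes of the file against a few well-known signatures. The sketch below is only a rough heuristic: the function name sniff_format is made up for illustration, and the pickle check is an assumption based on the fact that pickle protocol 2 and later begins with the byte 0x80, which matches the byte in the error but does not prove the file is a pickle.

```python
def sniff_format(path, n=4):
    """Guess a file's container format from its first few bytes (heuristic)."""
    with open(path, 'rb') as f:  # binary mode: no text decoding is attempted
        head = f.read(n)
    if head.startswith(b'\x1f\x8b'):
        return 'gzip'
    if head.startswith(b'PK\x03\x04'):
        return 'zip'
    if head.startswith(b'\x80'):
        # Assumption: pickle protocol 2+ starts with 0x80; other binary
        # formats could start with this byte as well.
        return 'pickle?'
    return 'unknown'

# Hypothetical usage with the path from the question:
# sniff_format('/content/drive/My Drive/lstm donor/lstm_glove_vectors')
```

A result of 'unknown' does not mean the file is valid text, only that none of these three signatures matched.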

I'd just try looking at the contents with something like the more or less command and seeing what seems to be there.
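
In a notebook, where more and less are less convenient, roughly the same check can be done from Python by reading the first few bytes in binary mode. This is a minimal sketch; the helper name peek is hypothetical.

```python
def peek(path, n=64):
    """Return the first n bytes of a file without decoding them."""
    with open(path, 'rb') as f:  # 'rb' avoids any UnicodeDecodeError
        return f.read(n)

# Hypothetical usage with the path from the question:
# print(peek('/content/drive/My Drive/lstm donor/lstm_glove_vectors'))
```

A plain-text vector file should show a readable word followed by numbers, while a compressed or binary file will show mostly escape sequences such as \x80.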

Final possibility: The very first release of the Common Crawl-derived GloVe vectors did have a few Unicode errors in them, so this could occur if you're using a very old data file. But that problem was fixed in 2015.
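
If the file is mostly text with only a few bad bytes, as in that last case, it can help to find out which lines fail to decode. The following is a sketch under that assumption; find_bad_lines is a made-up helper, and on a truly binary file its output will not be meaningful.

```python
def find_bad_lines(path, limit=5):
    """Return (line number, byte offset in line, line prefix) for lines
    that are not valid UTF-8, stopping after `limit` hits."""
    bad = []
    with open(path, 'rb') as f:  # iterate over raw bytes, split on b'\n'
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as exc:
                bad.append((lineno, exc.start, raw[:40]))
                if len(bad) >= limit:
                    break
    return bad
```

An empty list means every line decoded cleanly. If only a handful of lines show up, skipping or repairing them may be enough; if the very first line fails at offset 0, the file is probably not text at all.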

Christopher Manning