
I am trying to run GloVe word embeddings on a Bengali news dataset. The original GloVe project only provides pretrained vectors for English, but I found this, which has pretrained word vectors for 30 non-English languages.

I am running this notebook on text classification using GloVe embeddings. My questions are:

  1. Can I use the pre-trained Bengali word vectors with my custom Bengali dataset and run them through this model?

  2. The pretrained Bengali word vectors are in TSV format. Using the following code, I cannot parse them into a word-vector dictionary:

     embeddings_index = {}
     f = open(root_path + 'bn.tsv')
     for line in f:
         values = line.split('\t')
         word = values[1] ## The first entry is the word
         coefs = np.asarray(values[1:], dtype='float32') ## These are the vectors representing the embedding for the word
         embeddings_index[word] = coefs
     f.close()
    
     print('GloVe data loaded')
    

and I get the following error:

---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-39-3a4cb8d8dfb0> in <module>()
          4     values = line.split('\t')
          5     word = values[1] ## The first entry is the word
    ----> 6     coefs = np.asarray(values[1:], dtype='float32') ## These are the vectors representing the embedding for the word
          7     embeddings_index[word] = coefs
          8 f.close()

    /usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
         83 
         84     """
    ---> 85     return array(a, dtype, copy=False, order=order)
         86 
         87 

    ValueError: could not convert string to float: 'এবং'
afsara_ben
  • '.tsv' is not the usual format for sharing trained word-vectors, so I'd be wary of any source providing vectors in that format. (And, your error suggests that a non-numeric string is still present in your `values[1:]` parameter. Are you sure the contents of `values` are what you expect?) If your main aim is to acquire reusable vectors, Facebook themselves have provided vectors in many languages, including Bengali, at https://fasttext.cc/docs/en/crawl-vectors.html – gojomo Jul 14 '20 at 00:21
  • And, you don't necessarily need to provide your own parsing code, or store vectors in a raw Python dict. A library like `gensim` includes support classes for loading & working with pre-trained vectors in various formats. – gojomo Jul 14 '20 at 00:23
  • Thanks. But do you know if GloVe has any non-English pretrained word vectors like this? I want to cover all of the pretrained word vectors available out there – afsara_ben Jul 14 '20 at 08:31
  • Your search would be as good as mine on that issue. But most people who want to use pre-trained vectors just want the vectors to be useful, without caring too much about the technique used to train them. GloVe is a little harder to train on giant vocabularies, & doesn't have the OOV benefits of FastText. So I'm not sure what project *would* go through the same exercise, of pre-training comparable vectors across many languages, that Facebook has. So if you specifically need GloVe vectors for some evaluative purpose, you may need to train them yourself. – gojomo Jul 14 '20 at 15:45

0 Answers