
Can I use fastText word vectors, like the ones here: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, in a TensorFlow script as embedding vectors instead of word2vec or GloVe, without using the fastText library?

Aggounix

1 Answer


To load pre-trained word vectors, you can use the gensim library.

For your reference: https://blog.manash.me/how-to-use-pre-trained-word-vectors-from-facebooks-fasttext-a71e6d55f27

In [1]: from gensim.models import KeyedVectors

In [2]: jp_model = KeyedVectors.load_word2vec_format('wiki.ja.vec')

In [3]: jp_model.most_similar('car')
Out[3]: 
[('cab', 0.9970724582672119),
 ('tle', 0.9969051480293274),
 ('oyc', 0.99671471118927),
 ('oyt', 0.996662974357605),
 ('車', 0.99665766954422),
 ('s', 0.9966464638710022),
 ('新車', 0.9966358542442322),
 ('hice', 0.9966053366661072),
 ('otg', 0.9965877532958984),
 ('車両', 0.9965814352035522)]

EDIT

I created a new branch forked from cnn-text-classification-tf. Here is the link: https://github.com/satojkovic/cnn-text-classification-tf/tree/use_fasttext

In this branch, there are three modifications for using fastText.

  1. Extract the vocab and the word_vec from fasttext. (util_fasttext.py)
import pickle
import numpy as np
from gensim.models import KeyedVectors

# Extract the vocabulary and the embedding matrix from the fastText .vec file
model = KeyedVectors.load_word2vec_format('wiki.en.vec')
vocab = model.vocab
embeddings = np.array([model.word_vec(k) for k in vocab.keys()])

# Save both so the training script can load them later (step 3)
with open('fasttext_vocab_en.dat', 'wb') as fw:
    pickle.dump(vocab, fw, protocol=pickle.HIGHEST_PROTOCOL)
np.save('fasttext_embedding_en.npy', embeddings)
  2. Embedding layer

    W is initialized with zeros, then an embedding_placeholder is set up to receive the word vectors, and finally W is assigned from the placeholder. (text_cnn.py)

# Non-trainable embedding matrix, initialized with zeros
W_ = tf.Variable(
    tf.constant(0.0, shape=[vocab_size, embedding_size]),
    trainable=False,
    name='W')

# Placeholder that receives the pre-trained word vectors
self.embedding_placeholder = tf.placeholder(
    tf.float32, [vocab_size, embedding_size],
    name='pre_trained')

# Assign op that copies the pre-trained vectors into W
W = tf.assign(W_, self.embedding_placeholder)
  3. Use the vocab and the word_vec

    The vocab is used to build the word-id maps, and the word_vec is fed into the embedding_placeholder.

# Load the vocabulary and the embedding matrix saved in step 1
with open('fasttext_vocab_en.dat', 'rb') as fr:
    vocab = pickle.load(fr)
embedding = np.load('fasttext_embedding_en.npy')

# Build the word-id map from the fastText vocabulary
pretrain = vocab_processor.fit(vocab.keys())
x = np.array(list(vocab_processor.transform(x_text)))

# Feed the pre-trained vectors into the embedding placeholder
feed_dict = {
    cnn.input_x: x_batch,
    cnn.input_y: y_batch,
    cnn.dropout_keep_prob: FLAGS.dropout_keep_prob,
    cnn.embedding_placeholder: embedding
}

Please try it out.

satojkovic
  • How can I use jp_model in a TensorFlow script as a pre-trained vector? – Aggounix Jul 03 '17 at 09:42
  • I added some information. Please check my answer for more details. (EDIT section) – satojkovic Jul 10 '17 at 23:11
  • I'm glad it was helpful – satojkovic Jul 12 '17 at 02:21
  • FastText should extract vectors for out-of-vocabulary words using character n-grams. But in your code, you extract the vocabulary dictionary first and feed it to the model as the embedding, so I think the model will fail to generate a vector for a new word (a sketch addressing this follows after these comments). – Kerem Apr 23 '18 at 13:12
  • Is there a smaller version of the fastText vectors? I'm hesitant to load a 6.1 GB file :( – kRazzy R Jul 17 '18 at 20:56
  • @kRazzyR Please refer to the following site: https://fasttext.cc/docs/en/english-vectors.html (for example, the wiki-news-300d-1M.vec file is 2.2 GB) – satojkovic Feb 11 '19 at 04:18
  • It seems that in TensorFlow 2, tf.placeholder has been removed! What is the fix for this, please? When I follow your method to use fastText as an embedding in TF2, I get the error: AttributeError: module 'tensorflow' has no attribute 'placeholder'. (a TF2 sketch follows below) – chikitin Oct 31 '19 at 09:57
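Regarding the out-of-vocabulary comment above: the .vec file only stores vectors for known words, so the embedding matrix built in step 1 indeed cannot cover unseen words. One possible workaround, sketched below under the assumption that the corresponding wiki.en.bin file has been downloaded and that a recent gensim version (which provides load_facebook_vectors) is available, is to load the binary model, which keeps the character n-gram buckets and can therefore compose vectors for new words. This is a sketch, not part of the original branch:

import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Load the full binary model (wiki.en.bin); unlike the .vec file, it keeps
# the character n-gram buckets, so vectors can be composed for unseen words.
ft = load_facebook_vectors('wiki.en.bin')

# In-vocabulary word: the stored vector is returned directly.
v_known = ft['car']

# Out-of-vocabulary word: the vector is built from character n-grams,
# which the pre-extracted embedding-matrix approach above cannot do.
v_oov = ft['carrrrr']

print(v_known.shape, v_oov.shape)  # both (300,) for the wiki.en vectors

Any vectors composed this way would still have to be appended to the embedding matrix (or looked up at preprocessing time), since the TensorFlow graph above only sees a fixed matrix.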
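On the TensorFlow 2 question above: tf.placeholder was indeed removed in TF2, but the placeholder is not strictly needed there; the pre-trained matrix can be passed directly as the initial value of a variable or to a Keras Embedding layer. A rough TF2 sketch of the same idea (my own adaptation, not from the original branch):

import numpy as np
import tensorflow as tf

# The (vocab_size, embedding_size) matrix saved as fasttext_embedding_en.npy in step 1
embedding = np.load('fasttext_embedding_en.npy')

# Option 1: a frozen tf.Variable initialized directly from the matrix
W = tf.Variable(embedding.astype(np.float32), trainable=False, name='W')

# Option 2: a Keras Embedding layer with the pre-trained weights
embedding_layer = tf.keras.layers.Embedding(
    input_dim=embedding.shape[0],
    output_dim=embedding.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(embedding),
    trainable=False)

# Lookup works the same way as tf.nn.embedding_lookup in the TF1 code
ids = tf.constant([[1, 2, 3]])
vecs = embedding_layer(ids)  # shape (1, 3, embedding_size)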