
I'm new to DL and NLP, and recently started using a pre-trained fastText embedding model (cc.en.300.bin) through gensim.

I would like to be able to calculate vectors for out-of-vocabulary words myself, by splitting the word into n-grams and looking up the vector for every n-gram.

I could not find a way to export the n-gram vectors that are part of the model. I realize they are hashed, but perhaps there's a way (not necessarily using gensim) to get them?

Any insight will be appreciated!

R Sorek

2 Answers


You can see exactly how gensim creates FastText word-vectors for out-of-vocabulary words by examining the source code of the `word_vec()` method of its `FastTextKeyedVectors` class directly:

https://github.com/RaRe-Technologies/gensim/blob/3aeee4dc460be84ee4831bf55ca4320757c72e7b/gensim/models/keyedvectors.py#L2069

(Note that this source code in gensim's develop branch may reflect recent FastText fixes that aren't in your installed package, up through gensim version 3.7.1; you may want to consult your installed package's local source code, or wait for these fixes to appear in an official release.)

Because Python doesn't protect any part of the relevant objects from external access (with things like enforced 'private' designations), you can perform the exact same operations from outside the class.

Note particularly that, in the current code (which matches the behavior of Facebook's original implementation), n-gram vectors are pulled from buckets in the hashed `ngram_weights` structure whether or not those n-grams were actually seen in the training data. Where the n-grams were seen and meaningful in the training data, that should help the OOV vector a bit. Where an arbitrary other vector is returned instead, such randomness shouldn't hurt much.
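To make the scheme concrete, here is a minimal, self-contained sketch (not gensim's actual code) of the two primitives involved: fastText's character n-gram extraction and its FNV-1a hash, which together map any n-gram, seen or unseen, onto one of `bucket` rows of the n-gram matrix. The function names are illustrative; the `ngram_matrix` argument stands in for whatever array holds the bucket vectors in your installed gensim version (e.g. `vectors_ngrams` in recent releases, `syn0_ngrams` in older ones).

import numpy as np

def ft_hash(ngram):
    # FNV-1a hash as in fastText's Dictionary::hash; note that the C++ code
    # folds each byte in after casting it to a *signed* char (int8_t).
    h = 2166136261
    for b in ngram.encode("utf-8"):
        if b > 127:
            b -= 256                       # mimic the int8_t cast
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def char_ngrams(word, minn=3, maxn=6):
    # Character n-grams of '<word>' with lengths minn..maxn (simplified: works
    # on unicode characters, while Facebook's C++ walks the raw UTF-8 bytes).
    wrapped = "<%s>" % word
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

def oov_vector(word, ngram_matrix, bucket=2000000, minn=3, maxn=6):
    # Average the bucket rows hit by the word's n-grams, the way a vector is
    # assembled for an out-of-vocabulary word; unseen n-grams simply land in
    # some (arbitrary) bucket too. In-vocabulary words are handled differently.
    grams = char_ngrams(word, minn, maxn)
    vec = np.zeros(ngram_matrix.shape[1], dtype=np.float32)
    for g in grams:
        vec += ngram_matrix[ft_hash(g) % bucket]
    return vec / max(len(grams), 1)

With a model loaded, you could sanity-check the result against `model.wv.word_vec()` for an out-of-vocabulary word, keeping in mind that some gensim versions route hashes through an extra `hash2index` mapping rather than indexing the full bucket matrix directly, and that the character-level splitting above is a simplification of the byte-level handling in Facebook's C++ code.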

gojomo
  • Thanks gojomo! I am able to see the embedding vector for a specific n-gram (ngram_weights) but I would like to be able to get this for *all* n-grams, so I could create a 'dictionary' of all n-grams. I think I can even access all vectors at once through model.wv.wv.syn0_vocab, but this is hashed and I don't know how to link the vectors to the subwords they represent... Any ideas? – R Sorek Mar 14 '19 at 16:16
  • Why do you want a dictionary? Facebook's FT implementation chose its data structure, a hashtable that doesn't care about bucket collisions, for good reasons: it saves space, avoids maintaining a list of all "known" n-grams, & it doesn't hurt much to return a random/arbitrary vector for previously-unseen n-grams (or occasionally the same vector for more than 1 unique n-gram). And, if you want to calculate the same vectors as FB FT, you'll have to make the same tradeoffs. So I'm wondering what would drive the need for a more precise (n-gram)->(vector) mapping. – gojomo Mar 14 '19 at 19:33
  • 1
    That said, while the `FastText` model structures don't include any explicit list of known n-grams, or precise map of (exact n-gram)->(exact vector), you can synthesize the same. You'd iterate over the list of known full-words, splitting them up into n-grams of the appropriate lengths. Then, look up those n-grams in the hashtable. You could create your own (string n-gram)->(vector) dict from that. It'd take up more space than the native structure, and wouldn't handle never-before-seen n-grams in a compatible way. – gojomo Mar 14 '19 at 19:36
  • Thanks again gojomo! The reason for this is that my application can only use a limited amount of RAM, and loading a multi-gigabyte *.bin model is just not feasible. What we did so far is load a *.vec version of the model from disk. This works well in terms of latency but does not support subwords... Is this making more sense? Do you have a better solution? Thanks again! – R Sorek Mar 17 '19 at 13:15
  • The n-grams hashtable size is controlled by a model setup parameter called `bucket` – default value 2,000,000. Unless `bucket` was far larger than necessary, slimming a model down to just the "known n-grams" is unlikely to save any memory. (Most bucket slots should be used by known n-grams, & for a fullish hashtable, doing hash-to-slot at lookup-time uses *less* memory than maintaining a (string)->(slot) list. This looks like a common-crawl-derived model with a giant vocab; I'd expect nearly all buckets to be relevant, but you could check that via the synthesize-n-grams trick mentioned above.) – gojomo Mar 17 '19 at 20:16
  • If buying/renting more RAM isn't a possibility, & the real need is "wish this FT model was smaller", then the two main options I see are: (1) training your own FT model to have a smaller surviving vocabulary (via a larger `min_count`) and/or fewer n-gram buckets (via a smaller `bucket`); (2) performing surgery on the existing model to slim it & re-save it. There's no built-in support for this, but by looking at the source it might not be too hard. You'd discard some full words (from the less-frequent tail end of known words); you can *probably* replace the hashtable with one of 1/2, 1/4, etc. size by combining slots. – gojomo Mar 17 '19 at 20:36
  • (This hypothetical hashtable-shrinkage might make n-gram performance arbitrarily worse, by increasing the number of n-gram collisions. Essentially, you'd change the 2,000,000 slots to something like 250,000 – 1/8th the size – by combining the existing values at all slots with the same `slot MOD 250000` value.) – gojomo Mar 17 '19 at 20:39
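Following the "synthesize the same" comment above, here is a rough sketch of building an explicit (n-gram) -> (vector) dictionary from the known vocabulary. It reuses the hypothetical `char_ngrams()` and `ft_hash()` helpers from the sketch in the answer body above, and the gensim attribute names in the usage line (`index2word`, `vectors_ngrams`) are assumptions that vary across versions.

def build_ngram_dict(vocab_words, ngram_matrix, bucket=2000000, minn=3, maxn=6):
    # Map every n-gram that occurs in the known vocabulary to its bucket vector;
    # colliding n-grams share a bucket and therefore share a vector.
    ngram_to_vec = {}
    used_buckets = set()
    for word in vocab_words:
        for g in char_ngrams(word, minn, maxn):
            slot = ft_hash(g) % bucket
            used_buckets.add(slot)
            ngram_to_vec.setdefault(g, ngram_matrix[slot])
    print("distinct n-grams: %d, buckets hit: %d of %d"
          % (len(ngram_to_vec), len(used_buckets), bucket))
    return ngram_to_vec

# e.g. build_ngram_dict(model.wv.index2word, model.wv.vectors_ngrams)

The `used_buckets` count also gives the check mentioned in the comments above: how many of the 2,000,000 buckets are actually reachable from the known vocabulary.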

I recently encountered this issue myself and had to write a script to reduce the model size. The fastText C++ code includes a handy dictionary function, `threshold`, to reduce the dictionary size, but it's not exposed to the Python bindings. After the dictionary reduction you also need to rebuild the input matrix, including the n-gram buckets that come after the main word vectors. After saving the model this way, all word vectors will be generated from the subword information alone (except for the dictionary words that remain).

For word similarity search, `output_` and `model_` are not used. To save more memory you can also just comment out the part that writes `output_` in `saveModel`.

Note that the n-gram entries by themselves are about 2 GB in the English pretrained model (2,000,000 buckets × 300 dimensions × 4 bytes per float ≈ 2.2 GiB), so that's about the smallest you can make the model even if all dictionary words are removed.

/* note: some dict_ members are public for easier access */
void FastText::quantize(const Args& qargs) {
  /* original guard removed so this can run on an unsupervised model:
  if (args_->model != model_name::sup) {
    throw std::invalid_argument(
        "For now we only support quantization of supervised models");
  }*/
  args_->input = qargs.input;
  args_->qout = qargs.qout;
  args_->output = qargs.output;
  std::shared_ptr<DenseMatrix> input =
      std::dynamic_pointer_cast<DenseMatrix>(input_);
  std::shared_ptr<DenseMatrix> output =
      std::dynamic_pointer_cast<DenseMatrix>(output_);
  bool normalizeGradient = (args_->model == model_name::sup);

  if (qargs.cutoff > 0 && qargs.cutoff < input->size(0)) {
    /* original embedding selection replaced by a dictionary threshold:
    auto idx = selectEmbeddings(qargs.cutoff);
    dict_->prune(idx);*/
    int32_t rows = dict_->size_ + args_->bucket;  // rows before pruning
    dict_->threshold(2000, 2000);                 // drop entries with count < 2000
    std::cerr << "words:  " << dict_->size_ << std::endl;
    std::cerr << "rows:  " << rows << std::endl;
    int32_t new_rows = dict_->size_ + args_->bucket;  // rows after pruning
    std::shared_ptr<DenseMatrix> ninput =
        std::make_shared<DenseMatrix>(new_rows, args_->dim);

    // Copy the vectors of the surviving dictionary words. Because the
    // dictionary is sorted by count, the survivors keep the leading ids,
    // so getId() still addresses the right row of the old input matrix.
    for (auto i = 0; i < dict_->size_; i++) {
      int32_t index = dict_->getId(dict_->words_[i].word);
      for (auto j = 0; j < args_->dim; j++) {
        ninput->at(i, j) = input->at(index, j);
      }
    }

    // Copy the n-gram bucket rows, which sit after all the word rows.
    int32_t offset = rows - new_rows;
    for (auto i = dict_->size_; i < new_rows; i++) {
      for (auto j = 0; j < args_->dim; j++) {
        ninput->at(i, j) = input->at(i + offset, j);
      }
    }
    input_ = ninput;

    if (qargs.retrain) {
      args_->epoch = qargs.epoch;
      args_->lr = qargs.lr;
      args_->thread = qargs.thread;
      args_->verbose = qargs.verbose;
      auto loss = createLoss(output_);
      // retrain against the rebuilt input matrix, not the old one
      model_ = std::make_shared<Model>(ninput, output, loss, normalizeGradient);
      startThreads();
    }
  }

  /* the quantization itself is skipped; keep the plain DenseMatrix:
  input_ = std::make_shared<QuantMatrix>(
      std::move(*(input.get())), qargs.dsub, qargs.qnorm);
  if (args_->qout) {
    output_ = std::make_shared<QuantMatrix>(
        std::move(*(output.get())), 2, qargs.qnorm);
  }
  quant_ = true;
  */
  auto loss = createLoss(output_);
  model_ = std::make_shared<Model>(input_, output_, loss, normalizeGradient);
}
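
For readers who'd rather reason about it in Python, the index arithmetic of the two copy loops above boils down to the small numpy sketch below. It assumes, as the C++ code does, that the input matrix stores the word vectors first (sorted by count, so the pruned words are the tail) followed by the `bucket` n-gram rows; the function name and arguments are illustrative, not part of fastText's API.

import numpy as np

def shrink_input_matrix(input_matrix, old_nwords, new_nwords, bucket):
    # Keep the first new_nwords word rows (the survivors of the frequency
    # threshold) plus the n-gram bucket rows that follow all word rows.
    dim = input_matrix.shape[1]
    ninput = np.empty((new_nwords + bucket, dim), dtype=input_matrix.dtype)
    ninput[:new_nwords] = input_matrix[:new_nwords]
    ninput[new_nwords:] = input_matrix[old_nwords:old_nwords + bucket]
    return ninput

This is the same bookkeeping the `offset = rows - new_rows` variable performs above.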
Jack000