
I am building a machine learning model that will process documents and extract key information from them. For this, I need word embeddings for OCRed text. I have several options for the embeddings (Google's word2vec, Stanford's, Facebook's fastText), but my main concern is out-of-vocabulary (OOV) words, as the OCR output will contain a lot of misspelled words. For example, I want the vectors for Embedding and Embdding (an e dropped by the OCR) to have a certain level of similarity. I don't care much about the associated contextual information.

I chose Facebook's fastText because it also gives embeddings for OOV words. My only concern is the size of the embeddings: the vectors in fastText's model have 300 dimensions. Is there a way to reduce the size of the returned word vectors? I am thinking of using PCA or another dimensionality reduction technique, but given how many word vectors there are, that could be time-consuming.

Michael Mior
ironhide012
  • 1
    You can specify a smaller vector size than 300 dimensions when training the model – and the model will be proportionately smaller. But, why is the size a concern? (Have you hit system resource limits when using usual sizes?) – gojomo Nov 19 '19 at 22:41
  • Also, while FastText's sensitivity to subword fragments (character n-grams) might help a bit with OCR errors, it alone might not be enough. You might instead want to apply some other spell-checking process, to replace words that don't exist in other non-glitchy corpuses with other best-guess (based on edit-distance, relative word-frequencies, or context-words) real words. – gojomo Nov 19 '19 at 22:44
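The edit-distance idea from the comment above can be sketched roughly as follows. This is only a minimal illustration, not a full spell-checker: the `edit_distance` and `correct` helpers and the tiny `vocabulary` of word counts are hypothetical, and a real system would use a large frequency list from a clean corpus. Ties on distance are broken by preferring the more frequent word.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct(word: str, vocabulary: dict, max_distance: int = 2) -> str:
    """Return the closest known word, or the input if nothing is close enough."""
    if word in vocabulary:
        return word
    # Smallest edit distance first, then highest frequency.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), -vocabulary[v]))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical word counts standing in for a clean reference corpus.
vocabulary = {"embedding": 50, "bedding": 5, "document": 80}
print(correct("embdding", vocabulary))  # embedding
```

Scanning the whole vocabulary for every token is slow for large word lists, so in practice you would probably only run this on tokens that are not already in the vocabulary.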

1 Answer

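import fasttext
import fasttext.util

# Load the pre-trained 300-dimensional English model
ft = fasttext.load_model('cc.en.300.bin')
print(ft.get_dimension())  # 300

# Reduce the loaded model in place to 100 dimensions
fasttext.util.reduce_model(ft, 100)
print(ft.get_dimension())  # 100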

This code should reduce the embedding length from 300 dimensions to 100.

Link to official documentation: https://fasttext.cc/docs/en/crawl-vectors.html
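If you would rather do the reduction yourself, as the question suggests, here is a minimal PCA sketch with NumPy. As far as I can tell, `reduce_model` does something along these lines internally, but treat that as an assumption. The `pca_reduce` helper is hypothetical and the random matrix is just a stand-in for real word vectors. Note that the covariance matrix is only 300 x 300 regardless of vocabulary size, so the eigendecomposition itself is cheap.

```python
import numpy as np


def pca_reduce(vectors: np.ndarray, target_dim: int):
    """Project row vectors onto their top `target_dim` principal components."""
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Covariance is (dim x dim), independent of the number of words.
    cov = centered.T @ centered / (len(vectors) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:target_dim]    # indices of the largest ones
    projection = eigvecs[:, top]                    # shape (dim, target_dim)
    return centered @ projection, mean, projection


# Random data standing in for a (num_words x 300) embedding matrix.
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(1000, 300))

reduced, mean, projection = pca_reduce(fake_vectors, 100)
print(reduced.shape)  # (1000, 100)
```

To reduce a vector that was not part of the original matrix (an OOV word, say), apply the same transform: `(vec - mean) @ projection`.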

Michael Mior
Pedro Muñoz
  • 1
    Here is the direct link to the installation instructions: https://github.com/facebookresearch/fastText/tree/master/python#installation – Gearoid Murphy Mar 19 '20 at 00:09