How to find a list of Out Of Vocabulary (OOV) words from my domain-specific PDF while using a FastText model? I need to fine-tune FastText with my domain-specific words.
Welcome to Stack Overflow, which you should note is _not_ a code-writing service. Read [tour] and [ask]. This is a useful checklist to work through when you want to create a question: https://meta.stackoverflow.com/questions/260648/stack-overflow-question-checklist – DisappointedByUnaccountableMod Jul 26 '21 at 07:34
1 Answer
A FastText model will already be able to generate vectors for OOV words.
So there's not necessarily any need either to list the specific OOV words in your PDF, or to 'fine-tune' a FastText model.
You just ask it for vectors, and it gives them back. The vectors for full in-vocabulary words, trained from relevant training material, will likely be best, while vectors synthesized for OOV words from word fragments (character n-grams) shared with the training material will just be rough guesses: better than nothing, but not great.
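For example, with Gensim's FastText implementation (the tiny corpus below is just a placeholder for your real domain text), an OOV word still gets a vector assembled from its character n-grams:

```python
from gensim.models import FastText

# Toy training corpus (placeholder for your real, domain-specific text)
sentences = [
    ["the", "patient", "received", "methotrexate", "weekly"],
    ["the", "dosage", "was", "adjusted", "after", "review"],
]

# Train a small FastText model; the character n-grams are what let it
# synthesize vectors for words it never saw during training.
model = FastText(sentences, vector_size=64, window=3, min_count=1, epochs=20)

print("patient" in model.wv.key_to_index)       # True  - in vocabulary
print("methotrexat" in model.wv.key_to_index)   # False - OOV (misspelled)

# Both lookups succeed; the OOV vector is built from shared n-grams.
vec_known = model.wv["patient"]
vec_oov = model.wv["methotrexat"]
print(vec_known.shape, vec_oov.shape)           # (64,) (64,)
```

The same `key_to_index` membership check is all you would need if you still want to enumerate which of your PDF's tokens happen to be OOV for a given model.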
(To train a good word-vector requires many varied examples of a word's use, interleaved with similarly good examples of its many 'peer' words – and traditionally, in one unified, balanced training session.)
If you think you need to do more, you should expand your question with more details about why you think that's necessary, and what existing precedents (in docs/tutorials/papers) you're trying to match.
I've not seen a well-documented way to casually fine-tune, or incrementally expand the known-vocabulary of, an existing FastText model. There would be a lot of expert tradeoffs required, and in many cases simply training a new model with sufficient data is likely to be a safer approach.
Anyone seeking such fine-tuning should have a clear idea of:
- what their incremental data might be able to add to an existing model
- what process/code they will be using, and why that process/code might be expected to give meaningful results with their specific starting model & new data
- how the results of any such process can be evaluated to ensure the extra fine-tuning steps are beneficial compared to alternatives
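If, after weighing those points, you still want to experiment, Gensim's FastText does expose an incremental-training mechanism (a vocabulary update followed by further training passes). Treat the sketch below as exactly the kind of under-documented, tradeoff-laden process described above rather than an endorsed fine-tuning recipe; the file paths and sentences are placeholders:

```python
from gensim.models import FastText

# Load a previously trained Gensim FastText model (path is a placeholder)
model = FastText.load("pretrained_fasttext.model")

# New domain-specific sentences (placeholder for text extracted from your PDF)
domain_sentences = [
    ["methotrexate", "toxicity", "was", "monitored", "weekly"],
    ["patients", "on", "methotrexate", "showed", "improvement"],
]

# Add the new words to the existing vocabulary, then continue training.
# Caution: continued training only updates words/n-grams seen in the new
# data, so those vectors can drift out of alignment with the rest of the model.
model.build_vocab(domain_sentences, update=True)
model.train(domain_sentences,
            total_examples=len(domain_sentences),
            epochs=model.epochs)

model.save("fasttext_domain_updated.model")
```

Whether the updated vectors are actually better than the originals, or than a fresh model trained on the combined data, is exactly what the evaluation step above needs to answer.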
