Questions tagged [oov]

"Out of Vocabulary" words, terms, n-grams, etc in the fields of computational linguistics and natural language processing. The term for encountering items in input which do not previously exist in a dictionary, database, corpus, etc.

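As a minimal illustration of the definition above, the sketch below flags tokens that are absent from a known vocabulary. The function name `find_oov` and the sample vocabulary are hypothetical and chosen only for illustration; real pipelines usually also normalize case and punctuation first.

```python
def find_oov(tokens, vocabulary):
    """Return tokens missing from `vocabulary`, deduplicated, in first-seen order."""
    seen = set()
    oov = []
    for token in tokens:
        if token not in vocabulary and token not in seen:
            seen.add(token)
            oov.append(token)
    return oov


# Hypothetical vocabulary; in practice this would come from a training corpus
# or from a pretrained model's word list.
vocabulary = {"my", "aunt", "give", "me", "a", "teddy"}
tokens = ["my", "aunt", "give", "me", "a", "teddy", "ruxpin"]

print(find_oov(tokens, vocabulary))  # ['ruxpin']
```

What to do with the flagged items (back off to subwords, map to an unknown token, or smooth probabilities) depends on the model, which is what most questions under this tag are about.
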
12 questions
4
votes
2 answers

Part of speech tagging: tagging unknown words

In the part of speech tagger, the most probable tags for the given sentence are determined using an HMM by P(T*) = argmax_T P(Word/Tag)*P(Tag/TagPrev). But when 'Word' did not appear in the training corpus, P(Word/Tag) produces ZERO…
user1599171
3
votes
4 answers

Efficient way of resolving unknown words to known words?

I am designing a text processing program that will generate a list of keywords from a long itemized text document, and combine entries for words that are similar in meaning. There are metrics out there; however, I have a new issue of dealing with…
Slater Victoroff
  • 21,376
  • 21
  • 85
  • 144
2
votes
1 answer

How to deal with very uncommon terms in tf-idf?

I'm implementing a naive "keyword extraction algorithm". I'm self-taught, though, so I lack some of the terminology and maths common in the online literature. I'm finding the "most relevant keywords" of a document thus: I count how often each term is used in…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
1
vote
1 answer

Find most similar words for OOV word

I am looking for the most similar words for out-of-vocabulary (OOV) words using gensim. Something like this: def get_word_vec(self, model, word): try: if word not in model.wv.vocab: mostSimWord = model.wv.similar_by_word(word) …
N0rA
  • 612
  • 1
  • 7
  • 27
1
vote
1 answer

voice recognition on iOS - convert OOV words to phonemes on iOS?

I’ve successfully tried OpenEars, as suggested on Stack Overflow, and generated custom vocabularies from arrays of NSStrings. However, we also need to recognize names from the address book, and here the fallback method inevitably fails miserably very…
ranavision
  • 11
  • 1
0
votes
1 answer

How to tune FastText parameter for OOV word?

I have heard that FastText generates OOV word vectors from its n-grams. Is this built into the FastText architecture automatically, or do we need to tune specific parameters for it, like oov_tokens in the Keras tokenizer? I already…
0
votes
1 answer

How to handle out of vocab words with bag of words

I am attempting to use BoW before ML on my text-based dataset, but I do not want my training set to influence my test set (i.e., data leakage). I want to deploy BoW on the train set before the test set. But then my test set has different features…
Kim S.
  • 47
  • 5
0
votes
1 answer

Cannot reproduce pre-trained word vectors from its vector_ngrams

Just out of curiosity, I was debugging gensim's FastText code to replicate its implementation of Out-of-Vocabulary (OOV) words, and I'm not able to accomplish it. The process I'm following is training a tiny model with a toy corpus, and…
threepwood
  • 13
  • 3
0
votes
3 answers

Handling OOV words in GoogleNews-vectors-negative300.bin

I need to calculate the word vectors for each word of a sentence that is tokenized as follows: ['my', 'aunt', 'give', 'me', 'a', 'teddy', 'ruxpin']. If I were using the pretrained fastText embeddings, cc.en.300.bin.gz by Facebook, I could get…
chikitin
  • 762
  • 6
  • 28
0
votes
2 answers

fasttext: is there a way to export ngrams?

I'm new to DL and NLP, and recently started using a pre-trained fastText embedding model (cc.en.300.bin) through gensim. I would like to be able to calculate vectors for out-of-vocabulary words myself, by splitting the word into n-grams and looking up…
R Sorek
  • 3
  • 2
0
votes
2 answers

Part of speech for unknown and known words

What is the difference between part of speech tagging for unknown words and part of speech tagging for known words? Is there any tool that can predict part of speech tags for the words ..
S Gaber
  • 1,536
  • 7
  • 24
  • 43
-1
votes
1 answer

Find the list of Out Of Vocabulary (OOV) words from my domain-specific PDF while using a FastText model

How can I find the list of Out Of Vocabulary (OOV) words from my domain-specific PDF while using a FastText model? I need to fine-tune FastText with my domain-specific words.