
I have a corpus of sentences in a specific domain. I am looking for an open-source code/package that I can feed the data to, and it will train a good, reliable language model (meaning: given a context, it knows the probability of each word).

Is there such a code/project?

I saw this GitHub repo: https://github.com/rafaljozefowicz/lm, but it didn't work.

Cranjis
  • Look for tensor2tensor. – Stefan Falk Oct 14 '18 at 21:54
  • @StefanFalk do you have an example of how to use it on my own data? I couldn't find any. – Cranjis Oct 15 '18 at 14:25
  • It's not very clear what you are looking for actually. Are we talking here about translation, speech recognition, word embeddings or something else? – Stefan Falk Oct 15 '18 at 16:50
  • @StefanFalk I have a specific domain (a corpus of sentences) and I want to model it to know the probabilities of words given context. Meaning, after training, I want to know the probabilities p(w | context) for various words and contexts (from my own corpus). – Cranjis Oct 16 '18 at 14:39
  • There are several ways to do that. One thing I can think of is word2vec here. [See this answer](https://stackoverflow.com/a/42187104/826983) for example. You might want to try [`gensim's word2vec`](https://radimrehurek.com/gensim/models/word2vec.html) for this. – Stefan Falk Oct 16 '18 at 15:00
  • If you want to understand word2vec a bit better I recommend [this paper](https://arxiv.org/pdf/1411.2738.pdf) which is a bit easier to understand than [the original](https://arxiv.org/pdf/1310.4546.pdf) imho. – Stefan Falk Oct 16 '18 at 15:05
  • @StefanFalk word2vec is an embedding. I want probabilities for a sentence. – Cranjis Oct 16 '18 at 16:17
  • It's not clear to me what you want. You just said you want p(w | context) which is what word2vec is doing/trying to model. See [this answer](https://datascience.stackexchange.com/a/10417/25289) which basically explains how [this](https://datascience.stackexchange.com/a/17992/25289) works. – Stefan Falk Oct 16 '18 at 16:48
  • @StefanFalk I would rather use a model that was validated and evaluated, preferably an LSTM model. Does such a model exist? – Cranjis Oct 16 '18 at 17:29
  • That depends now on what you actually want. E.g. there are pre-trained word vectors for word2vec. – Stefan Falk Oct 17 '18 at 07:53
  • @StefanFalk I want to train it on my own domain (words in another language). – Cranjis Oct 17 '18 at 13:20

2 Answers


I recommend writing your own basic implementation. First, we need some sentences:

import nltk
nltk.download("brown")  # fetch the Brown corpus if it isn't already downloaded
from nltk.corpus import brown
sentences = list(brown.sents())

sentences is now a list of lists. Each sublist represents a sentence with each word as an element. Now you need to decide whether or not you want to include punctuation in your model. If you want to remove it, try something like the following:

punctuation = {",", ".", ":", ";", "!", "?"}  # a set makes membership tests fast
for i, sentence in enumerate(sentences):
    sentences[i] = [word for word in sentence if word not in punctuation]

Next, you need to decide whether or not you care about capitalization. If you don't care about it, you could remove it like so:

for i, sentence in enumerate(sentences):
    sentences[i] = [word.lower() for word in sentence]  # lower-case every word

Next, we need special start and end words so the model can learn which words are likely at the beginning and end of a sentence. Pick start and end words that don't exist anywhere in your training data.

start = ["<<START>>"]
end = ["<<END>>"]
for i, sentence in enumerate(sentences):
    sentences[i] = start + sentence + end

Now, let's count unigrams. A unigram is a sequence of one word in a sentence. Yes, a unigram model is just a frequency distribution of each word in the corpus:

new_words = list()
for sentence in sentences:
    for word in sentence:
        new_words.append(word)
total_words = len(new_words)  # count words *after* preprocessing so the probabilities stay consistent
unigram_fdist = nltk.FreqDist(new_words)

And now it's time to count bigrams. A bigram is a sequence of two adjacent words in a sentence. So, for the sentence "i am the walrus", we have the following bigrams: "<<START>> i", "i am", "am the", "the walrus", and "walrus <<END>>".

bigrams = list()
for sentence in sentences:
    new_bigrams = nltk.bigrams(sentence)
    bigrams += new_bigrams

Now we can create a frequency distribution:

bigram_fdist = nltk.ConditionalFreqDist(bigrams)

Finally, we want to know the probability of each word in the model:

def getUnigramProbability(word):
    if word in unigram_fdist:
        return unigram_fdist[word]/total_words
    else:
        return -1 # You should figure out how you want to handle out-of-vocabulary words

def getBigramProbability(word1, word2):
    if word1 not in bigram_fdist:
        return -1 # You should figure out how you want to handle out-of-vocabulary words
    elif word2 not in bigram_fdist[word1]:
        # i.e. "word1 word2" never occurs in the corpus
        return getUnigramProbability(word2)
    else:
        bigram_frequency = bigram_fdist[word1][word2]
        unigram_frequency = unigram_fdist[word1]
        bigram_probability = bigram_frequency / unigram_frequency
        return bigram_probability

While this isn't a framework/library that just builds the model for you, I hope seeing this code has demystified what goes on in a language model.
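For example, you can score a whole sentence by chaining the bigram probabilities: P(sentence) ≈ P(w1 | <<START>>) * P(w2 | w1) * ... * P(<<END>> | wn). Here is a minimal sketch using the functions above (the helper name is mine, and in practice you would apply smoothing rather than bail out on the -1 sentinel):

import math

def getSentenceLogProbability(sentence):
    # Chain bigram probabilities in log space to avoid numeric underflow
    words = ["<<START>>"] + sentence.lower().split() + ["<<END>>"]
    log_prob = 0.0
    for word1, word2 in zip(words, words[1:]):
        probability = getBigramProbability(word1, word2)
        if probability <= 0:
            return float("-inf")  # out-of-vocabulary; a real model would smooth instead
        log_prob += math.log(probability)
    return log_prob

print(getSentenceLogProbability("the jury said it"))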

inkalchemist1994

You might try the word_language_model example from the PyTorch examples repository. One caveat if you have a big corpus: the example loads all of the data into memory.
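Roughly, you clone the repo and point the training script at a directory containing your own train.txt, valid.txt, and test.txt files. A sketch of the invocation (the flag names are from my recollection of the example's README, so verify them there):

git clone https://github.com/pytorch/examples.git
cd examples/word_language_model
# --data should point at a folder with train.txt, valid.txt and test.txt
python main.py --data ./data/my_corpus --model LSTM --epochs 6 --cuda

After training, the script saves a model checkpoint that you can reload to read word probabilities off the softmax output for any context.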

Tomas P