Have a database with thousands (or millions) of valid sentences.
Create a dictionary where each word is represented by a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence"):

word_dic = {"_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v: k for k, v in word_dic.items()}
Remember to add "_start_"
and "_end_"
at the beginning and end of all sentences in the database, and "_nothing_"
after the end to complete the desired length capable of containing all sentences. (Ideally, work with sentences with 10 or less words, so your model wont't try to create bigger sentences).
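A minimal sketch of that padding step, assuming a fixed length of 12 (10 words plus the two markers) and that raw_sentences is a hypothetical list of already tokenized sentences (lists of word strings):

import numpy as np

length = 12  # 10 words + "_start_" + "_end_"
padded = []
for sentence in raw_sentences:  # each sentence is a list of word strings
    s = ["_start_"] + sentence + ["_end_"]
    s += ["_nothing_"] * (length - len(s))  # pad up to the fixed length
    padded.append(s)
database = np.array(padded)  # shape (sentences, length)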
Transform all your sentences into sequences of indices:

import numpy as np

# supposing you have an array of shape (sentences, length) as strings:
indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences, length))
Transform this into categorical (one-hot) words with the keras function to_categorical():

from keras.utils import to_categorical

cat_sentences = to_categorical(indices)  # shape (sentences, length, dictionary_size)

Hint: keras has lots of useful text preprocessing functions in keras.preprocessing.text; one example follows.
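For instance, keras's Tokenizer can build the word index for you (whether it fits your pipeline is an assumption on my part; note it reserves index 0 by itself and lowercases by default, and you would still add the special markers yourself). Here texts is a hypothetical list of sentence strings:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                 # texts: list of sentence strings
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index             # dict mapping words to indices, starting at 1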
Separate training input and output data:

# input is the sentences except for the last word;
# output is the same sentences shifted one word ahead
x_train = cat_sentences[:, :-1, :]
y_train = cat_sentences[:, 1:, :]
Let's create an LSTM-based model that will predict the next words from the previous words:

model = Sequential()
model.add(LSTM(dontKnow, return_sequences=True, input_shape=(None, dictionary_size)))  # dontKnow: pick a hidden size
# ...add more layers here if you want...
model.add(LSTM(dictionary_size, return_sequences=True, activation='sigmoid'))
# or a Dense(dictionary_size, activation='sigmoid')
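A filled-in sketch of that model, with an assumed hidden size of 128 (pick whatever suits your data):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(None, dictionary_size)))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(dictionary_size, activation='sigmoid'))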
Compile and fit this model with x_train and y_train:

model.compile(....)
model.fit(x_train, y_train, ....)
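One reasonable way to fill those in (the loss, optimizer, epochs and batch size here are assumptions, not requirements):

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(x_train, y_train, epochs=50, batch_size=32)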
Create an identical model using stateful=True in all LSTM layers:

newModel = ......
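A sketch of that stateful copy, matching the 128-unit layout assumed above. Stateful layers need a fixed batch size, hence batch_input_shape with a batch of 1 and one word per step:

newModel = Sequential()
newModel.add(LSTM(128, return_sequences=True, stateful=True,
                  batch_input_shape=(1, 1, dictionary_size)))
newModel.add(LSTM(128, return_sequences=True, stateful=True))
newModel.add(Dense(dictionary_size, activation='sigmoid'))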
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical (one-hot) way, shape (10, dictionary_size).
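For example, one-hot encoding a bag (bag_words here is a hypothetical list of 10 words from your dictionary):

bag = np.zeros((10, dictionary_size), dtype=np.float32)
for i, word in enumerate(bag_words):
    bag[i, word_dic[word]] = 1.0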
Use the model to predict one word from the _start_ word:

# reset the states of the stateful model before you start a 10-word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord)  # startWord is shaped as (1, 1, dictionary_size)
The firstWord will be a vector of size dictionary_size giving (roughly) the probability of each existing word. Compare it to the words in the bag. You can choose the highest probability, or use some random selection if the probabilities of other words in the bag are also good.

# example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
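A sketch of restricting the choice to the bag instead; the masking below is my reading of "compare to the words in the bag", and it uses the raw prediction (flatten it before the one-hot step above):

probs = firstWord.reshape((-1,))           # flatten the raw prediction (before the one-hot step)
mask = bag.max(axis=0)                     # 1.0 for every word that appears in the bag
masked = probs * mask                      # zero out words outside the bag
best = masked.argmax()                     # most probable bag word
# or pick randomly, weighted by the masked probabilities:
chosen = np.random.choice(dictionary_size, p=masked / masked.sum())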
Do the same again, but now input firstWord to the model:

secondWord = newModel.predict(firstWord)  # respect the shapes: input is (1, 1, dictionary_size)
Repeat the process until you get a sentence. Notice that you may find _end_ before all 10 words in the bag are used. You may decide to finish the process with a shorter sentence in that case, especially if the other words' probabilities are low. A full loop sketch follows.
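Putting the whole prediction loop together (a sketch under the same assumptions as above, using plain argmax for selection):

newModel.reset_states()
current = np.zeros((1, 1, dictionary_size))
current[0, 0, word_dic["_start_"]] = 1.0
sentence = []
for _ in range(10):
    pred = newModel.predict(current)         # shape (1, 1, dictionary_size)
    idx = int(pred.reshape((-1,)).argmax())  # or your bag-masked selection
    if idx == word_dic["_end_"]:
        break
    sentence.append(reverse_dic[idx])
    current = np.zeros((1, 1, dictionary_size))
    current[0, 0, idx] = 1.0
print(" ".join(sentence))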