10

Is it possible to use n-grams in Keras?

E.g., the sentences are contained in an X_train dataframe with a "sentences" column.

I use tokenizer from Keras in the following manner:

tokenizer = Tokenizer(lower=True, split=' ')
tokenizer.fit_on_texts(X_train.sentences)
X_train_tokenized = tokenizer.texts_to_sequences(X_train.sentences)

And later I pad the sentences thus:

X_train_sequence = sequence.pad_sequences(X_train_tokenized)

Also I use a simple LSTM network:

model = Sequential()
model.add(Embedding(MAX_FEATURES, 128))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2,
               activation='tanh', return_sequences=True))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
model.add(Dense(number_classes, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

In this case, the tokenizer splits on words. In the Keras docs (https://keras.io/preprocessing/text/) I see that character-level processing is possible, but that is not appropriate for my case.

My main question: Can I use n-grams for NLP tasks (not only sentiment analysis, but any NLP task)?

For clarification: I'd like to consider not just single words but combinations of words, and try whether that helps to model my task.
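To illustrate what I mean by "combinations of words" (a minimal sketch, not part of my current pipeline):

```python
# Hypothetical example: joining adjacent words into bigram tokens,
# which I'd like to feed alongside (or instead of) the unigrams.
sentence = "I like deep learning"
words = sentence.lower().split()
bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
# bigrams == ['i like', 'like deep', 'deep learning']
```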

Veltzer Doron
  • 934
  • 2
  • 10
  • 31
Simplex
  • 1,723
  • 2
  • 17
  • 26
  • 1
    This is a mighty strange NN model you are using there, son – Veltzer Doron Sep 13 '18 at 10:58
  • @VeltzerDoron I am thinking of using bi-grams as well. I am using keras to train a feed forward network using bag of words feature data. So, I'm not using sequence data or a sequence model (RNNs, etc.), so bi-grams make sense. – JoAnn Alvarez Nov 30 '20 at 16:29

2 Answers

4

Unfortunately, the Keras Tokenizer() does not support n-grams. You will need a workaround: tokenize the documents yourself to include n-gram tokens, and then feed the resulting sequences to the neural network.
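A minimal sketch of such a workaround (function and variable names here are illustrative, not a library API): emit unigram plus n-gram tokens per document, build an index, and map each document to an integer sequence that can be padded and passed to an Embedding layer as before.

```python
def ngram_tokens(text, n=2):
    """Return the words of `text` plus all n-grams up to length n,
    with n-gram words joined by underscores."""
    words = text.lower().split()
    tokens = list(words)  # keep the unigrams
    for size in range(2, n + 1):
        for i in range(len(words) - size + 1):
            tokens.append('_'.join(words[i:i + size]))
    return tokens

docs = ["the cat sat", "the dog sat"]
tokenized = [ngram_tokens(d) for d in docs]

# Build a word index (ids start at 1, as Keras reserves 0 for padding)
# and map each document to an integer sequence.
vocab = {tok: i + 1
         for i, tok in enumerate(sorted({t for doc in tokenized for t in doc}))}
sequences = [[vocab[t] for t in doc] for doc in tokenized]
```

The `sequences` list plays the role of `texts_to_sequences` output and can be padded with `sequence.pad_sequences` exactly as in the question.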

Alex
  • 1,447
  • 7
  • 23
  • 48
4

In case you are not aware of it, you can use scikit-learn modules such as CountVectorizer or TfidfVectorizer to generate n-grams, which you can then feed to the network.

Alexander Rossa
  • 1,900
  • 1
  • 22
  • 37
Satheesh K
  • 501
  • 1
  • 3
  • 16