I am working on a Twitter sentiment analysis project. The literature has demonstrated that using information from emojis and emoticons can improve the performance of a sentiment classifier on Twitter data (e.g. IBM's 2015 work, Sentiment Expression via Emoticons on Social Media). Moreover, the emoji2vec project, which learns a representation for each emoji from its description, looks very helpful for Twitter sentiment analysis.
Now I am using Keras to build a sequential model for this sentiment classification. My question arises because, before constructing any sequential model, you first have to pass your text data through the Tokenizer API:
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['Phrase'])
sequences = tokenizer.texts_to_sequences(df['Phrase'])
data = pad_sequences(sequences, maxlen=50)
where df is my pandas DataFrame. Is it possible to add emojis to the Tokenizer? The Tokenizer API keeps only the vocabulary_size most frequent words when building its word-index mapping, and emojis are obviously far less frequent than ordinary words, yet they are significant features for sentiment classification. So I want to add emojis to the Keras Tokenizer and create emoji-index pairs for them.
As for the model, I am building a BiLSTM with pre-trained embeddings (e.g. trained with fastText). How can I combine the emoji representations with the word representations in this task? The following code shows my BiLSTM model:
# BiLSTM model with Conv1D and fastText word embeddings
def get_bi_lstm_model(embedding_matrix):
    model = Sequential()
    model.add(Embedding(input_dim=vocabulary_size, output_dim=dim, input_length=input_length,
                        weights=[embedding_matrix], trainable=False, name='embedding_1'))
    model.add(Dropout(0.2, name='dropout_1'))
    model.add(Conv1D(64, 5, activation='relu', name='conv1d_1'))
    model.add(MaxPooling1D(pool_size=4, name='maxpooling_1'))
    model.add(Bidirectional(LSTM(lstm_output_dim, dropout=0.2, recurrent_dropout=0.2,
                                 return_sequences=True),
                            merge_mode='concat', name='bidirectional_1'))
    model.add(Flatten(name='flatten_1'))
    model.add(Dense(3, activation='softmax', name='dense_1'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy', f1_score])
    return model
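One idea I have been considering is to build a single embedding matrix over the merged vocabulary: look each token up in the fastText vectors if it is a word and in the emoji2vec vectors if it is an emoji, and write both into the same matrix. This assumes both vector sets share the same dimensionality (e.g. 300-d); also, emoji2vec was trained in the word2vec space, so mixing it with fastText vectors is only an approximation. A minimal sketch, where word_vectors and emoji_vectors are hypothetical dicts mapping token -> numpy array:

```python
import numpy as np

def build_embedding_matrix(word_index, word_vectors, emoji_vectors,
                           vocabulary_size, dim):
    """Build one embedding matrix covering words and emojis.

    word_index:    token -> integer index (as in tokenizer.word_index)
    word_vectors:  dict of word -> np.ndarray (e.g. from fastText)
    emoji_vectors: dict of emoji -> np.ndarray (e.g. from emoji2vec)
    Rows for tokens without any pretrained vector stay at zero
    (they could also be randomly initialised).
    """
    matrix = np.zeros((vocabulary_size, dim))
    for token, i in word_index.items():
        if i >= vocabulary_size:
            continue  # token fell outside the num_words window
        vec = emoji_vectors.get(token)  # emoji lookup first
        if vec is None:
            vec = word_vectors.get(token)  # fall back to word vectors
        if vec is not None:
            matrix[i] = vec
    return matrix
```

The resulting matrix would then be passed to the Embedding layer exactly as in the model above, via weights=[embedding_matrix].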
Any help and insights would be appreciated! Thanks! Merry Christmas!