I am classifying the sentiment of reviews (0 or 1) using gensim Doc2Vec and a CNN in TensorFlow 2.2.0:
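import tensorflow as tf
from tensorflow.keras.initializers import Constant

# vocab_size, embedding_dim, maxlen and embedding (the pretrained weight
# matrix) come from the preprocessing step, which is not shown here.
model = tf.keras.Sequential([
    # Frozen embedding layer initialized from the pretrained vectors
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              input_length=maxlen,
                              embeddings_initializer=Constant(embedding),
                              trainable=False),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(10, activation='relu'),
    # Single sigmoid unit for binary sentiment
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history = model.fit(X_train, y_train,
                    epochs=8,
                    validation_split=0.3,
                    batch_size=10)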
I then make predictions and convert the sigmoid probabilities to 0 or 1 using np.round():
predicted = model.predict(X_test)
predicted = np.round(predicted,1).astype(np.int32)
I get great results (~96% accuracy), which suggests that the threshold of 0.5 is working as expected.
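As a sanity check on the conversion step, the sketch below compares np.round(p, 1).astype(np.int32) with an explicit 0.5 cutoff on a few made-up probabilities (the values in p are illustrative assumptions, not outputs of this model). With decimals=1, np.round rounds to one decimal place, and the integer cast then truncates toward zero, so the two conversions do not necessarily agree:

```python
import numpy as np

# Illustrative probabilities only; these are not outputs of the model above.
p = np.array([0.2, 0.6, 0.94, 0.96])

# Round to one decimal place, then cast: the cast truncates toward zero.
rounded_one_decimal = np.round(p, 1).astype(np.int32)

# Explicit 0.5 cutoff on the raw probabilities.
thresholded = (p >= 0.5).astype(np.int32)

print(rounded_one_decimal)  # [0 0 0 1]
print(thresholded)          # [0 1 1 1]
```

If the intended cutoff is 0.5, comparing the two outputs on the real predictions is a cheap way to confirm which threshold is actually being applied.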
However, when I predict on a set of new data, the model still seems to separate bad reviews from good ones, but around a boundary of approximately 0.0 rather than 0.5:
# Example sigmoid outputs for new test reviews:
good_review_1: 0.000052
good_review_2: 0.000098
bad_review_1: 0.112334
bad_review_2: 0.214934
Mind you, the model never saw X_test during training, and it predicts on it just fine. It is only when I introduce a new set of review text strings that I run into incorrect predictions. For new reviews, the only preprocessing I do before calling model.predict() is feeding them through the same tokenizer used for model training:
s = 'This is a sample bad review.'
tokenizer.texts_to_sequences(pd.Series(s))
s = pad_sequences(s, maxlen=maxlen, padding='pre', truncating='pre')
model.predict(s)
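To make the expected input concrete, here is a minimal sketch of what one encoded review should look like by the time it reaches model.predict(): a 2-D integer array of shape (1, maxlen) holding token ids, left-padded with zeros. The encode_review helper and the toy word_index are hypothetical stand-ins for the fitted tokenizer and pad_sequences, simplified for illustration; they are not the Keras implementation.

```python
import string
import numpy as np

def encode_review(text, word_index, maxlen):
    """Simplified, hypothetical stand-in for tokenizing and pre-padding one review."""
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    ids = [word_index[w] for w in words if w in word_index]  # unknown words are dropped
    ids = ids[-maxlen:]                       # 'pre' truncation keeps the last maxlen ids
    padded = np.zeros((1, maxlen), dtype=np.int32)
    if ids:
        padded[0, -len(ids):] = ids           # 'pre' padding leaves zeros on the left
    return padded

# Toy vocabulary, invented for this example.
word_index = {"this": 1, "is": 2, "a": 3, "bad": 4, "review": 5}
x = encode_review("This is a sample bad review.", word_index, maxlen=8)
print(x.shape)  # (1, 8)
print(x)        # [[0 0 0 1 2 3 4 5]]
```

Printing the shape and contents of the array actually passed to model.predict() and comparing them against a row of X_train is a quick way to confirm that new reviews go through the same encoding as the training data.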
I've been trying to make sense of this conundrum, but I'm making little progress. I came across a post that says:
Some sigmoid functions will have this at 0, while some will have it set to a different 'threshold'.
But this still doesn't explain why my model predicts correctly with np.round()'s 0.5 threshold on the X_test dataset (which the model never trained on) and then fails on a new dataset at the same 0.5 threshold.
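On the quoted remark about a threshold at 0: my reading (an assumption on my part, since the post does not spell it out) is that it refers to the value before the sigmoid is applied. Because sigmoid(0) = 0.5, a cutoff of 0 on that raw value is the same decision as a cutoff of 0.5 on the sigmoid output. A minimal check:

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Raw scores below, at, and above zero.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4))
# -2.0 0.1192
# 0.0 0.5
# 2.0 0.8808
```

Since my final layer already applies the sigmoid, the outputs above are probabilities, so I would still expect 0.5 to be the relevant cutoff.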