
I am classifying the sentiment of reviews (0 or 1) using gensim Doc2Vec embeddings and a CNN in TensorFlow 2.2.0:

import tensorflow as tf
from tensorflow.keras.initializers import Constant

model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                input_length=maxlen,
                                embeddings_initializer=Constant(embedding),  # pretrained embedding matrix
                                trainable=False),                            # embeddings stay frozen
      tf.keras.layers.Conv1D(128, 5, activation='relu'),
      tf.keras.layers.GlobalMaxPooling1D(),
      tf.keras.layers.Dense(10, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
    ])

model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    epochs=8,
                    validation_split=0.3,
                    batch_size=10)
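(For context, embedding above is the pretrained weight matrix passed to Constant(...). The snippet below is only a rough sketch of how I build it from the Doc2Vec word vectors; d2v is a placeholder name for my trained gensim model, and tokenizer is the fitted Keras Tokenizer used throughout.)

import numpy as np

# sketch only: d2v stands in for the trained gensim Doc2Vec model
embedding_dim = d2v.vector_size
vocab_size = len(tokenizer.word_index) + 1      # +1 because index 0 is reserved for padding

embedding = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if word in d2v.wv:                          # copy vectors only for words the Doc2Vec model knows
        embedding[idx] = d2v.wv[word]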

I then make predictions and convert the sigmoid probabilities to 0 or 1 using np.round():

import numpy as np

predicted = model.predict(X_test)
predicted = np.round(predicted).astype(np.int32)  # round at 0.5 to get 0/1 labels
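An equivalent way to make the 0.5 cutoff explicit (just a clarifying alternative, not a different pipeline) would be:

# same labels as np.round(): 1 if the sigmoid output exceeds 0.5, else 0
predicted = (model.predict(X_test) > 0.5).astype(np.int32)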

I get great results (~96% accuracy) indicating that the threshold of 0.5 is working as expected...

However, when I try to predict on a set of completely new data, the model still seems to separate bad reviews from good ones, but the scores all cluster near 0.0 instead of splitting around 0.5:

# Example sigmoid outputs for new test reviews:
good_review_1: 0.000052
good_review_2: 0.000098

bad_review_1: 0.112334
bad_review_2: 0.214934

Mind you, the model never saw X_test during training, yet it predicts on it just fine. It's only when I introduce a new set of review text strings that I run into incorrect predictions. For the new reviews, the only preprocessing I do before calling model.predict() is feeding them through the same tokenizer used for model training:

import pandas as pd
from tensorflow.keras.preprocessing.sequence import pad_sequences

s = 'This is a sample bad review.'
seq = tokenizer.texts_to_sequences(pd.Series(s))   # same tokenizer that was fitted on the training texts
seq = pad_sequences(seq, maxlen=maxlen, padding='pre', truncating='pre')

model.predict(seq)
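For completeness, a sanity check I could run on this pipeline (raw_test_texts is a hypothetical list holding the original review strings that X_test was built from) is to re-tokenize one of them and compare it to the stored row:

import numpy as np

# raw_test_texts is hypothetical: the untokenized review strings behind X_test
check = tokenizer.texts_to_sequences([raw_test_texts[0]])
check = pad_sequences(check, maxlen=maxlen, padding='pre', truncating='pre')
print(np.array_equal(check[0], X_test[0]))  # True if inference preprocessing matches training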

I've been trying to make sense of this conundrum, but I'm making little progress. I ran into a post that indicates:

Some sigmoid functions will have this at 0, while some will have it set to a different 'threshold'.

But this still doesn't explain why my model can predict at np.round()'s 0.5 threshold on the X_test dataset (which it never trained on), yet fails on the new dataset at that same 0.5 threshold...

Comment: Can you try with `model.predict(X_test).round()` and check if there is any difference between the outputs. – Nov 11 '20 at 15:36
