
I'm building a CNN to perform sentiment analysis in Keras. Everything is working perfectly: the model is trained and ready to be launched to production.

However, when I try to predict on new unlabelled data using model.predict(), it only outputs the associated probability. I tried np.argmax(), but it always outputs 0 even when it should be 1 (on the test set, my model achieved 80% accuracy).

Here is my code to pre-process the data:

# Pre-processing data
x = df[df.Sentiment != 3].Headlines
y = df[df.Sentiment != 3].Sentiment

# Splitting training, validation, testing dataset
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.3,
                                                                                      random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test,
                                                                  test_size=.5, random_state=SEED)

tokenizer = Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(x_train)

sequences = tokenizer.texts_to_sequences(x_train)
x_train_seq = pad_sequences(sequences, maxlen=MAXLEN)

sequences_val = tokenizer.texts_to_sequences(x_validation)
x_val_seq = pad_sequences(sequences_val, maxlen=MAXLEN)

sequences_test = tokenizer.texts_to_sequences(x_test)
x_test_seq = pad_sequences(sequences_test, maxlen=MAXLEN)

And here is my model:

MAXLEN = 25
NUM_WORDS = 5000
VECTOR_DIMENSION = 100

tweet_input = Input(shape=(MAXLEN,), dtype='int32')

tweet_encoder = Embedding(NUM_WORDS, VECTOR_DIMENSION, input_length=MAXLEN)(tweet_input)

# Combining n-grams to optimize results
bigram_branch = Conv1D(filters=100, kernel_size=2, padding='valid', activation="relu", strides=1)(tweet_encoder)
bigram_branch = GlobalMaxPooling1D()(bigram_branch)
trigram_branch = Conv1D(filters=100, kernel_size=3, padding='valid', activation="relu", strides=1)(tweet_encoder)
trigram_branch = GlobalMaxPooling1D()(trigram_branch)
fourgram_branch = Conv1D(filters=100, kernel_size=4, padding='valid', activation="relu", strides=1)(tweet_encoder)
fourgram_branch = GlobalMaxPooling1D()(fourgram_branch)
merged = concatenate([bigram_branch, trigram_branch, fourgram_branch], axis=1)

merged = Dense(256, activation="relu")(merged)
merged = Dropout(0.25)(merged)
output = Dense(1, activation="sigmoid")(merged)

optimizer = optimizers.adam(0.01)

model = Model(inputs=[tweet_input], outputs=[output])
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=['accuracy'])
model.summary()

# Training the model
history = model.fit(x_train_seq, y_train, batch_size=32, epochs=5, validation_data=(x_val_seq, y_validation))

I also tried to change the number of units in the final Dense layer from 1 to 2, but I get an error:

Error when checking target: expected dense_12 to have shape (2,) but got array with shape (1,)
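
This error usually means the targets still have one value per sample while a 2-unit output layer expects two. As a minimal sketch, assuming the labels are the integers 0 and 1 (the values of `y` below are made up for illustration), they could be one-hot encoded with NumPy before training a 2-unit softmax layer with `categorical_crossentropy`:

```python
import numpy as np

# Hypothetical integer labels (0 = negative, 1 = positive), shape (n_samples,)
y = np.array([0, 1, 1, 0])

# Index the 2x2 identity matrix with the labels:
# one row per sample, one column per class, shape (n_samples, 2)
y_one_hot = np.eye(2, dtype="float32")[y]
```

Alternatively, the integer labels can be kept as they are if the loss is switched to `sparse_categorical_crossentropy`. For a two-class problem, the single sigmoid unit is equally valid and needs neither change.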
E_net4
RFTexas
  • Welcome to Stack Overflow! The output is a single activation, so it seems to be the probability of a single binary class. Just take an operating point threshold (e.g. 0.5) and predict *true* if the probability is equal or larger. Indeed, there's very likely another question in this site which will be useful to you, but may be hard to find at the moment. – E_net4 Aug 25 '18 at 15:34

1 Answer


You are doing binary classification, so you have a Dense layer consisting of one unit with a sigmoid activation function. The sigmoid function outputs a value in the range [0, 1], which corresponds to the probability that the given sample belongs to the positive class (i.e. class one). Everything below 0.5 is labeled with zero (i.e. the negative class) and everything above 0.5 is labeled with one. So to find the predicted class you can do the following:

preds = model.predict(data)
class_one = preds > 0.5

The true elements of class_one correspond to samples labeled with one (i.e. positive class).

Bonus: to find the accuracy of your predictions you can easily compare class_one with the true labels:

acc = np.mean(class_one == true_labels)

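Note that I have assumed that true_labels consists of zeros and ones and has the same shape as class_one. Since preds has shape (n_samples, 1), flatten class_one first (e.g. class_one.ravel()) if true_labels is one-dimensional; otherwise the comparison broadcasts to an (n_samples, n_samples) array and the mean is no longer the accuracy.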
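
The thresholding and accuracy steps above can be sketched end to end with NumPy alone. The arrays below are made-up stand-ins for the output of model.predict() and for the true labels. The sketch also shows why np.argmax always returned 0 in the question: with a single output column, the index of the largest value along the last axis can only be 0.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like the model's predictions
preds = np.array([[0.08], [0.91], [0.47], [0.66]])

# argmax over the last axis of a single-column array is always 0
argmax_labels = np.argmax(preds, axis=-1)

# Thresholding at 0.5 gives the class labels; ravel() flattens to (n_samples,)
pred_labels = (preds > 0.5).astype("int32").ravel()

# Hypothetical ground-truth labels, shape (n_samples,)
true_labels = np.array([0, 1, 1, 1])

# Fraction of samples whose predicted label matches the true label
acc = np.mean(pred_labels == true_labels)
```

Flattening with ravel() keeps the predicted labels and the one-dimensional true labels the same shape, so the comparison is element-wise.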


Further, if your model were defined using the Sequential class, then you could simply use the predict_classes method:

pred_labels = model.predict_classes(data)

However, since you are using the Keras functional API to construct your model (which is a very good thing to do, in my opinion), you can't use the predict_classes method, since it is ill-defined for such models.
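
For a functional model, the same rule can be applied by hand to the output of predict. The helper below is a hypothetical sketch, not part of Keras (the name probabilities_to_classes is made up). It takes the argmax when there are several output columns and thresholds at 0.5 when there is a single sigmoid column. Unlike predict_classes, it flattens the single-column result to one dimension.

```python
import numpy as np

def probabilities_to_classes(proba, threshold=0.5):
    """Convert predicted probabilities to integer class labels (hypothetical helper)."""
    proba = np.asarray(proba)
    if proba.ndim > 1 and proba.shape[-1] > 1:
        # Several output units (e.g. softmax): pick the most probable class
        return proba.argmax(axis=-1)
    # Single sigmoid unit: compare against the threshold and flatten
    return (proba > threshold).astype("int32").ravel()
```

It would be called as probabilities_to_classes(model.predict(data)). This also avoids depending on predict_classes, which was later deprecated and removed from newer TensorFlow/Keras releases.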

today
  • Thanks for your answer! It's clearer now. However, I think I have a bigger problem. When I try to predict unlabelled data with the model, I always get a highly positive answer even when the data are obviously negative. I thought at first that my model had overfitted the training data, so I tried to classify a text that is in my test set. It is considered highly negative when I evaluate the model, but when I try to predict it, it is highly positive. I use the same tokenizer as the one with which my model was trained. – RFTexas Aug 26 '18 at 00:46
  • @RFTexas Could you please clarify what you mean by "evaluate the model" and "try to predict it"? For the latter I guess you use the `predict` method, but I can't understand what you mean by "evaluate" here. – today Aug 26 '18 at 04:59
  • First I train my model and I optimize it on validation set. Then I use the method 'evaluate' to see how my model is performing on the test set. When I come up with a satisfying accuracy, I want to use the model to predict new data. The problem is that when a sentence like "Price falls vertical after failed IPO" is in the test set it is labelled by my model as negative (obviously), with a probability around 0. But when I try to label this same sentence with my model, it says that it is highly positive (around 1). – RFTexas Aug 26 '18 at 16:23
  • @RFTexas I can't understand these parts: "...it is labelled by **my model** as **negative**..." and "... try to label with **my model**, it says that it is highly **positive**". How can the model predict both negative and positive given the same data? – today Aug 26 '18 at 16:37
  • That's exactly what I'm trying to figure out!! It's weird! I thought at first that it was because of the tokenizer. – RFTexas Aug 26 '18 at 19:53
  • @RFTexas You should call `predict` on your data. `pred = model.predict(mydata)` and that's it. `pred` would be one value in range [0,1]. It cannot be both 0 and 1; even if you call `predict` multiple times given the same data, the result would be the same. – today Aug 27 '18 at 06:27
  • @today Hello. So we can return the 0/1 predictions, but how do we predict on real samples? e.g. I want to predict if an airline customer is satisfied or not based on service ratings and cabin class etc. The original dataset has all string values e.g. `class: business class` and rankings like `wifi: 3`. But we always transform our data to dummies for ML purposes. So how do we predict 0/1 from values that are now dummies? And do we need to use every single feature in the set or can we omit features? Let's say in a real-world test, the row doesn't have a `wifi` label? what then? – Edison Jul 11 '22 at 09:20