
I'm addressing a sentence-level binary classification task. My data consists of 3 subarrays of tokens: left context, core, and right context.

I used Keras to devise several alternative Convolutional Neural Networks and to validate which one best fits my problem.

I'm a newbie in Python and Keras, so I decided to start with simpler solutions in order to test which changes improve my metrics (accuracy, precision, recall, F1 and AUC-ROC). The first simplification concerned the input data: I decided to ignore the contexts and create a Keras Sequential model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 500)               0         
_________________________________________________________________
masking_1 (Masking)          (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 100)          64025600  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 497, 128)          51328     
_________________________________________________________________
average_pooling1d_1 (Average (None, 62, 128)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 62, 128)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 61, 256)           65792     
_________________________________________________________________
dropout_2 (Dropout)          (None, 61, 256)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 54, 32)            65568     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dropout_3 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
=================================================================

As you can see, I use a fixed input size, so I applied padding as a preprocessing step. I also used an embedding layer initialized from a Word2Vec model.
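
For context, a model producing a summary like the one above could be written roughly as follows. This is only a sketch reconstructed from the shapes and parameter counts in the summary: the Masking layer is omitted, and the kernel sizes, pool size, dropout rates, activations and the placeholder embedding matrix are my assumptions, not the author's actual code.

# Rough sketch, not the author's code: a padded-sequence CNN whose layer
# shapes match the summary above. The Word2Vec weights are replaced by a
# zero placeholder so the snippet is self-contained.
import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, AveragePooling1D,
                          Dropout, GlobalMaxPooling1D, Dense)

MAX_LEN = 500        # fixed sequence length after padding
VOCAB_SIZE = 640256  # inferred from the 64,025,600 embedding parameters
EMB_DIM = 100        # Word2Vec vector size

# In practice this matrix holds the Word2Vec vectors (row i = vector of token i).
embedding_matrix = np.zeros((VOCAB_SIZE, EMB_DIM), dtype='float32')

inputs = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMB_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False)(inputs)
x = Conv1D(128, 4, activation='relu')(x)   # (None, 497, 128)
x = AveragePooling1D(pool_size=8)(x)       # (None, 62, 128)
x = Dropout(0.5)(x)
x = Conv1D(256, 2, activation='relu')(x)   # (None, 61, 256)
x = Dropout(0.5)(x)
x = Conv1D(32, 8, activation='relu')(x)    # (None, 54, 32)
x = GlobalMaxPooling1D()(x)                # (None, 32)
x = Dense(16, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(2, activation='sigmoid')(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])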

This model returns the following results:

P       0.875457875
R       0.878676471
F1      0.87706422
AUC-ROC 0.906102654

Next, I wanted to select a subarray of the input data inside my CNN by means of Lambda layers. I use the following definition for my Lambda layer:

Lambda(lambda x: x[:, 1], output_shape=(500,))(input)
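
This slice is meant to pick the core subarray (index 1) out of an input that stacks the three 500-token subarrays. A minimal reconstruction of that wiring (mine, not the author's code) would look like this:

# Minimal reconstruction (not the author's code): slicing the core
# subarray out of a (3, 500) input with a Lambda layer.
from keras.layers import Input, Lambda

inputs = Input(shape=(3, 500))       # [left context, core, right context]
core = Lambda(lambda t: t[:, 1],     # keep only the subarray at index 1
              output_shape=(500,))(inputs)
# `core` now has shape (None, 500) and can feed the same Masking /
# Embedding / Conv1D stack as in the first model.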

And this is the summary of my new CNN (as you can see, it's almost the same as the previous one):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 3, 500)            0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 500)               0         
_________________________________________________________________
masking_1 (Masking)          (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 100)          64025600  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 497, 128)          51328     
_________________________________________________________________
average_pooling1d_1 (Average (None, 62, 128)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 62, 128)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 61, 256)           65792     
_________________________________________________________________
dropout_2 (Dropout)          (None, 61, 256)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 54, 32)            65568     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dropout_3 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
=================================================================

But the results were terrible: accuracy got stuck at 60%, and precision, recall and F1 were very low (< 0.10) compared with the first model's results.

I don't know what's happening, and I don't know whether these networks are more different than I thought.

Any clue regarding this issue?

Fernando Ortega

1 Answer


Some initial questions (I would comment but don't have sufficient rep yet):

(1) What's the motivation for using a CNN? CNNs are good at picking out local features in a 2-D array of input values: for example, if you imagine a black-and-white picture as a 2-D array of integers where the integers represent greyscale values, they might pick out clumps of pixels that represent things like edges, corners or diagonal white lines. Unless you have a reason to expect your data, like a picture, to have such locally clustered features, with points that are nearer to each other both horizontally and vertically in your input arrays being more relevant to each other, you may be better off with dense layers, where there are no assumptions about which input features are relevant to which others. Start with, say, 2 layers and see where that gets you.

(2) Assuming you are confident about the shape of your architecture, have you tried lowering the learning rate? That's the first thing to try in any NN that is not converging well.
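
In Keras this just means passing an optimizer instance instead of a string when compiling; the tiny model and the value 1e-4 below are only an illustration, not part of the original answer.

# Illustration only: lowering the learning rate by passing an optimizer
# instance to compile() instead of the default string shortcut.
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential([Dense(16, activation='relu', input_shape=(500,)),
                    Dense(2, activation='sigmoid')])
model.compile(optimizer=Adam(lr=1e-4),   # Adam's default lr is 1e-3
              loss='binary_crossentropy',
              metrics=['accuracy'])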

(3) Depending on the task, you may be better off using a dictionary and one-hot encoding for your words, especially if it's a relatively simple classification and context isn't too big a deal. Word2Vec means you are encoding the words as numbers, which has implications for gradient descent. It's hard to say without knowing what you are trying to achieve, but if you don't have a reasonable idea of why Word2Vec is a good choice, it may not be...
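
To make the dictionary / one-hot suggestion (and the dense-layer baseline from point 1) concrete, here is an illustrative sketch; the toy texts, the 20,000-word cap and the layer sizes are mine, not part of the answer.

# Illustrative sketch only: a binary bag-of-words encoding (one-hot per
# word in a capped dictionary) feeding a small dense classifier.
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Dropout

texts = ["example core sentence one", "another example sentence"]  # toy data

tokenizer = Tokenizer(num_words=20000)                 # cap the dictionary size
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_matrix(texts, mode='binary')    # shape (samples, 20000)

model = Sequential([
    Dense(64, activation='relu', input_shape=(X.shape[1],)),
    Dropout(0.5),
    Dense(2, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])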

This link explains the difference between CNNs and dense layers well, so it may help you judge.

Tom Walker
  • I rephrased my question to clarify it, because I don't think you understood it. Nevertheless, I'm going to answer your questions. – Fernando Ortega Sep 30 '17 at 07:58
  • 1. In my task, local features involving nearby tokens are quite important, so CNNs work well for this kind of task. You can find several examples of CNNs for NLP in the literature: https://dl.acm.org/citation.cfm?id=2969342, http://www.anthology.aclweb.org/N/N15/N15-1011.pdf, http://anthology.aclweb.org/C/C14/C14-1008.pdf, http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf. However, I will try Dense layers only, in order to compare results. – Fernando Ortega Sep 30 '17 at 07:58
  • 2. I'm comparing two identical CNN architectures whose only difference is the Lambda layer (same conv layers, same dense layers, same pooling, same dropout, same learning rate). One of them performs quite well and the other performs quite badly. 3. I need to take semantic features into account; that's the main reason to use a word-embedding solution to vectorise the tokens. One-hot vectors don't help me much with this problem. I really appreciate your answer. – Fernando Ortega Sep 30 '17 at 08:02
  • Yes, it makes sense to use CNNs in a sentence NLP context, where the "nearby" words are more relevant to each other. – Tom Walker Sep 30 '17 at 08:44
  • I see, you are trying to understand why the Lambda layer breaks things; this is clear now. What is the shape of the data going into the Lambda layer, and what is the intended output? It looks like you are trying to extract a single column from the inputs, which would imply you are throwing away a large chunk of the input (unless it only has one column). Perhaps start with an identity Lambda, prove this is equivalent, then tweak the Lambda from there? FWIW, as you are using Keras, the docs say the output_shape param is only relevant when using Theano. – Tom Walker Sep 30 '17 at 09:02
  • Also, ignore what I said re dense layers given your explanation; it sounds like you just need to debug the Lambda layer, as you note. – Tom Walker Sep 30 '17 at 10:22
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155654/discussion-between-tom-walker-and-fernando-ortega). – Tom Walker Sep 30 '17 at 10:44
  • I finally found the problem: I didn't set any activation on the last Dense layer, so it used the linear activation instead of sigmoid. Once I changed that, the results were as good as the first model's (see the sketch after these comments). Thanks for your answers ;) – Fernando Ortega Oct 02 '17 at 15:50
  • Glad you solved it! No replies earlier from me as my entire system has been down... – Tom Walker Oct 04 '17 at 18:39
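
For anyone who lands here with the same symptom: Keras' Dense layer defaults to a linear activation when none is given, so a classifier head has to name its activation explicitly. The snippet below is only a minimal illustration of the difference described in the comment above, not the author's code.

# Dense defaults to activation=None (linear), so the output activation of a
# classifier head must be set explicitly.
from keras.layers import Dense

head_linear = Dense(2)                         # linear outputs: the accidental setup
head_sigmoid = Dense(2, activation='sigmoid')  # the fix described in the comments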