I'm working on a sentence-level binary classification task. Each sample consists of three subarrays of tokens: left context, core, and right context.
I'm using Keras to build several alternative convolutional neural networks (CNNs) and evaluate which one best fits my problem.
I'm new to Python and Keras, so I decided to start with simpler solutions to test which changes improve my metrics (accuracy, precision, recall, F1 and AUC-ROC). The first simplification concerned the input data: I ignored the contexts and fed only the core tokens to a Keras Sequential model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 500) 0
_________________________________________________________________
masking_1 (Masking) (None, 500) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 500, 100) 64025600
_________________________________________________________________
conv1d_1 (Conv1D) (None, 497, 128) 51328
_________________________________________________________________
average_pooling1d_1 (Average (None, 62, 128) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 62, 128) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 61, 256) 65792
_________________________________________________________________
dropout_2 (Dropout) (None, 61, 256) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 54, 32) 65568
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32) 0
_________________________________________________________________
dense_1 (Dense) (None, 16) 528
_________________________________________________________________
dropout_3 (Dropout) (None, 16) 0
_________________________________________________________________
dense_2 (Dense) (None, 2) 34
=================================================================
As you can see, the model expects fixed-size inputs, so I pad every token sequence to length 500 as a preprocessing step. The embedding layer is initialized with word vectors from a Word2Vec model.
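For reference, this is roughly how I prepare the inputs and build the embedding layer. It is only a sketch: core_sequences, word_index and w2v are illustrative names for my token-id sequences, vocabulary index and trained gensim Word2Vec model, and the vocabulary size is inferred from the 64,025,600 embedding parameters in the summary (640,256 × 100).

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

MAX_LEN = 500        # fixed input length, as in the summary
EMBED_DIM = 100      # Word2Vec vector size
VOCAB_SIZE = 640256  # 640256 * 100 = 64,025,600 params

# core_sequences, word_index and w2v are defined elsewhere (illustrative names).
# Pad (or truncate) every token-id sequence to MAX_LEN; 0 is the padding value.
X_core = pad_sequences(core_sequences, maxlen=MAX_LEN, padding='post', value=0)

# Copy the pretrained Word2Vec vectors into the embedding matrix.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for word, idx in word_index.items():
    if word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]

embedding_layer = Embedding(VOCAB_SIZE, EMBED_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN)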
This model returns the following results:
Precision  0.875457875
Recall     0.878676471
F1         0.87706422
AUC-ROC    0.906102654
Next, I wanted to select one of the three subarrays of the input data inside the CNN itself by means of a Lambda layer. I use the following definition for my Lambda layer:
Lambda(lambda x: x[:, 1], output_shape=(500,))(input)  # keep only the core subarray (index 1 along axis 1)
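In context, the slicing is wired in like this (a minimal sketch using the functional API; the variable names and the Masking argument are illustrative, while the shapes match the summary below):

from keras.layers import Input, Lambda, Masking

inputs = Input(shape=(3, 500))  # stacked left context, core, right context
core = Lambda(lambda x: x[:, 1], output_shape=(500,))(inputs)  # -> (None, 500)
masked = Masking(mask_value=0)(core)
# ... the remaining layers are identical to the first model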
And this is the summary of my new CNN (as you can see, it's almost the same as the previous one):
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 3, 500) 0
_________________________________________________________________
lambda_1 (Lambda) (None, 500) 0
_________________________________________________________________
masking_1 (Masking) (None, 500) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 500, 100) 64025600
_________________________________________________________________
conv1d_1 (Conv1D) (None, 497, 128) 51328
_________________________________________________________________
average_pooling1d_1 (Average (None, 62, 128) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 62, 128) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 61, 256) 65792
_________________________________________________________________
dropout_2 (Dropout) (None, 61, 256) 0
_________________________________________________________________
conv1d_3 (Conv1D) (None, 54, 32) 65568
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32) 0
_________________________________________________________________
dense_1 (Dense) (None, 16) 528
_________________________________________________________________
dropout_3 (Dropout) (None, 16) 0
_________________________________________________________________
dense_2 (Dense) (None, 2) 34
=================================================================
But the results were terrible: accuracy plateaus at around 60%, and precision, recall and F1 were far lower (< 0.10) than the first model's results.
I don't know what's happening, and I don't know whether these networks are more different than I thought.
Any clue regarding this issue?