Disclaimer: newbie to Keras and Python.

Hello everybody, I am trying to implement a neural network in Keras following the specs presented in this paper: https://arxiv.org/pdf/1605.09507.pdf.

First of all, I have some doubts regarding the network architecture (section III, subsection B of the paper). The output shape of my network does not match the one reported in Table I of the paper, even though I followed the specs given in subsection B.

Here is the code of my network:

filtersNumber = 32
filtersReceptiveField = (3, 3)
filtersStride = (1, 1)
maxpoolSize = (3, 3)
maxpoolStride = (1, 1)
layersNumber = 4
myActivation = 'relu'
inputShape = (128,43,1)
classesNumber = 11

def myActivationFunction(model, activation):
    if activation == 'tanh' or activation == 'relu':
        model.add(Activation(activation))
    elif activation == 'prelu':
        model.add(PReLU())
    elif activation == 'lrelu_0.01':
        model.add(LeakyReLU(alpha=0.01))
    elif activation == 'lrelu_0.33':
        model.add(LeakyReLU(alpha=0.33))
    return model

model = Sequential()
for index in range(layersNumber):
    if index == 0: 
        model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride,padding='same',input_shape=inputShape))
    else:
        model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride, padding='same'))
    model = myActivationFunction(model, myActivation)
    model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride, padding='same'))
    model = myActivationFunction(model,myActivation)

    if index != (layersNumber-1):
        model.add(MaxPooling2D(pool_size=maxpoolSize,strides=maxpoolStride))
        model.add(Dropout(0.25))
        filtersNumber = filtersNumber*2
    else: 
        model.add(GlobalMaxPooling2D())
        model.add(Dense(1024))
        model = myActivationFunction(model, myActivation)
        model.add(Dropout(0.50))
        model.add(Dense(classesNumber))
        model.add(Activation('sigmoid'))
model.summary()

And here is the model.summary():

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 128, 43, 32)       320       
_________________________________________________________________
activation_1 (Activation)    (None, 128, 43, 32)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 128, 43, 32)       9248      
_________________________________________________________________
activation_2 (Activation)    (None, 128, 43, 32)       0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 126, 41, 32)       0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 126, 41, 32)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 126, 41, 64)       18496     
_________________________________________________________________
activation_3 (Activation)    (None, 126, 41, 64)       0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 126, 41, 64)       36928     
_________________________________________________________________
activation_4 (Activation)    (None, 126, 41, 64)       0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 124, 39, 64)       0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 124, 39, 64)       0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 124, 39, 128)      73856     
_________________________________________________________________
activation_5 (Activation)    (None, 124, 39, 128)      0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 124, 39, 128)      147584    
_________________________________________________________________
activation_6 (Activation)    (None, 124, 39, 128)      0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 122, 37, 128)      0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 122, 37, 128)      0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 122, 37, 256)      295168    
_________________________________________________________________
activation_7 (Activation)    (None, 122, 37, 256)      0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 122, 37, 256)      590080    
_________________________________________________________________
activation_8 (Activation)    (None, 122, 37, 256)      0         
_________________________________________________________________
global_max_pooling2d_1 (Glob (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              263168    
_________________________________________________________________
activation_9 (Activation)    (None, 1024)              0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 11)                11275     
_________________________________________________________________
activation_10 (Activation)   (None, 11)                0         
=================================================================
Total params: 1,446,123
Trainable params: 1,446,123
Non-trainable params: 0
_________________________________________________________________

After some attempts, I managed to get exactly the dimensions of Table I with the following code. As you can see, I had to insert an additional zero-padding layer before each convolution layer, on top of "padding = same", and I had to drop the max-pool stride argument (so the stride defaults to the pool size, according to the Keras documentation).

filtersNumber = 32
filtersReceptiveField = (3, 3)
filtersStride = (1, 1)
maxpoolSize = (3, 3)
maxpoolStride = (1, 1)
zeroPadding = (1, 1)
layersNumber = 4
myActivation = 'relu'
inputShape = (128,43,1)
classesNumber = 11

def myActivationFunction(model, activation):
    if activation == 'tanh' or activation == 'relu':
        model.add(Activation(activation))
    elif activation == 'prelu':
        model.add(PReLU())
    elif activation == 'lrelu_0.01':
        model.add(LeakyReLU(alpha=0.01))
    elif activation == 'lrelu_0.33':
        model.add(LeakyReLU(alpha=0.33))
    return model

model = Sequential()
for index in range(layersNumber):
    if index == 0: # for the first layer, specify input shape
        model.add(ZeroPadding2D(zeroPadding,input_shape=inputShape))
    else:
        model.add(ZeroPadding2D(zeroPadding))
    model.add(Conv2D(filtersNumber, filtersReceptiveField, strides=filtersStride, padding='same'))
    model = myActivationFunction(model, myActivation)
    model.add(ZeroPadding2D(zeroPadding))
    model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride, padding='same'))
    model = myActivationFunction(model, myActivation)
    if index != (layersNumber-1):
        model.add(MaxPooling2D(pool_size=maxpoolSize))
        model.add(Dropout(0.25))
        filtersNumber = filtersNumber*2
    else: # for the last layer
        model.add(GlobalMaxPooling2D())
        model.add(Dense(1024))
        model = myActivationFunction(model, myActivation)
        model.add(Dropout(0.50))
        model.add(Dense(classesNumber))
        model.add(Activation('sigmoid'))
model.summary()

First question: Shouldn't "padding=same" be enough for zero padding, considering that the author says "the input for each convolution layer is zero-padded with 1 × 1 to preserve spatial resolution"? And is the max-pool stride of 1 a mistake by the author, or am I missing something?
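
The shape arithmetic behind this question can be checked without Keras. The sketch below is a plain-Python approximation, assuming the usual rules: a stride-1 convolution with padding='same' keeps height and width, ZeroPadding2D((1, 1)) adds 2 to each, and an unpadded max-pool produces floor((n - pool) / stride) + 1. The helper names are made up for illustration.

```python
def pool_out(n, pool, stride):
    # Output length of an unpadded ('valid') pooling window along one axis.
    return (n - pool) // stride + 1


def trace_shapes(height, width, blocks=4, extra_pad=0, pool=3, pool_stride=1):
    """Return the (height, width) of the second conv output in each block.

    extra_pad is the per-side zero padding added before every convolution;
    the convolution itself is assumed to use padding='same' with stride 1.
    """
    shapes = []
    for block in range(blocks):
        for _ in range(2):  # two conv layers per block
            height += 2 * extra_pad
            width += 2 * extra_pad
        shapes.append((height, width))
        if block != blocks - 1:  # the last block uses global max pooling
            height = pool_out(height, pool, pool_stride)
            width = pool_out(width, pool, pool_stride)
    return shapes


# First attempt: no extra padding, max-pool stride 1.
first_attempt = trace_shapes(128, 43, extra_pad=0, pool_stride=1)
# Second attempt: extra 1 x 1 padding, max-pool stride equal to the pool size.
second_attempt = trace_shapes(128, 43, extra_pad=1, pool_stride=3)
print(first_attempt)
print(second_attempt)
```

The first list reproduces the convolution output sizes shown in the model.summary() above (128x43, 126x41, 124x39, 122x37). The second is what the extra padding plus a pool stride of 3 gives, i.e. the configuration reported above as matching Table I.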

By the way, using these new specs I tried to train the network, but unfortunately the loss and the val_loss did not improve, and training was stopped because "the val_loss did not decrease for more than three epochs", as stated in the paper.

Train on 5699 samples, validate on 1006 samples
Epoch 1/1000
5699/5699 [==============================] - 559s 98ms/step - loss: 2.4453 - acc: 0.0635 - val_loss: 2.3979 - val_acc: 0.0447
Epoch 2/1000
5699/5699 [==============================] - 583s 102ms/step - loss: 2.9140 - acc: 0.0602 - val_loss: 3.4699 - val_acc: 0.0447
Epoch 3/1000
5699/5699 [==============================] - 571s 100ms/step - loss: 3.4037 - acc: 0.0604 - val_loss: 3.4699 - val_acc: 0.0447
Epoch 4/1000
5699/5699 [==============================] - 592s 104ms/step - loss: 4.2809 - acc: 0.0598 - val_loss: 4.5773 - val_acc: 0.0447  
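
The stop after epoch 4 is consistent with the EarlyStopping(monitor='val_loss', patience=3) callback in the training code. As a rough illustration, the loop below replays the logged val_loss values through a simplified early-stopping rule: stop once the monitored value has failed to improve for `patience` consecutive epochs. It is a sketch of the idea, not the Keras source, and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, patience):
    # Return the 1-based epoch at which training would stop, or None.
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # new best value, reset the counter
            wait = 0
        else:
            wait += 1    # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None


logged_val_loss = [2.3979, 3.4699, 3.4699, 4.5773]  # copied from the log above
stopped_at = stopping_epoch(logged_val_loss, patience=3)
print(stopped_at)
```

Because the best val_loss is reached in the very first epoch, the three following epochs exhaust the patience budget.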

Here is my training code (specs taken from Section III subsection C):

import numpy as np
import os
import keras
from keras.models import Sequential
from keras.layers import ZeroPadding2D,Conv2D,Activation,MaxPooling2D,Dropout,GlobalMaxPooling2D,Dense,PReLU,LeakyReLU
from keras.callbacks import EarlyStopping

def myActivationFunction(model, convAct): 
    if convAct == 'tanh' or convAct == 'relu':
        model.add(Activation(convAct))
    elif convAct == 'prelu':
        model.add(PReLU())
    elif convAct == 'lrelu_0.01':
        model.add(LeakyReLU(alpha=0.01))
    elif convAct == 'lrelu_0.33':
        model.add(LeakyReLU(alpha=0.33))
    return model


def buildCNN(inputShape, classesNumber, myActivation):
    # Paper: section III, subsection B: Network Architecture
    filtersNumber = 32
    filtersReceptiveField = (3, 3)
    filtersStride = (1, 1)
    zeroPadding = (1, 1)
    maxpoolSize = (3, 3)
    maxpoolStride = (1, 1)
    layersNumber = 4

    model = Sequential()
    for index in range(layersNumber):
        if index == 0:  
            model.add(ZeroPadding2D(zeroPadding, input_shape=inputShape))
        else:
            model.add(ZeroPadding2D(zeroPadding))
        model.add(Conv2D(filtersNumber, filtersReceptiveField, strides=filtersStride, padding='same'))
        model = myActivationFunction(model, myActivation)
        model.add(ZeroPadding2D(zeroPadding))
        model.add(Conv2D(filtersNumber, filtersReceptiveField, strides=filtersStride, padding='same'))
        model = myActivationFunction(model, myActivation)
        if index != (layersNumber - 1):
            model.add(MaxPooling2D(
                pool_size=maxpoolSize)) 
            model.add(Dropout(0.25))
            filtersNumber = filtersNumber * 2
        else: 
            model.add(GlobalMaxPooling2D())
            model.add(Dense(1024))
            model = myActivationFunction(model, myActivation)
            model.add(Dropout(0.50))
            model.add(Dense(classesNumber))
            model.add(Activation('sigmoid'))
    return model



if __name__ == '__main__':

    import argparse
    parser = argparse.ArgumentParser(description="Trains the network using training dataset")
    parser.add_argument("-w", "--window", type=float, default=3.0, choices=[0.5, 1.0, 1.5, 3.0],
                        help="Analysis window size. Choose from 0.5, 1.0, 1.5, 3.0. Default: 3.0")
    parser.add_argument("-t", "--threshold", type=float, default=0.55, choices=[0.20, 0.25, 0.30, 0.35, 0.40, 0.45,
                                                                 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80],
                        metavar="[0.20:0.05:0.80]",
                        help="Identification threshold. Choose from 0.20 to 0.80 (step size 0.05). Default: 0.55")
    parser.add_argument("-a", metavar=" ", default="relu",
                        choices=["tanh", "relu", "prelu", "lrelu_0.01", "lrelu_0.33"],
                        help="activation function. Choose from tanh, relu, prelu, lrelu_0.01, lrelu_0.33. Default: relu")
    parser.add_argument("-p", "--path", default="Preproc", help="path of preprocessed files (default: Preproc)")
    args = parser.parse_args()

    X_train = np.load(args.path+"/X_train_"+str(args.window)+"s.npy")
    Y_train = np.load(args.path+"/Y_train_"+str(args.window)+"s.npy")


    batchSize = 128
    epochsNum = 1000 

    model = buildCNN((X_train.shape[1],X_train.shape[2],X_train.shape[3]),Y_train.shape[1], args.a)
    # model.summary()

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    earlyStopping = EarlyStopping(monitor='val_loss', patience=3) 

    model.fit(X_train, Y_train, batch_size=batchSize, epochs=epochsNum, validation_split=0.15, callbacks=[earlyStopping])

At this point I thought that maybe there is something wrong in my training data preprocessing, but after several reviews I couldn't manage to find any error. Here is my preprocessing code (you can find the specs in Section III, subsection A of the paper):

import librosa
import librosa.display
import numpy as np
import os
import shutil
import keras
# import matplotlib.pyplot as plt


# Paper: section III, subsection A: Audio Preprocessing
def preprocess_dataset(input_path, output_path):
    for root, directories, filenames in os.walk(input_path):
        for directory in directories:  
            if not os.path.exists(output_path + os.path.join(os.path.relpath(root, input_path), directory)):  
                os.makedirs(output_path + os.path.join(os.path.relpath(root, input_path), directory)) 
            else:
                return

        for filename in filenames:  
            if filename.endswith(".wav"):  
                audio_signal, sample_rate = librosa.load(
                    os.path.join(root, filename))  # audio is mixed to mono and resampled to 22050 Hz
                normalized_audio_signal = librosa.util.normalize(audio_signal)  # audio normalization by its max value
                # Compute mel-spectrogram with the following specs:
                # - STFT window length: 1024 samples
                # - hop size: 512 samples
                # - mel frequency bins: 128
                # (y and sr are passed by keyword, which works across librosa versions)
                mel_spect = librosa.feature.melspectrogram(y=normalized_audio_signal, sr=sample_rate, n_fft=1024,
                                                           hop_length=512, n_mels=128)

                log_mel_spect = np.log(np.maximum(1e-10, mel_spect)) # add a threshold to avoid -inf results
                log_mel_spect = log_mel_spect[:,:,np.newaxis]  # add new axis for keras channel last mode


                filename, fileExtension = os.path.splitext(filename)  # split file name from extension
                np.save(output_path + os.path.join(os.path.relpath(root, input_path), filename), log_mel_spect)  # save as .npy file
                # librosa.display.specshow(log_mel_spect, y_axis='mel', x_axis='time')
                # plt.show()
            elif filename.endswith(".txt"):  # copy files containing testing labels
                shutil.copy(os.path.join(root, filename), output_path + os.path.join(os.path.relpath(root, input_path), filename))


def training_vectors_init(training_path, chunks_numb):
    classes_names = sorted(os.listdir(training_path))  
    total_classes = len(classes_names)  
    audio_path = training_path + classes_names[0] + '/'  
    infilename = os.listdir(audio_path)[0]  
    melgram = np.load(audio_path + infilename)  
    melgram_dimensions = melgram.shape  
    total_training_files = 0  # must be initialised before it is accumulated below
    for dirpath, dirnames, filenames in os.walk(training_path):
        total_training_files = total_training_files + len(filenames)

    melgram_chunk_length = int(melgram_dimensions[1] / chunks_numb) 
    x_train = np.zeros(((total_training_files * chunks_numb), melgram_dimensions[0], melgram_chunk_length, melgram_dimensions[2])) 
    y_train = np.zeros(((total_training_files * chunks_numb), total_classes)) 
    return classes_names,total_classes,x_train,y_train,melgram_chunk_length


def shuffle_xy(x, y):
    assert x.shape[0] == y.shape[0], "Dimensions problem"
    idx = np.random.permutation(y.shape[0])  # one random order shared by x and y
    return x[idx], y[idx]  # fancy indexing returns shuffled copies


def build_training_dataset(preproc_path, training_win_len):
    training_path = preproc_path + "Training/"
    training_audio_length = 3  # training audio length (seconds)
    chunks_numb = int(training_audio_length / training_win_len)
    classes_names,total_classes,x_train,y_train,melgram_chunk_length = training_vectors_init(training_path, chunks_numb)
    count = 0
    for class_index, class_name in enumerate(classes_names):  
        one_hot_label = keras.utils.to_categorical(class_index,
                                           num_classes=total_classes) 
        file_names = os.listdir(training_path + class_name) 

        for file_name in file_names:  
            audio_path = training_path + class_name + '/' + file_name  
            mel = np.load(audio_path)  

            for i in range(chunks_numb):  
                x_train[count,:,:,:] = mel[:,(melgram_chunk_length*i):(melgram_chunk_length*(i+1)),:]  
                y_train[count,:] = one_hot_label 
                count = count + 1

    x_train, y_train = shuffle_xy(x_train, y_train)  
    np.save(preproc_path + "X_train_" + str(training_win_len) + 's', x_train)  
    np.save(preproc_path + "Y_train_" + str(training_win_len) + 's', y_train)
    return melgram_chunk_length,classes_names


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        description="preprocess_data: convert samples to .npy data format for faster loading")
    parser.add_argument("-i", "--inpath", help="input directory for audio samples (default: IRMAS-Sample)",
                        default="IRMAS-Sample")
    parser.add_argument("-o", "--outpath", help="output directory for preprocessed files (default: Preproc)",
                        default="Preproc")
    args = parser.parse_args()

    preprocess_dataset(args.inpath + '/', args.outpath + '/')

    winLengths = [0.5, 1.0, 1.5, 3.0]
    for winLen in winLengths:
        melChunkLen, classesNames = build_training_dataset(args.outpath + '/', winLen)
    ...
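
As a sanity check on the input shape, the number of spectrogram frames per chunk can be estimated by hand. The sketch below assumes 3-second clips at 22050 Hz and librosa's default centered framing, under which a signal of n samples yields 1 + n // hop_length frames. The helper name is hypothetical.

```python
def frames_per_chunk(clip_seconds, window_seconds, sample_rate=22050, hop_length=512):
    # Frames in the full clip, assuming centered STFT framing.
    total_frames = 1 + int(clip_seconds * sample_rate) // hop_length
    # Same integer truncation the preprocessing code applies.
    chunks = int(clip_seconds / window_seconds)
    return int(total_frames / chunks)


chunk_lengths = {w: frames_per_chunk(3, w) for w in [0.5, 1.0, 1.5, 3.0]}
print(chunk_lengths)
```

Under these assumptions the 1.0 s window gives 43 frames per chunk, in line with the (128, 43, 1) input shape used in the architecture code, while the 3.0 s window gives 130 frames.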

Second question: what could be wrong with the network? I also tried training it on just a few samples and using the same samples as validation data, but val_loss still ends up constant, as you can see here.

Epoch 1/1000
3/3 [==============================] - 1s 256ms/step - loss: 2.3653 - acc: 0.3333 - val_loss: 2.1726 - val_acc: 0.3333
Epoch 2/1000
3/3 [==============================] - 0s 108ms/step - loss: 2.0382 - acc: 0.3333 - val_loss: 1.5727 - val_acc: 0.3333
Epoch 3/1000
3/3 [==============================] - 0s 104ms/step - loss: 1.3635 - acc: 0.3333 - val_loss: 1.1036 - val_acc: 0.6667
Epoch 4/1000
3/3 [==============================] - 0s 109ms/step - loss: 1.1281 - acc: 0.3333 - val_loss: 1.0986 - val_acc: 0.3333
Epoch 5/1000
3/3 [==============================] - 0s 102ms/step - loss: 1.0986 - acc: 0.6667 - val_loss: 1.0986 - val_acc: 0.3333
Epoch 6/1000
3/3 [==============================] - 0s 104ms/step - loss: 1.0986 - acc: 0.3333 - val_loss: 1.0986 - val_acc: 0.3333

Does anyone know what is going on with this network?
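
One observation that may help narrow this down: the plateau values in both logs coincide with the cross-entropy of a uniform prediction. With k classes, predicting 1/k for every class gives a loss of ln(k). The check below assumes 11 classes in the full run (as in classesNumber) and 3 classes in the small experiment; the latter is a guess based on the loss value, not something stated above.

```python
import math


def uniform_cross_entropy(num_classes):
    # Categorical cross-entropy when every class gets probability 1/k.
    return -math.log(1.0 / num_classes)


full_run = round(uniform_cross_entropy(11), 4)   # assumed 11 classes
small_run = round(uniform_cross_entropy(3), 4)   # assumed 3 classes
print(full_run, small_run)
```

These values equal the val_loss of 2.3979 in the first epoch of the full run and the 1.0986 plateau in the small run, which would be consistent with the network producing the same score for every class.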

  • Just below the table in the article, there is this: " The input for each convolution layer is zero-padded with 1 x 1 to preserve the spatial resolution regardless of input window size," – Daniel Möller May 11 '18 at 13:55
  • Yes, that's exactly what I wrote near **First question**. What I don't understand is why the option "padding=same" of Conv2D is not sufficient to do zero padding. I had to add both ZeroPadding2D and "padding=same" to get the right dimensions. – skateskate May 11 '18 at 14:17
  • "Padding = same" is correct, it creates a zero padding and returns an image with the **same size**. I don't know why the article is "increasing" the size of the images, but that is certainly not how convolutions work. --- Normal convolutions "decrease" size. Convolutions with "padding='same'" "maintain" size. Now, convolutions that "increase" size is something new that I have only seen in this article. So, yes, if you want to increase size you need extra padding. – Daniel Möller May 11 '18 at 15:54

1 Answer

I'm the author of this paper. Regarding the padding, it is true that increasing the size of the data might be unnecessary if we consider this particular case only. In the paper, we added extra padding in front of every conv layer to keep the network architecture as constant as possible across the various input sizes (0.5s, 2.0s, 3.0s).

Another thing: the max-pooling stride should just be the default, i.e. equal to the pool size. I think it was my mistake to say stride = 1 in the paper :(

  • Thank you very much for your explanation. May I ask you if 'categorical_crossentropy' is the right choice for this particular loss function in keras? That's because changing it to 'binary_crossentropy' seems to fix my second problem. – skateskate May 18 '18 at 13:46
  • Yes, it should be categorical cross-entropy since it is a multi-class problem. – Yoonchang Jun 07 '18 at 08:35
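
For readers comparing the two losses mentioned in these comments, here is a small plain-Python sketch of how each is computed for a single one-hot target. It assumes the common pairings (softmax with categorical cross-entropy, sigmoid with binary cross-entropy) and uses made-up logits. It illustrates the formulas only and does not settle which one the paper's setup requires.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def categorical_cross_entropy(target, probs):
    # Only the true class contributes for a one-hot target.
    return -sum(t * math.log(p) for t, p in zip(target, probs))


def binary_cross_entropy(target, probs):
    # Each output is treated as an independent yes/no decision, then averaged.
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(target, probs)]
    return sum(terms) / len(terms)


logits = [2.0, -1.0, 0.5]  # hypothetical network scores for 3 classes
target = [1, 0, 0]         # one-hot label

cce = categorical_cross_entropy(target, softmax(logits))
bce = binary_cross_entropy(target, [sigmoid(z) for z in logits])
print(cce, bce)
```

The categorical loss depends only on the probability assigned to the true class, whereas the binary loss also penalises every wrong class whose sigmoid output is far from zero.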