
I am trying to build a simple 5-class object detector by extracting bottleneck features with a pre-trained VGG16 (trained on ImageNet). I have 10000 images for training (2000 per class) and 2500 for testing (500 per class). However, once I extract the bottleneck features, the validation tensor has 2496 samples instead of the expected 2500. I have checked the data folder and confirmed that there are 2500 validation images, but I still get an error when I run my code: "ValueError: Input arrays should have the same number of samples as target arrays. Found 2496 input samples and 2500 target samples". I have attached the code below; can anyone help me understand why the number of input samples is being reduced to 2496?

I also counted the images in the train and test directories to be really sure that none were missing. It turns out that no images are actually missing.

This is the code to get the bottleneck features.

from datetime import datetime as dt

import numpy as np
from keras import applications
from keras.preprocessing.image import ImageDataGenerator

global_start = dt.now()

#Dimensions of our Flickr images: 256 x 256
img_width, img_height = 256, 256

#Declaration of parameters needed for training and validation
train_data_dir = 'data/train'
validation_data_dir = 'data/validation'
epochs = 50
batch_size = 16

#Get the bottleneck features (the output of the convolutional base for each image)
def save_bottleneck_features():
    datagen = ImageDataGenerator(rescale=1./255)

    #Load the pre-trained VGG16 model from Keras; we keep only the convolutional layers and drop the fully connected top.
    model = applications.VGG16(include_top=False, weights='imagenet')

    generator_tr = datagen.flow_from_directory(train_data_dir,
                                            target_size=(img_width, img_height),
                                            batch_size=batch_size,
                                            class_mode=None, #class_mode=None means the generator won't load the class labels.
                                            shuffle=False) #We won't shuffle the data, because we want the class labels to stay in order.
    nb_train_samples = len(generator_tr.filenames) #10000. 2000 training samples for each class
    bottleneck_features_train = model.predict_generator(generator_tr, nb_train_samples // batch_size)
    np.save('weights/vgg16bottleneck_features_train.npy',bottleneck_features_train) #bottleneck_features_train is a numpy array

    generator_ts = datagen.flow_from_directory(validation_data_dir,
                                            target_size=(img_width, img_height),
                                            batch_size=batch_size,
                                            class_mode=None,
                                            shuffle=False)
    nb_validation_samples = len(generator_ts.filenames) #2500. 500 validation samples for each class
    bottleneck_features_validation = model.predict_generator(generator_ts, nb_validation_samples // batch_size)
    np.save('weights/vgg16bottleneck_features_validation.npy',bottleneck_features_validation)
    print("Got the bottleneck features in time: ",dt.now()-global_start)

    num_classes = len(generator_tr.class_indices)

    return nb_train_samples,nb_validation_samples,num_classes,generator_tr,generator_ts

nb_train_samples, nb_validation_samples, num_classes, generator_tr, generator_ts = save_bottleneck_features()

This is the output of the above code snippet:

Found 10000 images belonging to 5 classes.
Found 2500 images belonging to 5 classes.
Got the bottleneck features in time:  1:56:44.166846
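
For completeness, here is a minimal sketch of how the saved arrays would be loaded back to check the shapes mentioned below; train_data and validation_data are not defined in the snippet above, so this part is assumed:

import numpy as np

#Load the bottleneck features written out by save_bottleneck_features()
train_data = np.load('weights/vgg16bottleneck_features_train.npy')
validation_data = np.load('weights/vgg16bottleneck_features_validation.npy')

print(train_data.shape)       #(10000, 8, 8, 512)
print(validation_data.shape)  #(2496, 8, 8, 512) instead of the expected (2500, 8, 8, 512)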

Now, if I check validation_data.shape, I get (2496, 8, 8, 512), whereas the expected shape is (2500, 8, 8, 512). The train_data shape is fine. What might be wrong? I am new to debugging in Keras and cannot figure out what exactly is causing this.

Any help would be highly appreciated!

Saugata Paul
  • 2496 is divisible by 16, your batch size, so it is probably rounding down to a whole number of batches. I had a similar problem with a TensorFlow setup once. Also, check your threading, because if it is multithreaded the output vector is not necessarily in the same order as the input images. – Pam May 04 '19 at 07:19
  • In nb_validation_samples // batch_size you are rounding down. – Pam May 04 '19 at 07:19
  • Is it because of the batch size I have chosen? 10000 is perfectly divisible by 16, whereas 2500 is not. Is that why it's behaving like this? I have changed the batch size to 20 for now; let's see if that works. Will update in another 2 hours. – Saugata Paul May 04 '19 at 07:22
  • I changed the batch size to 20 and it seems to have solved the problem. Thanks for your feedback! The problem was occurring because the total test size was not perfectly divisible by 16, so the sample count was rounded down to 2496. – Saugata Paul May 05 '19 at 05:59
  • Yeah, I had this problem, too. It doesn’t matter for training, but it does for testing. Make sure your test results "make sense". I don’t think you’re multithreading but if you are, predictions won’t be in the same order as input. – Pam May 05 '19 at 11:25
  • Thanks Pam for the advice! – Saugata Paul May 06 '19 at 08:45
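
For reference, a short sketch of what the comments above describe and of one alternative fix. nb_validation_samples // batch_size is floor division, so predict_generator is asked for 2500 // 16 = 156 batches, i.e. 156 * 16 = 2496 samples; the last 4 images are never fed through the network. Besides choosing a batch size that divides the sample count evenly (as done above), rounding the step count up also works. This is a sketch against the generators defined earlier, not code from the original post:

import math

#Ask predict_generator for enough batches to cover every image; with shuffle=False
#the final batch simply holds the 2500 - 156*16 = 4 leftover images.
steps_ts = int(math.ceil(nb_validation_samples / float(batch_size)))  #157
bottleneck_features_validation = model.predict_generator(generator_ts, steps_ts)
#bottleneck_features_validation.shape is now (2500, 8, 8, 512)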

0 Answers