ResNet50-LSTM activity classification: failing to connect the average pooling layer to the LSTM

Question

I am trying to build a ResNet50-LSTM model (ResNet50 as the feature extractor, feeding into an LSTM, followed by softmax classification) that classifies 3 categories of human activity.

I extracted 16 frames for each sample, for a total of 60 samples, processed with a batch size of 4 at a resolution of 112. (This is just a subset; the actual set is larger, with 8 classes.)

Hence the input tensor has shape (4, 16, 112, 112, 3), i.e. (b, seq, w, h, c). The label for each batch is a one-hot tensor of shape (4, 3).

The output of ResNet50 has the shape (4, 16, 4, 4, 2048). It is flattened to (4, 16, 32768) and passed into an LSTM with 256 units. I expect the LSTM output to be (4, 16, 256), which I can then pass into a 1D average pooling layer and finally a dense layer to classify, with a result of shape (4, 3).

However, the following error is shown when I try to compile the model: Error:Input 0 of layer "global_average_pooling1d" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 256)

This seems to suggest that the LSTM is outputting a tensor of shape (batch, 256).
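The shape reported in the error can be checked in isolation. The sketch below is a minimal shape check, not the actual model: it assumes random stand-in features with 32 values per frame instead of the real 32768, and the names features, lstm_out and frame_slice are made up for illustration. It suggests that the LSTM with return_sequences=True does return a 3D tensor, and that the 2D shape (None, 256) is more likely what the TimeDistributed wrapper hands to its inner layer, since that wrapper applies the inner layer per timestep with the time axis removed.

```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM

# Stand-in for the flattened per-frame ResNet50 features: (batch, seq, features).
# 32 features instead of 32768 is an assumption to keep the sketch light.
features = tf.random.normal((4, 16, 32))

# With return_sequences=True the LSTM keeps the time axis.
lstm_out = LSTM(256, return_sequences=True)(features)
print(lstm_out.shape)  # (4, 16, 256)

# A single timestep has no time axis left. A layer wrapped in TimeDistributed
# only sees 2D tensors like this one, which GlobalAveragePooling1D
# (expects ndim=3) cannot accept.
frame_slice = lstm_out[:, 0, :]
print(frame_slice.shape)  # (4, 256)
```

If this reading is right, the pooling layer should be applied to the LSTM output directly rather than inside TimeDistributed.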

However, when I remove the average pooling layer, the model builds as shown below:

[Image: network architecture]

This leads to another problem: the output of shape (4, 16, 3) still holds the sequence dimension, so it does not match the labels of shape (4, 3) when computing the loss during training.

The following is the code I used to build the network:

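from tensorflow.keras import Input, Model
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import (Dense, Flatten, GlobalAveragePooling1D,
                                     LSTM, TimeDistributed)

INPUT_SHAPE = (16, 112, 112, 3)  # (seq, w, h, c), as described above
NUM_CLASSES = 3

resnet = ResNet50(weights='imagenet', include_top=False, input_shape=(112, 112, 3))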

# Freeze ResNet50 layers
for layer in resnet.layers:
    layer.trainable = False

# Extract features using ResNet50
inputs = Input(shape=INPUT_SHAPE)
x = TimeDistributed(resnet)(inputs)
x = TimeDistributed(Flatten())(x)

# LSTM layer
lstm = LSTM(256, return_sequences=True)(x)

# Average pooling layer
avg_pool = TimeDistributed(GlobalAveragePooling1D())(lstm)

# Output layer
outputs = Dense(NUM_CLASSES, activation='softmax')(avg_pool)

# Create model
model = Model(inputs, outputs)
model.summary()
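One possible way to get an output of shape (None, 3) from this architecture is to drop the TimeDistributed wrapper around GlobalAveragePooling1D, so that the pooling averages over the 16 timesteps of the full LSTM output. The following is a sketch under stated assumptions rather than a verified fix: INPUT_SHAPE and NUM_CLASSES are set from the shapes described in the question, and weights=None is used only so the sketch builds without downloading the ImageNet weights (the original code uses weights='imagenet').

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import (Dense, Flatten, GlobalAveragePooling1D,
                                     LSTM, TimeDistributed)

INPUT_SHAPE = (16, 112, 112, 3)  # (seq, w, h, c), taken from the question
NUM_CLASSES = 3

# weights=None is an assumption for this sketch; use weights='imagenet'
# to get the pretrained feature extractor from the original code.
resnet = ResNet50(weights=None, include_top=False, input_shape=(112, 112, 3))
resnet.trainable = False  # freeze the feature extractor

inputs = Input(shape=INPUT_SHAPE)
x = TimeDistributed(resnet)(inputs)       # (None, 16, 4, 4, 2048)
x = TimeDistributed(Flatten())(x)         # (None, 16, 32768)
x = LSTM(256, return_sequences=True)(x)   # (None, 16, 256)

# No TimeDistributed here: pool across the time axis of the whole sequence.
x = GlobalAveragePooling1D()(x)           # (None, 256)
outputs = Dense(NUM_CLASSES, activation='softmax')(x)

model = Model(inputs, outputs)
print(model.output_shape)  # (None, 3)
```

With this layout the one-hot labels of shape (4, 3) line up with the model output, so a loss such as categorical cross-entropy can be computed per sample.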

Does my model setup actually make sense? Should I rectify the network so that it outputs a result of shape (4, 3), or modify the labels to shape (4, 16, 3)? I personally feel that the first option is more accurate. I sincerely appreciate any input. Thank you very much.
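The two options can be compared on shapes alone. The sketch below is illustrative only: it assumes random stand-in features with 32 values per frame, made-up labels, and hypothetical variable names (clip_probs, frame_probs, frame_labels). Option 1 collapses the time axis so that one prediction per sample is compared against the (4, 3) labels. Option 2 keeps a prediction per frame and repeats each sample's label 16 times.

```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, GlobalAveragePooling1D

# Stand-in features and labels; sizes other than (4, 16, ...) and 3 classes
# are assumptions for this sketch.
features = tf.random.normal((4, 16, 32))
labels = tf.one_hot([0, 1, 2, 0], depth=3)            # (4, 3)

seq = LSTM(256, return_sequences=True)(features)      # (4, 16, 256)

# Option 1: average over time, one prediction per sample.
clip_probs = Dense(3, activation="softmax")(GlobalAveragePooling1D()(seq))
clip_loss = tf.keras.losses.categorical_crossentropy(labels, clip_probs)
print(clip_probs.shape, clip_loss.shape)              # (4, 3) (4,)

# Option 2: one prediction per frame, labels repeated along the time axis.
frame_probs = Dense(3, activation="softmax")(seq)     # (4, 16, 3)
frame_labels = tf.repeat(labels[:, tf.newaxis, :], repeats=16, axis=1)
frame_loss = tf.keras.losses.categorical_crossentropy(frame_labels, frame_probs)
print(frame_probs.shape, frame_loss.shape)            # (4, 16, 3) (4, 16)
```

Both variants give consistent shapes. Since each sample carries a single activity label, option 1 matches the data as described, while option 2 would also need the per-frame predictions to be aggregated at inference time.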
