The input tensor shape should be [batch_size, channels, frames, height, width], where:
- Channels: 3 for RGB images,
- Frames: the number of frames per video clip,
- Height and Width: the spatial dimensions of the frames.
In your case (Kinetics 400), the expected input tensor shape should be [batch_size, 3, frames, height, width].
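For example, a batch of two 16-frame RGB clips at 224x224 resolution would look like this (the concrete sizes are just illustrative):

import torch

dummy_batch = torch.randn(2, 3, 16, 224, 224)  # [batch, channels, frames, height, width]
print(dummy_batch.shape)  # torch.Size([2, 3, 16, 224, 224])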
Here is a small example of how to load a video:
import torch
import torchvision.transforms as transforms
from pytorchvideo.data.encoded_video import EncodedVideo

def load_video_clip(video_path, frames, height, width):
    video = EncodedVideo.from_path(video_path)
    # get_clip returns a dict; its "video" entry is a float tensor of
    # shape [channels, total_frames, height, width]
    clip = video.get_clip(start_sec=0, end_sec=video.duration)["video"]
    # Uniformly sample `frames` indices along the temporal dimension
    indices = torch.linspace(0, clip.shape[1] - 1, steps=frames).long()
    clip = clip[:, indices]
    # Resize the spatial dimensions (Resize acts on the last two dims)
    clip = transforms.Resize((height, width))(clip)
    # Add the batch dimension -> [1, channels, frames, height, width]
    return clip.unsqueeze(0)
video_path = 'video.mp4'
frames = 16
height = 224
width = 224
video_clip = load_video_clip(video_path, frames, height, width)
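You can sanity-check the result; with the values above the clip should have the expected layout:

print(video_clip.shape)  # torch.Size([1, 3, 16, 224, 224])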
Here, we decode the video, uniformly sample the requested number of frames along the temporal dimension, resize each frame spatially, and finally add the batch dimension so the clip matches the [batch_size, channels, frames, height, width] layout.
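If the goal is to feed this clip to a pretrained Kinetics-400 model, a minimal sketch could look like the following (assuming torchvision >= 0.13 and its r3d_18 video model; the mean/std values are the Kinetics-400 statistics torchvision documents for these weights, so verify them against the weights you actually use):

from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1).eval()
# Pretrained weights expect pixels rescaled to [0, 1] and normalized with
# the Kinetics-400 statistics (assumed values, check your weights' docs)
mean = torch.tensor([0.43216, 0.394666, 0.37645]).view(1, 3, 1, 1, 1)
std = torch.tensor([0.22803, 0.22145, 0.216989]).view(1, 3, 1, 1, 1)
inp = (video_clip / 255.0 - mean) / std
with torch.no_grad():
    logits = model(inp)  # shape [1, 400], one score per Kinetics-400 class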