
I am using the following code to load a ResNet-50, but since this is a video model I am not sure what the expected input is. Is it [batch_size, channels, frames, img1, img2]?

Any help would be fantastic.

import torch.nn as nn
import pytorchvideo.models.resnet

def resnet():
  return pytorchvideo.models.resnet.create_resnet(
      input_channel=3,     # RGB input from Kinetics
      model_depth=50,      # For the tutorial let's just use a 50 layer network
      model_num_class=400, # Kinetics has 400 classes so our final head has to align
      norm=nn.BatchNorm3d,
      activation=nn.ReLU,
  )
Jordy
Rayanxv
  • Shouldn't the input be `[batch_size, frames, 3, height, width]`, since it is usually of the form `[batch_size, num_frames, channels, height, width]`? – Memristor May 01 '23 at 21:29

2 Answers


The shape of the input tensor should be (B, C, T, H, W)

Source: https://pytorchvideo.readthedocs.io/en/latest/models.html#resnet-models-for-video-classification

Here is an example of usage from the documentation:

import torch
import pytorchvideo.models as models

resnet = models.create_resnet()
B, C, T, H, W = 2, 3, 8, 224, 224
input_tensor = torch.zeros(B, C, T, H, W)
output = resnet(input_tensor)
Toyo

The input tensor shape should be [batch_size, channels, frames, height, width], where:

  • Channels: 3 for RGB images,
  • Frames: the number of frames per video clip,
  • Height and Width: the spatial dimensions of the frames.

In your case (Kinetics 400), the expected input tensor shape should be [batch_size, 3, frames, height, width].
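
As a quick sanity check, here is a minimal sketch (assuming the same create_resnet arguments as in your question) that passes a dummy tensor of that shape through the model and gets back one logit per Kinetics class:

import torch
import torch.nn as nn
import pytorchvideo.models.resnet

model = pytorchvideo.models.resnet.create_resnet(
    input_channel=3,
    model_depth=50,
    model_num_class=400,
    norm=nn.BatchNorm3d,
    activation=nn.ReLU,
)

# Dummy clip: batch of 2, RGB, 8 frames, 224x224 spatial resolution
dummy = torch.randn(2, 3, 8, 224, 224)
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # torch.Size([2, 400])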

Here is a small example of how to load a video:

import torch
import torchvision.transforms as transform
from pytorchvideo.data.encoded_video import EncodedVideo

def load_video_clip(video_path, frames, height, width):
    video = EncodedVideo.from_path(video_path)

    # get_clip returns a dict; the "video" entry is a float tensor of shape
    # [channels, frames, height, width] with pixel values in [0, 255]
    clip = video.get_clip(start_sec=0, end_sec=video.duration)["video"]

    # Sample `frames` frame indices uniformly along the temporal dimension
    indices = torch.linspace(0, clip.shape[1] - 1, steps=frames).long()
    clip = clip[:, indices]

    # Resize the spatial dimensions and scale pixel values to [0, 1]
    clip = transform.Resize((height, width))(clip)
    clip = clip / 255.0

    # Add a batch dimension -> [1, channels, frames, height, width]
    return clip.unsqueeze(0)

video_path = 'video.mp4'
frames = 16
height = 224
width = 224

video_clip = load_video_clip(video_path, frames, height, width)

Here, we decode the video, sample the requested number of frames uniformly along the temporal dimension, resize the spatial dimensions, scale the pixel values, and add a batch dimension, giving a clip of shape [1, 3, frames, height, width].
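
If you then want to run the clip through the model, here is a minimal sketch that reuses the resnet() factory from your question (note that Kinetics mean/std normalization is omitted here, so the predictions are only illustrative):

model = resnet()  # the create_resnet factory from the question
model.eval()

with torch.no_grad():
    logits = model(video_clip)           # [1, 3, 16, 224, 224] -> [1, 400]
    probs = torch.softmax(logits, dim=-1)

print(probs.shape)        # torch.Size([1, 400])
print(probs.argmax(-1))   # index of the highest-scoring Kinetics class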

Hamzah