The input tensor shape should be [batch_size, channels, frames, height, width], where:
- Channels: 3 for RGB images,
- Frames: the number of frames per video clip,
- Height and Width: the spatial dimensions of the frames.
In your case (Kinetics 400), the expected input tensor shape should be [batch_size, 3, frames, height, width].
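For example, a batch of two 16-frame RGB clips at 224x224 resolution would look like this (the concrete sizes are just illustrative):

import torch

dummy_batch = torch.randn(2, 3, 16, 224, 224)  # [batch, channels, frames, height, width]
print(dummy_batch.shape)  # torch.Size([2, 3, 16, 224, 224])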
Here is a small example of how to load a video:
import torch
import torchvision.transforms as transforms
from pytorchvideo.data.encoded_video import EncodedVideo

def load_video_clip(video_path, frames, height, width):
    video = EncodedVideo.from_path(video_path)
    # get_clip returns a dict; its "video" entry is a float tensor of
    # shape [channels, total_frames, height, width]
    clip = video.get_clip(start_sec=0, end_sec=video.duration)["video"]
    # Uniformly sample `frames` indices along the temporal dimension
    indices = torch.linspace(0, clip.shape[1] - 1, steps=frames).long()
    clip = clip[:, indices]
    # Resize the spatial dimensions (Resize acts on the last two dims)
    clip = transforms.Resize((height, width))(clip)
    # Add the batch dimension -> [1, channels, frames, height, width]
    return clip.unsqueeze(0)
video_path = 'video.mp4'
frames = 16
height = 224
width = 224
video_clip = load_video_clip(video_path, frames, height, width)
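You can sanity-check the result; with the values above the clip should have the expected layout:

print(video_clip.shape)  # torch.Size([1, 3, 16, 224, 224])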
Here, we decode the video, uniformly sample the requested number of frames along the temporal dimension, resize each frame spatially, and finally add the batch dimension so the clip matches the [batch_size, channels, frames, height, width] layout.
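If the goal is to feed this clip to a pretrained Kinetics-400 model, a minimal sketch could look like the following (assuming torchvision >= 0.13 and its r3d_18 video model; the mean/std values are the Kinetics-400 statistics torchvision documents for these weights, so verify them against the weights you actually use):

from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1).eval()
# Pretrained weights expect pixels rescaled to [0, 1] and normalized with
# the Kinetics-400 statistics (assumed values, check your weights' docs)
mean = torch.tensor([0.43216, 0.394666, 0.37645]).view(1, 3, 1, 1, 1)
std = torch.tensor([0.22803, 0.22145, 0.216989]).view(1, 3, 1, 1, 1)
inp = (video_clip / 255.0 - mean) / std
with torch.no_grad():
    logits = model(inp)  # shape [1, 400], one score per Kinetics-400 class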