I'm trying to perform some basic action recognition using the KTH dataset.
I'm using the 3DSIFT feature extractor from UCF (link), which extracts a SIFT descriptor at a given x, y and z coordinate.
For feature detection I am using selective-STIPS (link), which has been shown to be very effective for action recognition. According to the source code provided by the author, it produces the following output:
@output : corner_points, P X 4 matrix, where P is the number of interest
% point found in the image_stack and each interest point contains
% 4 values :: [X,Y] coordinate of the interest point, frame
% number, scale at which it is detected.
Am I right to assume that the frame number provided here is also the Z-coordinate required by 3DSIFT?
I extracted STIPS from a video clip and got the required output, but I am getting multiple X and Y values on every frame:
[71,24,1]
[54,26,1]
[86,29,1]
...
..
.
Is this the expected output, and is it accepted as input by 3DSIFT?
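
To make the question concrete, here is a minimal sketch of how I am currently interpreting the rows. It is written in plain Python rather than the MATLAB of the original code, `group_by_frame` is a hypothetical helper of my own, and the scale values and the frame-2 row are made up for illustration. Treating the frame number as z is exactly the assumption I am unsure about.

```python
# Rows follow the documented layout [x, y, frame, scale].
# The x, y, frame values of the first three rows are the ones printed above;
# the scale column and the last row are made up for illustration.
corner_points = [
    [71, 24, 1, 2.0],
    [54, 26, 1, 2.0],
    [86, 29, 1, 4.0],
    [60, 30, 2, 2.0],
]


def group_by_frame(points):
    """Map each frame number to the (x, y, z) locations detected in it.

    z is taken to be the frame number, which is the assumption in question.
    """
    grouped = {}
    for x, y, frame, _scale in points:
        grouped.setdefault(int(frame), []).append((int(x), int(y), int(frame)))
    return grouped


locations = group_by_frame(corner_points)
for frame, pts in sorted(locations.items()):
    # Several interest points can share one frame number.
    print(frame, pts)
```

Under this reading, every row would become one (x, y, z) location passed to the descriptor, so several locations per frame would simply mean several descriptors for that frame.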