I'm trying to perform some basic action recognition using the KTH dataset.

I'm using the 3DSIFT feature extractor from UCF (link), which extracts a SIFT descriptor at a given x, y and z coordinate.

For feature detection I am using selective-STIPS (link), which has been shown to be very effective for action recognition. According to the source code provided by the author, it produces the following output:

    % @output : corner_points, P X 4 matrix, where P is the number of interest
    %           point found in the image_stack and each interest point contains
    %           4 values :: [X,Y] coordinate of the interest point, frame
    %           number, scale at which it is detected.

Am I right to assume that the frame number provided here is also the Z-coordinate required by 3DSIFT?

I extracted STIPS from a video clip and got the required output but I am getting multiple X and Y values on every frame:

    [71,24,1]
    [54,26,1]
    [86,29,1]
    ...

Is this the expected output, and is it acceptable input for 3DSIFT?

StuckInPhDNoMore
  • From what I can gather, you are asking about third-party toolboxes or pieces of code without, at a minimum, linking to them. How is anyone supposed to know how these things work without seeing the code and knowing what version of something you are running? – Aero Engy Nov 08 '17 at 17:05
  • @AeroEngy I didn't feel the need to link, as I felt this was a general question about video recognition rather than one specific to any of the toolboxes. But I have linked to the scripts now, if that helps. – StuckInPhDNoMore Nov 08 '17 at 17:13

1 Answer

Yes. From what I can tell after following through the 3DSIFT code, Z is equivalent to the frame number when dealing with video, so the x, y, frame output from STIPS should work as the x, y, z input to 3DSIFT.
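
To make the mapping concrete, here is a minimal sketch in plain Python (not the MATLAB toolboxes themselves). It assumes each detector row is laid out as `[x, y, frame, scale]`, as the quoted header comment describes. The helper names `stips_to_xyz` and `group_by_frame` are hypothetical, and the scale values in the sample data are made up for illustration. Getting several (x, y) pairs for the same frame is what you would normally expect from an interest point detector, since it can fire at many locations in one frame.

```python
from collections import defaultdict


def stips_to_xyz(corner_points):
    """Turn rows of [x, y, frame, scale] into (x, y, z) triples, using z = frame."""
    return [(x, y, frame) for x, y, frame, _scale in corner_points]


def group_by_frame(corner_points):
    """Collect the (x, y) locations detected in each frame."""
    by_frame = defaultdict(list)
    for x, y, frame, _scale in corner_points:
        by_frame[frame].append((x, y))
    return dict(by_frame)


# x, y, frame are taken from the question; the scale column is invented here.
corner_points = [
    (71, 24, 1, 2.0),
    (54, 26, 1, 2.0),
    (86, 29, 1, 4.0),
]

xyz = stips_to_xyz(corner_points)        # one (x, y, z) triple per interest point
frames = group_by_frame(corner_points)   # several points can share one frame

print(xyz)
print(frames)
```

Each triple would then be passed to the descriptor one point at a time. It is still worth checking the 3DSIFT source for whether it expects (x, y) or (row, column) order.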

Aero Engy