Very late to the party here, but I thought this might help others searching for the same thing.
In terms of basic image analysis, this area has moved forward in leaps and bounds recently, and a lot of providers now offer this functionality. The quality varies quite a bit and depends on how well trained the provider's models are and how large a corpus they were trained on. Two examples I have worked with are IBM and Clarifai, but it's a booming area.
What they won't give you is the type of context you are after. Not yet anyway. They are unlikely to differentiate between two men hugging and two men wrestling (hey, who can tell the difference as a human sometimes anyway?). They may, however, pick out a desk, a cup of coffee, a book, etc.
Video search and contextualisation is another challenge entirely, and it is in its infancy. There is at least one company making big inroads in this area (full disclosure, I work there). Movida Labs analyses and indexes many factors in a video to provide a lot of context, so in your example it could very likely tell that this was a video of two men wrestling. I have to admit that this is not because of some breakthrough in technology (although it is very advanced) but because the video in its entirety provides that context.
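
To make the "video in its entirety" point a bit more concrete, here is a toy sketch of the general idea only. It is not how Movida Labs or any other provider actually works. The per-frame labels are made up, and `video_level_labels` and `min_fraction` are hypothetical names I introduced for illustration. It simply shows that labels which look ambiguous in a single frame can add up to a clearer picture across many frames.

```python
from collections import Counter

# Hypothetical per-frame labels, as an image tagging service might return
# them for sampled frames of one video. The data is made up for illustration.
frame_labels = [
    ["man", "man", "mat"],
    ["man", "man", "mat", "referee"],
    ["man", "man", "crowd", "referee"],
    ["man", "mat", "crowd"],
]


def video_level_labels(frames, min_fraction=0.5):
    """Return labels that appear in at least min_fraction of the frames."""
    counts = Counter()
    for labels in frames:
        counts.update(set(labels))  # count each label once per frame
    threshold = min_fraction * len(frames)
    return sorted(label for label, n in counts.items() if n >= threshold)


summary = video_level_labels(frame_labels)
print(summary)
```

A single frame tagged "man, man, mat" says very little, but a referee and a crowd showing up repeatedly across the video hints at a wrestling match rather than a hug. A real system would presumably weigh far more signals than this.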