
I want to do gesture recognition in Python with a Kinect.

After reading up on some theory, I think one of the best methods is unsupervised learning with Hidden Markov Models (HMMs) (Baum-Welch or some other EM method) on some known gesture data, to obtain a set of trained HMMs (one for each gesture I want to recognize).

I would then do recognition by matching the maximum log-likelihood (with Viterbi?) of the observed data against each HMM in the trained set.

For example, I have data (x, y, z coordinates of the right hand) recorded with the Kinect device for some gestures (saying hello, throwing a punch, doing a circle with the hand) and I do some training:

# training
known_datas = [
    (load_data('punch.mat'),                'PUNCH'),
    (load_data('say_hello.mat'),            'HELLO'),
    (load_data('do_circle_with_hands.mat'), 'CIRCLE'),
]

gestures = []
for x, name in known_datas:
    m = HMM()           # placeholder HMM class from whatever library is used
    m.baumWelch(x)      # unsupervised training on the observation sequence
    gestures.append((name, m))

Then I perform recognition on newly observed data by computing the log-likelihood under each trained HMM and choosing the gesture whose model gives the maximum:

# recognition
observed = load_data('new_data.mat')
logliks = [(m.viterbi(observed), name) for name, m in gestures]

best_loglik, best_name = max(logliks)
print('observed data is', best_name)
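For comparison, roughly the same pipeline can be sketched with the hmmlearn library (hmmlearn, the number of states and the diagonal covariance are my assumptions, not something fixed above; known_datas and load_data are reused from the training snippet). Its score() returns the log-likelihood of a sequence under a trained model:

# sketch with hmmlearn (assumed installed); each gesture gets its own
# GaussianHMM trained on a (T, 3) array of hand coordinates
from hmmlearn.hmm import GaussianHMM

models = {}
for x, name in known_datas:
    m = GaussianHMM(n_components=5, covariance_type='diag', n_iter=50)
    m.fit(x)                      # EM / Baum-Welch training
    models[name] = m

observed = load_data('new_data.mat')
# score() computes the forward-algorithm log-likelihood of the whole sequence
best = max(models, key=lambda name: models[name].score(observed))
print('observed data is', best)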

My questions are:

  • Is this something totally stupid?
  • How many training sets do I need for a real case?
  • How many states for each HMM?
  • Is it possible to do it in real time?
Shawn Chin
nkint

2 Answers


First of all: this is a rather specialized question, you'll need a machine learning expert here. Unfortunately there's no ML equivalent among the Stack Exchange sites yet... maybe there will be one some day. :)

I guess your approach is valid, just some remarks:

  • The HMM class which you just instantiate with HMM() here needs to be crafted so that the HMM's structure can represent something resembling a gesture. HMMs have states and transitions between them, so how would you define an HMM for a gesture? I'm positive that this is possible (and even think it's a good approach), but it requires some thought. Maybe the states are just the corners of a 3D cube, and for each observed point of your gesture you pick the closest corner of that cube. The Baum-Welch algorithm can then estimate the transition probabilities from your training data. You may need a more fine-grained state model, though, perhaps an n * n * n voxel grid (see the sketch after this list).

  • The Viterbi algorithm gives you not the likelihood of a model but the most likely sequence of states for a given observation sequence. IIRC you'd pick the forward algorithm to get the probability of a given observation sequence under a given model.
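For concreteness, here is a minimal sketch of that voxel discretization (the grid size and the bounding box of the hand coordinates are arbitrary assumptions here, and only NumPy is used); the resulting integer sequence is what a discrete HMM trained with Baum-Welch would consume:

# map a (T, 3) array of hand coordinates to discrete voxel symbols
import numpy as np

def to_voxel_sequence(points, n=4, lo=-1.0, hi=1.0):
    """Assign each (x, y, z) point to a cell of an n*n*n voxel grid
    spanning [lo, hi] on each axis and return one symbol per frame."""
    points = np.asarray(points, dtype=float)
    # scale each coordinate into [0, n) and clip onto the grid
    bins = np.clip(((points - lo) / (hi - lo) * n).astype(int), 0, n - 1)
    # flatten the (ix, iy, iz) triple into a single symbol in [0, n**3)
    return bins[:, 0] * n * n + bins[:, 1] * n + bins[:, 2]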

I assume that, given a well-trained and not too complex HMM, you should be able to recognize gestures in real-time, but that's just an educated guess. :)

Johannes Charra

HMM-based gesture recognition has already been applied successfully in many variations: http://scholar.google.co.il/scholar?hl=en&q=HMM+Gesture+Recognition

Remarks:

cyborg