I want to do gesture recognition in Python with the Kinect.
After reading up on some theory, I think one of the best methods is unsupervised learning with Hidden Markov Models (HMMs) (Baum-Welch or some other EM method) on some known gesture data, to obtain a set of trained HMMs (one for each gesture that I want to recognize).
I would then do recognition by matching the observed data against each trained HMM and picking the one with the maximum log-likelihood (computed with Viterbi?).
For example, I have data (x, y, z coordinates of the right hand) recorded with the Kinect device for some gestures (saying hello, throwing a punch, doing a circle with the hand) and I do some training:
# training
known_datas = [
    (load_data('punch.mat'), 'PUNCH'),
    (load_data('say_hello.mat'), 'HELLO'),
    (load_data('do_circle_with_hands.mat'), 'CIRCLE'),
]

gestures = []  # list of (trained model, gesture name) pairs
for x, name in known_datas:
    m = HMM()
    m.baumWelch(x)  # train this gesture's model with Baum-Welch
    gestures.append((m, name))
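(For concreteness, here is a minimal sketch of how this training loop could look with the hmmlearn library, assuming load_data returns each recording as an array of shape (n_frames, 3) holding the x, y, z coordinates per Kinect frame; the choice of 5 hidden states is just a placeholder to tune.)

import numpy as np
from hmmlearn.hmm import GaussianHMM

known_datas = [
    (load_data('punch.mat'), 'PUNCH'),
    (load_data('say_hello.mat'), 'HELLO'),
    (load_data('do_circle_with_hands.mat'), 'CIRCLE'),
]

gestures = []
for x, name in known_datas:
    x = np.asarray(x, dtype=float)      # shape (n_frames, 3): x, y, z per frame
    m = GaussianHMM(n_components=5,     # number of hidden states: a guess, tune it
                    covariance_type='diag',
                    n_iter=100)         # EM (Baum-Welch style) iterations
    m.fit(x)                            # unsupervised training on this gesture's recording
    gestures.append((m, name))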
Then I perform recognition on newly observed data: I compute the log-likelihood of the observation under each trained HMM and choose the gesture whose model gives the maximum:
# recognition
observed = load_data('new_data.mat')
logliks = [m.viterbi(observed) for m, name in gestures]
best = logliks.index(max(logliks))
print 'observed data is', gestures[best][1]
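(Continuing the same hmmlearn sketch: score() gives the forward-algorithm log-likelihood of the whole observation sequence, while decode() would give the Viterbi path and its log-probability instead.)

observed = np.asarray(load_data('new_data.mat'), dtype=float)   # shape (n_frames, 3)
logliks = [m.score(observed) for m, name in gestures]           # log-likelihood under each model
best_model, best_name = gestures[logliks.index(max(logliks))]
print 'observed data is', best_name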
My questions are:
- Is this something totally stupid?
- How many training sequences would I need for a real case?
- How many states for each HMM?
- Is it possible to do it in realtime?