
Problem

In an online process consisting of several steps, I have data on people who complete the process and people who drop out. For each user, the data consists of a sequence of process steps per time interval, say one second.

An example of such a sequence for a completed user would be [1,1,1,1,2,2,2,3,3,3,3,...,-1], where the user is in step 1 for four seconds, followed by step 2 for three seconds, step 3 for four seconds, and so on, before reaching the end of the process (denoted by -1). An example of a drop-out would be [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2], where the user spends an excessive timespan in step 1, then 5 seconds in step 2, and then closes the webpage (so never reaching the end, -1).
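To make the data format concrete, here is a small sketch that generates synthetic sequences of exactly this shape (the dwell-time range of 2–6 seconds per step is an arbitrary assumption for illustration):

```python
import random

def make_sequence(n_steps=3, completed=True, seed=None):
    """Generate a synthetic per-second step sequence.

    Completed users pass through every step and end with the -1 marker;
    drop-outs stop partway through (the page is simply closed)."""
    rng = random.Random(seed)
    seq = []
    # drop-outs never get past some intermediate step
    last_step = n_steps if completed else rng.randint(1, n_steps - 1)
    for step in range(1, last_step + 1):
        dwell = rng.randint(2, 6)  # seconds spent in this step (assumed range)
        seq.extend([step] * dwell)
    if completed:
        seq.append(-1)  # end-of-process marker
    return seq

print(make_sequence(completed=True, seed=0))
print(make_sequence(completed=False, seed=0))
```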

Based on a model, I would like to predict/classify online (as in 'real time') the probability of the user completing the process or dropping out.

Approach

I have read about HMMs and I would like to apply the following principle:

  • train one model using the sequences of people that completed the process

  • train another model using the sequences of people that did not complete the process

  • collect the stream of incoming data of an unseen user and, at each timestep, use the forward algorithm on each of the models to see which of the two models is most likely to have produced this stream. The corresponding model then represents the label associated with this stream.

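The comparison step above can be sketched with a plain forward algorithm over discrete observations. This is a minimal numpy sketch with hand-picked 2-state parameters purely for illustration; in practice both models' parameters would come from Baum-Welch fits on the two groups of sequences:

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of an observation sequence under a discrete HMM,
    computed with the forward (alpha) recursion in log space."""
    log_t = np.log(trans)
    log_e = np.log(emit)
    alpha = np.log(start) + log_e[:, obs[0]]
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_t, axis=0) + log_e[:, o]
    return np.logaddexp.reduce(alpha)

# Two toy models over observation symbols {0, 1}; the numbers are made up.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit_complete = np.array([[0.9, 0.1], [0.1, 0.9]])  # "completer" model
emit_dropout = np.array([[0.5, 0.5], [0.5, 0.5]])   # "drop-out" model

stream = [0, 0, 0, 1, 1, 1]  # incoming observations, scored as they arrive
ll_c = forward_loglik(stream, start, trans, emit_complete)
ll_d = forward_loglik(stream, start, trans, emit_dropout)
print("completer" if ll_c > ll_d else "drop-out")
```

Because the recursion only needs the previous alpha vector, it can be updated incrementally as each new observation streams in, which matches the real-time requirement.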
What is your opinion? Is this doable? I have been looking at the Python libraries hmmlearn and pomegranate, but I cannot seem to create a small working example to test with. Some test code of mine can be found below, with some artificial data:

from pomegranate import *
import numpy as np

# generate data of some sample sequences of length 4
# mean and std of each step in sequence
means = [1,2,3,4] 
stds = [0.1, 0.1, 0.1, 0.1]
num_data = 100

data = []

for mean, std in zip(means, stds):
    d = np.random.normal(mean, std, num_data)
    data.append(d)

data = np.array(data).T
# create model (based on sample code of pomegranate https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb)
s1 = State( NormalDistribution( 1, 1 ), name="s1" )
s2 = State( NormalDistribution( 2, 1 ), name="s2" )

model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, pseudocount=4.2 )
model.add_transition( model.start, s2, 0.5, pseudocount=1.3 )

model.add_transition( s1, s2, 0.5, pseudocount=5.2 )
model.add_transition( s2, s1, 0.5, pseudocount=0.9 )
model.bake()
#model.plot()
# fit model
model.fit( data, use_pseudocount=False, algorithm = 'baum-welch', verbose=False )
# get probability of very clean sequence (mean of each step)
p = model.probability([1,2,3,4])
print(p) # 3.51e-112

I would expect the probability of this very clean sequence to be close to 1, since the values are the means of each of the step distributions. How can I improve this example and eventually apply it to my application?
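One reason the number is so small, independent of any library: with continuous (Gaussian) emissions, `model.probability` returns a joint *density*, not a probability, so it is not comparable to 1; and a two-state model fitted near means 1 and 2 assigns vanishingly little density to observations like 4. A tiny sketch of the second effect (plain math, no HMM library):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# A state fitted near step 2 with a tight std assigns essentially no
# density to an observation of 4, dragging the whole product down:
print(normal_pdf(2.0, 2.0, 0.1))  # ~3.99 at the mean: a density can exceed 1
print(normal_pdf(4.0, 2.0, 0.1))  # on the order of 1e-87: vanishingly small
```

This is why comparing the log-likelihoods of the same sequence under two competing models is meaningful, while the absolute value of a single likelihood is not.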

Concerns

I am not sure what states and transitions my model should comprise. What is a 'good' model? How can you know, given the data, that you need to add more states to make the model expressive enough? The pomegranate tutorials are nice, but insufficient for me to apply HMMs in this context.

Marcel
  • I am a student of basic machine learning and I am not qualified at all to answer you, but here is a thought. What if you create a training set that looks something like this: step1Time, step2Time, ... stepFinalTime, label. Then two sample rows would look like (4,3,4,... -1, Passed) and (11,5,0,0,... 0, Failed), and you teach a neural net with this training data, finally feeding it the cross-validation and test data to see how it works. Does that sound doable or right? – SRC Nov 15 '16 at 15:34
  • thanks for the input, but in your setting, how can I incorporate the fact that I have streaming data in order to act in 'real time'? Also, in your setting the instances with label 'Failed' will always have 0 in one or more of the final features (= endpoints of steps in the process), so the ML classifier will be exploiting this – Marcel Nov 15 '16 at 15:41
  • ah ok. I understand what you say. Sorry, my model was not apt for what you are trying to achieve. As I said, I have started this subject and I am not an expert at all. – SRC Nov 15 '16 at 15:43
  • no problem, thanks for input – Marcel Nov 15 '16 at 15:51

1 Answer


Yes, the HMM is a viable way to do this, although it's a bit of overkill, since the FSM is a simple linear chain. The "model" can also be built from mean and variation for each string length, and you can simply compare the distance of the partial string to each set of parameters, rechecking at each desired time point.

The states are simple enough:

1 ==> 2 ==> 3 ==> ... ==> done

Each state has a loop back to itself; this is the most frequent choice. There is also a transition to "failed" from any state.

Thus, the Markov Matrices will be sparse, something like

          1    2    3    4  done failed
  1     0.8  0.1  0.0  0.0  0.0  0.1
  2     0.0  0.8  0.1  0.0  0.0  0.1
  3     0.0  0.0  0.8  0.1  0.0  0.1
  4     0.0  0.0  0.0  0.8  0.1  0.1
 done   0.0  0.0  0.0  0.0  1.0  0.0
 failed 0.0  0.0  0.0  0.0  0.0  1.0
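As a quick sanity check on that matrix (a sketch using the illustrative numbers above, not fitted values), the probability of eventually reaching "done" from each step follows from the standard absorbing-Markov-chain algebra, solving (I − Q)h = r where Q is the transient-to-transient block and r the column of one-step transitions into "done":

```python
import numpy as np

# Transition matrix from above: states 1-4, then done, failed.
P = np.array([
    [0.8, 0.1, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.8, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

Q = P[:4, :4]        # transient-to-transient block (states 1-4)
r_done = P[:4, 4]    # one-step probabilities into the "done" state
h = np.linalg.solve(np.eye(4) - Q, r_done)  # absorption probabilities
print(h)  # [0.0625 0.125 0.25 0.5]: completion odds halve with each earlier step
```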
Prune
  • Thanks! The FSM is indeed linear but my initial goal was to be able to tell whether a user (filling in each of the steps in the process) is more likely to complete the process or drop out _based on the time spent in each of the steps_. The proposed model then does not satisfy the Markovian property I guess. – Marcel Nov 16 '16 at 08:37
  • Perhaps I could create a chain of 500 states (each representing a second), and each state has certain emission probabilities. The symbols to emit are then the steps of the process. By fitting this model with data, then it could learn that for people that complete the process symbol 'step1' has a large probability of being emitted from state_1 (start) to state_100 (100 seconds after start). If I then test an unseen sequence of symbols that would still have "step1"'s after 100 seconds, the sequence is unlikely to be corresponding to a user that would complete the process. Greets – Marcel Nov 16 '16 at 08:38
  • Yes, that works from a different perspective of measurement. – Prune Nov 16 '16 at 17:40