7

I have a data frame representing a time series like for example:

timestamp: 1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28...

value: 0|0|3|6|3|3|6|3|3|6 |3 |0 |0 |0 |1 |3 |7 |0 |0 |1 |3 |7 |1 |3 |7 |3 |6 |3 ...

The goal is to classify different patterns (which can be at random positions) and label the values. This means to find the patterns:

  1. 3-6-3
  2. 1-3-7
  3. 0

and to extend the data frame to

timestamp: 1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28...

value: 0|0|3|6|3|3|6|3|3|6 |3 |0 |0 |0 |1 |3 |7 |0 |0 |1 |3 |7 |1 |3 |7 |3 |6 |3 ...

label: c|c|a|a|a|a|a|a|a|a |a |c |c |c |b |b |b |c |c |b |b |b |b |b |b |a |a |a ...

Note that there is no identical length of such a pattern.

The question is what kind of algorithms can be used for this unsupervised learning problem and maybe additionally what libraries/frameworks could be useful to implement such a task.

Thanks in advance!

user6188360
  • 79
  • 1
  • 3
  • Which algorithms have you already looked at? – Joey Ciechanowicz Apr 11 '16 at 13:27
  • I read something about Self organizing maps, Hidden Markov Models and LongShortTermMemories (LSTMs) but nevertheless its not really clear to me if they are suitable for time series pattern recognition – user6188360 Apr 11 '16 at 13:36
  • You should add that to your question so other users have more information of what you've tried/looked into already – Joey Ciechanowicz Apr 11 '16 at 13:45
  • Some people call a pattern in a time series a Shapelet. Maybe that helps during your research. http://alumni.cs.ucr.edu/~lexiangy/Shapelet/kdd2009shapelet.pdf – hellomichibye Apr 13 '16 at 12:36

1 Answers1

1

My answer deals with question of pattern matching. After successful matching the following algorithm gives the starting and position of a matched sequence as output. You can then use this information for labelling as you described in your question.

I recommend the SPRING algorithm as introduced in the paper:

Stream Monitoring under the Time Warping Distance (Sakurai, Faloutsos, Yamamuro) http://www.cs.cmu.edu/~christos/PUBLICATIONS/ICDE07-spring.pdf

The algorithms core is the DTW Distance (dynamic time warping) and the resulting distance is the DTW distance. The only difference to DTW is optimization as every position of the "stream"(sequence in which your are looking for a match) is a possible starting point for a match - opposed to DTW which computes the total distance matrix for each starting point. You provide a template, a threshold and a stream (the concept is developed for matching from a datastream, yet you can apply the algorithm onto your dataframe by simply looping through it)

Note:

  • Choice of threshold is not trivial and could pose a siginificant challenge - but this would be another question. ( threshold = 0, if you only want to match the exact same sequences, threshold >0, if you want to match similiar sequences as well)
  • If you are looking for several templates/target pattern, then you would have to create several instances of the SPRING class, each for one target pattern.

The following code is my (messy) implementation, which lacks a lot things (such a class definition and so on). Yet it works and should help guiding you to your answer.

I implemented as follows:

#HERE DEFINE 
#1)template consisting of numerical data points 
#2)stream consisting of numerical data points
template = [1, 2, 0, 1, 2]
stream = [1, 1, 0, 1, 2, 3, 1, 0, 1, 2, 1, 1, 1, 2 ,7 ,4 ,5]

#the threshold for the matching process has to be chosen by the user - yet in reality the choice of threshold is a non-trivial problem regarding the quality of the matching process
#Getting Epsilon from the user 
epsilon = input("Please define epsilon: ")
epsilon = float(epsilon)

#SPRING
#1.Requirements
n = len(template)
D_recent = [float("inf")]*(n)
D_now=[0]*(n)
S_recent=[0]*(n)
S_now=[0]*(n)
d_rep=float("inf")
J_s=float("inf")
J_e=float("inf")
check=0

#check/output
matches=[]

#calculation of accumulated distance for each incoming value
def accdist_calc (incoming_value, template,Distance_new, Distance_recent):
    for i in range (len(template)):
        if i == 0:
            Distance_new[i] = abs(incoming_value-template[i])
        else:
            Distance_new[i] = abs(incoming_value-template[i])+min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1])
    return Distance_new

#deduce starting point for each incoming value
def startingpoint_calc (template_length, starting_point_recent, starting_point_new, Distance_new, Distance_recent):
    for i in range (template_length):
            if i == 0:
                #here j+1 instead of j, because of the programm counting from 0 instead of from 1
                starting_point_new[i] = j+1
            else:
                if Distance_new[i-1] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
                    starting_point_new[i] = starting_point_new[i-1]                    
                elif Distance_recent[i] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
                    starting_point_new[i] = starting_point_recent[i]                    
                elif Distance_recent[i-1] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
                    starting_point_new[i] = starting_point_recent[i-1]                    
    return starting_point_new     

#2.Calculation for each incoming point x.t - simulated here by simply calculating along the given static list
for j in range (len(stream)):

    x = stream[j]    
    accdist_calc (x,template,D_now,D_recent)
    startingpoint_calc (n, S_recent, S_now, D_now, D_recent) 

    #Report any matching subsequence
    if D_now[n-1] <= epsilon:
        if D_now[n-1] <= d_rep:
            d_rep = D_now[n-1]
            J_s = S_now[n-1]            
            J_e = j+1
            print "REPORT: Distance "+str(d_rep)+" with a starting point of "+str(J_s)+" and ending at "+str(J_e)              

    #Identify optimal subsequence
    for i in range (n):
        if D_now[i] >= d_rep or S_now[i] > J_e:
            check = check+1
    if check == n:
        print "MATCH: Distance "+str(d_rep)+" with a starting point of "+str(J_s)+" and ending at "+str(J_e)
        matches.append(str(d_rep)+","+str(J_s)+","+str(J_e))
        d_rep = float("inf")
        J_s = float("inf")
        J_e = float("inf")
        check = 0 
    else:
        check = 0

    #define the recently calculated distance vector as "old" distance
    for i in range (n):
        D_recent[i] = D_now[i]
        S_recent[i] = S_now[i] 
Nikolas Rieble
  • 2,416
  • 20
  • 43