My answer deals with question of pattern matching. After successful matching the following algorithm gives the starting and position of a matched sequence as output. You can then use this information for labelling as you described in your question.
I recommend the SPRING algorithm as introduced in the paper:
Stream Monitoring under the Time Warping Distance (Sakurai, Faloutsos, Yamamuro)
http://www.cs.cmu.edu/~christos/PUBLICATIONS/ICDE07-spring.pdf
The algorithms core is the DTW Distance (dynamic time warping) and the resulting distance is the DTW distance. The only difference to DTW is optimization as every position of the "stream"(sequence in which your are looking for a match) is a possible starting point for a match - opposed to DTW which computes the total distance matrix for each starting point.
You provide a template, a threshold and a stream (the concept is developed for matching from a datastream, yet you can apply the algorithm onto your dataframe by simply looping through it)
Note:
- Choice of threshold is not trivial and could pose a siginificant challenge - but this would be another question.
( threshold = 0, if you only want to match the exact same sequences, threshold >0, if you want to match similiar sequences as well)
- If you are looking for several templates/target pattern, then you would have to create several instances of the SPRING class, each for one target pattern.
The following code is my (messy) implementation, which lacks a lot things (such a class definition and so on). Yet it works and should help guiding you to your answer.
I implemented as follows:
#HERE DEFINE
#1)template consisting of numerical data points
#2)stream consisting of numerical data points
template = [1, 2, 0, 1, 2]
stream = [1, 1, 0, 1, 2, 3, 1, 0, 1, 2, 1, 1, 1, 2 ,7 ,4 ,5]
#the threshold for the matching process has to be chosen by the user - yet in reality the choice of threshold is a non-trivial problem regarding the quality of the matching process
#Getting Epsilon from the user
epsilon = input("Please define epsilon: ")
epsilon = float(epsilon)
#SPRING
#1.Requirements
n = len(template)
D_recent = [float("inf")]*(n)
D_now=[0]*(n)
S_recent=[0]*(n)
S_now=[0]*(n)
d_rep=float("inf")
J_s=float("inf")
J_e=float("inf")
check=0
#check/output
matches=[]
#calculation of accumulated distance for each incoming value
def accdist_calc (incoming_value, template,Distance_new, Distance_recent):
for i in range (len(template)):
if i == 0:
Distance_new[i] = abs(incoming_value-template[i])
else:
Distance_new[i] = abs(incoming_value-template[i])+min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1])
return Distance_new
#deduce starting point for each incoming value
def startingpoint_calc (template_length, starting_point_recent, starting_point_new, Distance_new, Distance_recent):
for i in range (template_length):
if i == 0:
#here j+1 instead of j, because of the programm counting from 0 instead of from 1
starting_point_new[i] = j+1
else:
if Distance_new[i-1] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
starting_point_new[i] = starting_point_new[i-1]
elif Distance_recent[i] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
starting_point_new[i] = starting_point_recent[i]
elif Distance_recent[i-1] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
starting_point_new[i] = starting_point_recent[i-1]
return starting_point_new
#2.Calculation for each incoming point x.t - simulated here by simply calculating along the given static list
for j in range (len(stream)):
x = stream[j]
accdist_calc (x,template,D_now,D_recent)
startingpoint_calc (n, S_recent, S_now, D_now, D_recent)
#Report any matching subsequence
if D_now[n-1] <= epsilon:
if D_now[n-1] <= d_rep:
d_rep = D_now[n-1]
J_s = S_now[n-1]
J_e = j+1
print "REPORT: Distance "+str(d_rep)+" with a starting point of "+str(J_s)+" and ending at "+str(J_e)
#Identify optimal subsequence
for i in range (n):
if D_now[i] >= d_rep or S_now[i] > J_e:
check = check+1
if check == n:
print "MATCH: Distance "+str(d_rep)+" with a starting point of "+str(J_s)+" and ending at "+str(J_e)
matches.append(str(d_rep)+","+str(J_s)+","+str(J_e))
d_rep = float("inf")
J_s = float("inf")
J_e = float("inf")
check = 0
else:
check = 0
#define the recently calculated distance vector as "old" distance
for i in range (n):
D_recent[i] = D_now[i]
S_recent[i] = S_now[i]