I'm facing a problem and I feel like there's a solution in Graph theory or Graph databases. My knowledge in these fields is very limited. I'm hoping someone can recognise my problem and perhaps point me to the name of a technique used to solve it.
Simplified Example:
I am dealing with time-series of states. A simple example, where there are only two states:
TS State
t0 T
t1 F
t2 F
t3 F
t4 T
t5 T
t6 T
t7 F
t... ...
I could convert this into some graph with two nodes (T and F), where the "dwell time" in each state is an attribute (in brackets):
T(1) -> F(3) -> T(3) -> F(1)
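This conversion is essentially run-length encoding. A minimal sketch in Python, assuming one sample per time unit and the state stored as a short label; the function name `run_length_encode` is made up for illustration:

```python
from itertools import groupby


def run_length_encode(states):
    """Collapse consecutive equal states into (state, dwell_time) pairs."""
    return [(state, sum(1 for _ in group)) for state, group in groupby(states)]


# The example series t0..t7 from the table above.
series = ["T", "F", "F", "F", "T", "T", "T", "F"]
runs = run_length_encode(series)
print(runs)  # [('T', 1), ('F', 3), ('T', 3), ('F', 1)]
```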
An example of my problem is to write a "query" that extracts any sub-sequence matching this pattern: F(>=2) -> T(<10).
In my example above, my query would extract the sub-sequence:
F(3) -> T(3)
But if it were present in the dataset, the query could also extract sequences like:
F(2) -> T(8)
F(20) -> T(3)
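To make the intended semantics concrete, here is a rough sketch of such a query as a sliding window over (state, dwell) pairs. It assumes the series has already been collapsed into runs, and it represents a pattern as a list of (state, predicate) steps. The name `find_matches` and this pattern encoding are my own illustration, not an existing query language:

```python
def find_matches(runs, pattern):
    """Return every window of consecutive runs that satisfies the pattern.

    runs:    list of (state, dwell_time) pairs
    pattern: list of (state, predicate) pairs, one per step
    """
    matches = []
    width = len(pattern)
    for start in range(len(runs) - width + 1):
        window = runs[start:start + width]
        if all(state == wanted and accepts(dwell)
               for (state, dwell), (wanted, accepts) in zip(window, pattern)):
            matches.append(window)
    return matches


# T(1) -> F(3) -> T(3) -> F(1), the example sequence from above.
runs = [("T", 1), ("F", 3), ("T", 3), ("F", 1)]

# F(>=2) -> T(<10)
pattern = [("F", lambda d: d >= 2), ("T", lambda d: d < 10)]

print(find_matches(runs, pattern))  # [[('F', 3), ('T', 3)]]
```

This only handles fixed-length patterns; it does not cover the loops described below.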
The example I've put up is simplified: there are more than two states, and more advanced queries would allow loops, where these loops could be constrained either in the overall time spent in the loop or in the number of iterations. E.g.:
`T(>2) -> [loops of F(1)->T(1)] -> T(<10)`
Here the loop could perhaps be constrained to take no more than 10 iterations, or no more than 10 time units. The icing on the cake would be to find sequences like this:
T(n)->F(<n)
This translates as: sequences that start with T (and stay in T for n time units), followed by the F state, where it stays in F for less than n (i.e., F is shorter than the preceding T).
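Written by hand, this relational constraint is a comparison between adjacent runs. A small sketch, assuming the series is already available as (state, dwell) pairs; the function name is hypothetical:

```python
def t_followed_by_shorter_f(runs):
    """Find adjacent pairs T(n) -> F(m) where m < n."""
    return [
        (first, second)
        for first, second in zip(runs, runs[1:])
        if first[0] == "T" and second[0] == "F" and second[1] < first[1]
    ]


# T(1) -> F(3) -> T(3) -> F(1), the example sequence from above.
runs = [("T", 1), ("F", 3), ("T", 3), ("F", 1)]
print(t_followed_by_shorter_f(runs))  # [(('T', 3), ('F', 1))]
```

What I am after is a way to express this declaratively rather than coding each such query by hand.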
What I tried:
I originally thought of converting this to a string and using a regex to extract matches. Regex could do all I need, but falls short of comprehending arithmetic like "greater than". I guess I could keep my raw time-series of states (TFFFTTTF) and run a regex on that, but it seems pretty ugly.
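For reference, this is roughly what the raw-string approach looks like. It assumes every state fits in a single character and every character is one time unit. Fixed bounds such as >=2 or <10 can be written as counted quantifiers, with lookarounds so that only whole runs are matched; as far as I can tell, the relational T(n)->F(<n) case cannot be written this way.

```python
import re

# F(>=2) -> T(<10) on the raw string:
#   (?<!F)  the F run must start here (no F immediately before)
#   F{2,}   at least two time units in F
#   T{1,9}  between one and nine time units in T
#   (?!T)   the T run must end here (no T immediately after)
PATTERN = re.compile(r"(?<!F)F{2,}T{1,9}(?!T)")


def find_spans(raw):
    """Return (start, end) index pairs of every match in the raw state string."""
    return [match.span() for match in PATTERN.finditer(raw)]


print(find_spans("TFFFTTTF"))  # [(1, 7)]
```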
The fields of Natural Language Processing, Graph Theory and Graph databases come to mind as ones that would have similar problems. I don't know how I would encode the "duration of state" attribute in my graph, and I don't know if there's some sort of "industry-standard" query language for sub-sequence searches in graph databases.
Questions:
- Is there a framework to solve these sub-sequence extraction problems? If so, what is it called?
- Is there a "best practice"?
- How should I structure my data?
- Is there a query language to query sub-sequences in a database of sequences?