Query sub-sequences in time-series sequence data

Question

I'm facing a problem and I feel like there's a solution in Graph theory or Graph databases. My knowledge in these fields is very limited. I'm hoping someone can recognise my problem and perhaps point me to the name of a technique used to solve it.

Simplified Example:

I am dealing with time-series of states. A simple example, where there are only two states:

TS    State
t0    T
t1    F
t2    F
t3    F
t4    T
t5    T
t6    T
t7    F
t...  ...

I could convert this into a graph with two nodes (T and F), where the "dwell time" in each state is an attribute (in brackets):

T(1) -> F(3) -> T(3) -> F(1)
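
A minimal sketch of that conversion in Python, assuming the series is sampled at uniform time steps so that the dwell time is simply the number of consecutive samples. The function name `run_length_encode` is made up for illustration:

```python
from itertools import groupby

def run_length_encode(states):
    """Collapse consecutive equal states into (state, dwell_time) pairs."""
    return [(state, sum(1 for _ in group)) for state, group in groupby(states)]

runs = run_length_encode("TFFFTTTF")
print(" -> ".join(f"{state}({dwell})" for state, dwell in runs))
# T(1) -> F(3) -> T(3) -> F(1)
```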

An example of my problem is to write a "query" that extracts any sub-sequence matching the pattern F(>=2) -> T(<10). In my example above, the query would extract the sub-sequence F(3) -> T(3).

If they were present in the dataset, the query could also extract sequences like:

F(2) -> T(8)
F(20) -> T(3)
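
To make the intended semantics concrete, here is a rough sketch of such a query in plain Python. It assumes the data is already a list of (state, dwell) pairs; `find_matches` and the pattern format (a list of state/predicate pairs) are hypothetical, not an existing query language:

```python
def find_matches(runs, pattern):
    """Return every window of consecutive runs that satisfies the pattern.

    runs:    list of (state, dwell) pairs
    pattern: list of (state, predicate) pairs, one per consecutive run
    """
    width = len(pattern)
    matches = []
    for start in range(len(runs) - width + 1):
        window = runs[start:start + width]
        if all(state == wanted and accepts(dwell)
               for (state, dwell), (wanted, accepts) in zip(window, pattern)):
            matches.append(window)
    return matches

runs = [("T", 1), ("F", 3), ("T", 3), ("F", 1)]
# F(>=2) -> T(<10)
pattern = [("F", lambda d: d >= 2), ("T", lambda d: d < 10)]
print(find_matches(runs, pattern))  # [[('F', 3), ('T', 3)]]
```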

The example I've put up is simplified: there are more than two states, and more advanced queries would allow loops, where these loops could be constrained either in the overall time spent in the loop or in the number of iterations. E.g.

`T(>2) -> [loops of F(1)->T(1)] -> T(<10)`

Here, the loop could perhaps be constrained to take no more than 10 iterations, or no more than 10 time units. The icing on the cake would be to find sequences like this:

T(n)->F(<n)

This translates as: sequences that start with T (and stay in T for n time units), followed by the F state, where it stays in F for less than n (i.e., F is shorter than the preceding T).
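
A small sketch of what this relational constraint means, again assuming a list of (state, dwell) pairs. The helper name `t_then_shorter_f` is invented for this example:

```python
def t_then_shorter_f(runs):
    """Find adjacent pairs T(n) -> F(m) where m < n."""
    return [
        (prev, cur)
        for prev, cur in zip(runs, runs[1:])
        if prev[0] == "T" and cur[0] == "F" and cur[1] < prev[1]
    ]

runs = [("T", 1), ("F", 3), ("T", 3), ("F", 1)]
print(t_then_shorter_f(runs))  # [(('T', 3), ('F', 1))]
```

The point is that the constraint on F depends on a value bound earlier in the match, which is the part a fixed-threshold pattern cannot express.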

What I tried:

I originally thought of converting this to a string and using a regex to extract matches. A regex could do most of what I need, but it falls short on arithmetic comparisons like "greater than". I guess I could keep my raw time-series of states (TFFFTTTF) and run a regex on that... but it seems pretty ugly.
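
For reference, this is roughly what the raw-string variant looks like with Python's `re` module, assuming one character per time step. Counted quantifiers cover fixed thresholds such as F(>=2) -> T(<10); the lookbehind and lookahead are there so that only whole runs are matched:

```python
import re

# F(>=2) -> T(<10) on the raw string, one character per time step.
# (?<!F) and (?!T) pin the match to complete runs, so a run of 12 Ts
# is rejected instead of being reported as a run of 9.
PATTERN = re.compile(r"(?<!F)F{2,}T{1,9}(?!T)")

raw = "TFFFTTTF"
for m in PATTERN.finditer(raw):
    print(m.start(), m.group())  # 1 FFFTTT
```

This handles fixed bounds, but I don't see a way to express the relational pattern T(n) -> F(<n) with quantifiers alone, since the bound on F depends on the length of the preceding match.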

The fields of Natural Language Processing, graph theory and graph databases come to mind as ones that would have similar problems. I don't know how I would encode the "duration of state" attribute in my graph, and I don't know if there's some sort of "industry-standard" query language for sub-sequence searches in graph databases.

Questions:

- Is there a framework to solve these sub-sequence extraction problems? If so, what is it called? Is there a "best practice"? How should I structure my data? Is there a query language to query sub-sequences in a database of sequences?

score 1 · Answer 1 · answered Feb 27 '21 at 13:37

I might flip the problem around. You've indicated that this is time series data. Given that, I might create a new state node every time the state changes. I would then encode the "dwell" time in the previous node and link the new node to the previous state node, creating a linked list in the graph database. With this structure, your pattern query becomes simple.
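
To illustrate the structure independently of any particular database, here is a minimal in-memory sketch in Python. `StateNode` and `build_chain` are hypothetical names, and the dwell time is assumed to be the number of consecutive samples in a state:

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Optional

@dataclass(eq=False)  # identity comparison; avoids recursing through prev/next
class StateNode:
    label: str
    dwell_time: int
    prev: Optional["StateNode"] = field(default=None, repr=False)
    next: Optional["StateNode"] = field(default=None, repr=False)

def build_chain(states):
    """Create one node per state change and link the nodes both ways."""
    head = tail = None
    for label, group in groupby(states):
        node = StateNode(label, sum(1 for _ in group), prev=tail)
        if tail is None:
            head = node          # first node of the chain
        else:
            tail.next = node     # link previous node forward
        tail = node
    return head

node = build_chain("TFFFTTTF")
while node is not None:
    print(node.label, node.dwell_time)
    node = node.next
```

Each node carries its own label and dwell time, so a pattern query only has to follow the `next` links and test those two attributes at each step.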

Objectivity/DB is a schema-based object/graph database with a complete set of graph navigational query capabilities. It has its own query language called Declarative Objectivity, or DO.

We start with a schema definition:

UPDATE SCHEMA { 
    CREATE CLASS State{
        label       : String,
        dwellTime   : INTEGER { Storage: B32 },               
        prev        : Reference { referenced: State, Inverse: next },
        next        : Reference { referenced: State, Inverse: prev}
    }           
};

Then we can execute a DO query like the following:

MATCH p = (:State {label == 'T' AND dwellTime > 5})
           -->(:State {label == 'F' AND dwellTime > 5})
           -->(:State {label == 'T' AND dwellTime < 2})
           -->(:State {label == 'T' AND dwellTime > 100})
           -->(:State {label == 'F' AND dwellTime > 100})
           RETURN p;

This kind of query will find all of the "TFTTF" patterns that meet the specified dwell times.
