
I have logs from a bunch (millions) of small experiments.

Each log contains a list (tens to hundreds) of entries. Each entry is a timestamp and an event ID (there are several thousand event IDs, and each may occur many times across the logs):

1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta

I know that one event may trigger other events later.

I am researching this dataset. I am looking for "stable" sequences of events that occur often enough in the experiments.

Is there a way to do this without writing too much code and without using proprietary software? The solution should be scalable enough, and work on large datasets.

I think that this task is similar to what bioinformatics does — finding sequences in DNA and such. Only my task has an alphabet of many more than four letters... (Update, thanks to @JayInNyc: proteomics deals with larger alphabets than mine.)

(Note, BTW, that I do not know beforehand how stable and similar I want my sequences to be, what the minimal sequence length is, etc. I'm researching the dataset and will have to figure this out as I go.)

Anyway, any suggestions on the approaches/tools/libraries I could use?


Update: Some answers to the questions in comments:

Stable sequences: found often enough across the experiments. (How often is enough? Don't know yet. Looks like I need to compute a ranking of the chains and discard the rarest.)

Similar sequences: sequences that look similar. "Are the sequences 'A B C D E' and 'A B C E D' (minor difference in sequence) similar according to you? Are the sequences 'A B C D E' and 'A B C 1 D E' (sequence of occurrence of selected events is same) also similar according to you?" — Yes to both questions. More drastic mutations are probably also OK. Again, I'd like to be able to compute a ranking and discard the most dissimilar...

Timing: I can discard timing information for now (but not order). But it would be cool to have it in a similarity index formula.


Update 2: Expected output.

In the end I would like to have a rating of the most popular, longest, most stable chains. A combination of all three factors should affect the calculation of the rating score.

A chain in such a rating is, of course, really a cluster of sufficiently similar chains.

A synthetic example of a chain-cluster:

alpha
beta
gamma
[garbage]
[garbage]
delta

another:

alpha
beta
gamma|zeta|epsilon
delta

(or whatever variant has not come to my mind right now.)

So, the end output would be something like this (numbers are completely random in this example):

Chain cluster ID | Times found | Time stab. factor | Chain stab. factor | Length | Score
A                | 12345       | 123               | 3                  | 5      | 100000
B                | 54321       | 12                | 30                 | 3      | 700000
Alexander Gladysh
  • *I do not know beforehand how stable and similar I want my sequences* - What does stable mean? Are two sequences *similar* even if they are not identical? – ArjunShankar Jun 28 '14 at 17:07
  • Does only the order of events matter in deciding if they are similar or does the time between them matter as well? Are the sequences 'A B C D E' and 'A B C E D' (minor difference in sequence) similar according to you? Are the sequences 'A B C D E' and 'A B C 1 D E' (sequence of occurrence of selected events is same) also similar according to you? If you do not know right now, do you still think you *might* want to decide that these sequences are similar, later on? – ArjunShankar Jun 28 '14 at 17:09
  • Many compression algorithms do essentially the same: they look for repeating patterns; the longer the patterns are and the more frequently they occur, the better, as replacing them with a token yields better savings. Therefore I'd look at common open-source compression code and get inspired by the portions that build and examine the tree of patterns. – Deleted User Jun 28 '14 at 22:21
  • @Bushmills compression algorithms have the right to discard sequence information if they are constrained by resources... I would prefer to keep everything I can. – Alexander Gladysh Jun 30 '14 at 06:15
  • @ArjunShankar I've updated the question with answers to your questions. – Alexander Gladysh Jun 30 '14 at 06:15
  • @AlexanderGladysh Thanks for the update! I just wanted to help make the question more objectively answerable. – ArjunShankar Jun 30 '14 at 10:30
  • By the way: I expect spell-checkers might use the kind of algorithms you would want (although the tokens forming a sequence for a spell-checker are alphabets, and in your problem the tokens are event names). – ArjunShankar Jun 30 '14 at 10:33
  • @AlexanderGladysh what's the output you're expecting? – Tiago Lopo Jun 30 '14 at 13:09

2 Answers


I have thought about this setup for the past day or so -- how to do it in a sane, scalable way in bash, etc. The answer is really driven by the relational information you want to draw from the data and the apparent size of the dataset you currently have. The cleanest solution will be to load your datasets into a relational database (MariaDB would be my recommendation).

Since your data already exists in a fairly clean format, you have two options for getting it into a database. (1) If the files have the data in a usable row/column layout, you can simply use LOAD DATA INFILE to bring your data into the database; or (2) parse the files with bash in a while read line; do loop, massage the data into the table format you want, and use mysql batch mode to load the information into MySQL in a single pass. The general form of the bash command would be mysql -uUser -hHost database -Bse "your insert command".
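For concreteness, here is a rough sketch of option (1). The database name logs, the events table layout, and the logs/*.log file naming are my assumptions, not something given in the question:

mysql -uUser -hHost logs -Bse "
  CREATE TABLE IF NOT EXISTS events (
    experiment VARCHAR(64),
    ts         INT UNSIGNED,
    event_id   VARCHAR(64),
    INDEX (experiment, ts)
  );"

# Bulk-load each log file; the file name (minus .log) serves as the experiment ID.
for f in logs/*.log; do
  mysql -uUser -hHost logs --local-infile=1 -Bse "
    LOAD DATA LOCAL INFILE '$f'
    INTO TABLE events
    FIELDS TERMINATED BY ' '
    (ts, event_id)
    SET experiment = '$(basename "$f" .log)';"
done

With millions of small files you would want to batch many LOAD DATA statements per mysql invocation (and the server must allow local_infile), but the shape of the commands stays the same.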

Once in a relational database, you then have the proper tool for the job: you can run flexible queries against your data in a sane manner instead of continually writing/re-writing bash snippets to handle your data in a different way each time. That is probably the scalable solution you are looking for. A little more work up-front, but a much better setup going forward.
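As an illustration of the kind of query this enables (again a sketch against the hypothetical events table above; LEAD() needs MariaDB 10.2 or later), counting how often each consecutive event pair occurs across all experiments could look like:

mysql -uUser -hHost logs <<'SQL'
SELECT event_id, next_event, COUNT(*) AS times_found
FROM (
    SELECT event_id,
           LEAD(event_id) OVER (PARTITION BY experiment ORDER BY ts) AS next_event
    FROM events
) AS pairs
WHERE next_event IS NOT NULL
GROUP BY event_id, next_event
ORDER BY times_found DESC
LIMIT 50;
SQL

Longer windows can be built the same way by adding further LEAD(event_id, k) columns to the inner query.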

David C. Rankin
  • Using relational DB is OK, thanks. Can you suggest an approach on how to solve my problem with SQL without writing too much (SQL) code? – Alexander Gladysh Jun 30 '14 at 14:05

Wikipedia defines an algorithm as 'a precise list of precise steps'. You say: 'I am looking for "stable" sequences of events that occur often enough in the experiments.' "Stable" and "often enough" without definitions make the task of giving you an algorithm impossible.

So I give you a trivial one that calculates the frequency of sequences of length 2. I will ignore the timestamps. Here is the awk code (pW stands for "previous word", pcs stands for "pair counters"):

#!/usr/bin/awk -f

# Read the first entry and remember its event ID as the "previous word".
BEGIN { getline; pW=$2; }

# For every further entry, count the (previous event, current event) pair.
{ pcs[pW, $2]++; pW=$2; }

# Print every pair with its count.
END {
    for (i in pcs)
        print i, pcs[i];
}

I duplicated your sample (with a small variation) to get something more meaningful-looking:

1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta

Running the code above on it gives the following (the two event names in each key are joined by awk's non-printing SUBSEP character, which is why they appear concatenated):

gammaalpha 1
alphabeta 4
gammabeta 3
deltaalpha 3
betagamma 4
alphadelta 1
betadelta 3

which can be interpreted as: alpha followed by beta, and beta followed by gamma, are the most frequent length-two sequences, each occurring 4 times in the sample. I guess that would be your definition of a stable sequence occurring often enough.

What's next?

(1) You can easily adapt the code above to sequences of length N, and to find sequences occurring often enough you can sort the output on the second column (sort -k2nr); a sketch of such a generalisation follows after this list.

(2) To put a limit on N you can stipulate that no event triggers itself, which provides you with a cut-off point. Or you can place a limit on the timestamps, i.e. on the difference between consecutive events.

(3) So far those sequences were really strings and I used exact matching between them (CLRS terminology). Nothing prevents you from using your favourite similarity measure instead:

{ pcs[CLIFY(pW, $2)]++; pW=$2; }

CLIFY would be a function which takes k consecutive events and puts them into a bin, i.e. maybe you want ABCDE and ABDCE to go to the same bin. CLIFY could of course take the set of bins so far as an additional argument.
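To make point (1) concrete, here is one possible generalisation (my sketch, not part of the original answer): a shell pass that counts windows of N consecutive events per log file, merges the per-file counts, and ranks them. The logs/*.log layout and N=3 are placeholders:

for f in logs/*.log; do
  awk -v N=3 '
    { ev[NR] = $2 }                 # remember the event name of each line
    NR >= N {                       # once a full window is available...
        seq = ev[NR-N+1]
        for (i = NR-N+2; i <= NR; i++) seq = seq " " ev[i]
        count[seq]++                # ...count the N-event sequence
        delete ev[NR-N+1]           # keep memory bounded
    }
    END { for (s in count) print s "\t" count[s] }
  ' "$f"
done |
awk -F'\t' '{ total[$1] += $2 } END { for (s in total) print s "\t" total[s] }' |
sort -t$'\t' -k2,2nr | head -20

To experiment with point (3), you could canonicalise seq before counting it, for example by sorting the events inside the window, so that near-identical orderings fall into the same bin.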

The choice of awk is for convenience. A single awk pass won't fly on millions of logs, but you can easily run many instances in parallel (see the sketch below).
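For example, assuming the script above is saved as pairs.awk and GNU parallel is installed (both are assumptions on my part), a per-file pass could be fanned out over all cores and the partial counts merged afterwards:

find logs -name '*.log' |
parallel 'awk -f pairs.awk {}' |
awk '{ cnt[$1] += $2 } END { for (s in cnt) print s, cnt[s] }' |
sort -k2,2nr | head

GNU parallel keeps each job's output grouped, so the merged stream stays line-oriented and the final awk simply sums the per-file counts.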

It is unclear what you want to use this for, but a Google search for Markov chains and Mark V. Shaney would probably help.

user1666959