I have logs from a bunch (millions) of small experiments.
Each log contains a list (tens to hundreds) of entries. Each entry is a timestamp and an event ID (there are several thousands of event IDs, each of may occur many times in logs):
1403973044 alpha 1403973045 beta 1403973070 gamma 1403973070 alpha 1403973098 delta
I know that one event may trigger other events later.
I am researching this dataset. I am looking for "stable" sequences of events that occur often enough in the experiments.
Is there a way to do this without writing too much code and without using proprietary software? The solution should be scalable enough, and work on large datasets.
I think that this task is similar to what bioinformatics does — finding sequences in a DNA and such. Only my task includes many more than four letters in an alphabet... (Update, thanks to @JayInNyc: proteomics deals with larger alphabets than mine.)
(Note, BTW, that I do not know beforehand how stable and similar I want my sequences, what is the minimal sequence length etc. I'm researching the dataset, and will have to figure this out on the go.)
Anyway, any suggestions on the approaches/tools/libraries I could use?
Update: Some answers to the questions in comments:
Stable sequences: found often enough across the experiments. (How often is enough? Don't know yet. Looks like I need to calculate a top of the chains, and discard rarest.)
Similar sequences: sequences that look similar. "Are the sequences 'A B C D E' and 'A B C E D' (minor difference in sequence) similar according to you? Are the sequences 'A B C D E' and 'A B C 1 D E' (sequence of occurrence of selected events is same) also similar according to you?" — Yes to both questions. More drastic mutations are probably also OK. Again, I'd like to be able to calculate a top and discard the most dissimilar...
Timing: I can discard timing information for now (but not order). But it would be cool to have it in a similarity index formula.
Update 2: Expected output.
In the end I would like to have a rating of most popular longest stablest chains. A combination of all three factors should have effect in the calculation of the rating score.
A chain in such rating is, obviously, rather a cluster of similar enough chains.
A synthetic example of a chain-cluster:
alpha beta gamma [garbage] [garbage] delta
another:
alpha beta gamma|zeta|epsilon delta
(or whatever variant did not came to my mind right now.)
So, the end output would be something like that (numbers are completely random in this example):
Chain cluster ID | Times found | Time stab. factor | Chain stab. factor | Length | Score A | 12345 | 123 | 3 | 5 | 100000 B | 54321 | 12 | 30 | 3 | 700000