
I would like to match patterns from a given pattern library, returning the longest detected patterns.

However, I only have the interleaved result of multiple parallel tasks in a log file, e.g. from multiple cores of a processor.

Is this a known application in data mining?

I thought of one solution using regex, similar to Regex subsequence matching. However, a distance metric allowing some fuzziness would be nice, e.g. for the case where one activity in a sequence is missing.
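As a sketch of the regex idea (the log string and pattern here are hypothetical), a subsequence match for a pattern like `ABC` can be built by joining its activities with `.*`, and Python's standard `difflib` offers a crude similarity ratio that could serve as the fuzziness metric:

```python
import re
from difflib import SequenceMatcher

log = "XAYBZC"    # hypothetical interleaved log, one character per activity
pattern = "ABC"

# Subsequence regex: "A.*B.*C" matches ABC as a (non-contiguous) subsequence
subseq_re = ".*".join(map(re.escape, pattern))
print(bool(re.search(subseq_re, log)))               # True

# Similarity ratio as a simple distance metric: one missing activity
print(SequenceMatcher(None, pattern, "AC").ratio())  # 0.8 (1.0 = identical)
```

This is only a starting point; a proper fuzzy match would need an edit-distance threshold rather than a single ratio.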

sequence example

Trenton McKinney
Sheldon
  • 1
    I suspect that this is an NP-hard problem. Would you prefer a greedy solution, or a computationally infeasible one? – btilly Aug 18 '20 at 16:50
  • 3
    Regex can do many things, but unbaking a cake isn't one of them. The real solution here is to create separate log files per thread and match against each. If the interleaving can't be dealt with, then please provide more detail on the tokens as I suspect the examples you gave above are severely abstracted to the point that any answer here involving regex will be useless to you. – kerasbaz Aug 18 '20 at 20:06
  • (1). Can you give more clarity on the `distance metric` you expect? (2). You are expecting the 'longest' matching pattern, so why are there 2 results in `expected detected pattern`; shouldn't it be only `ABC`? – Liju Aug 19 '20 at 08:10
  • If you cannot create separate log files you could, of course, instead try to extend the log-messages with the relevant info (Resource #, Core #) for each message. – Hans Olsson Aug 19 '20 at 14:20

3 Answers


As others have pointed out, it would help if we understood the semantics of what you are trying to accomplish. I am making a guess here that the patterns in your pattern library all pertain to

  • a single resource, or
  • a set of resources

If that is the case, I would suggest you add that information to your pattern library first to make it explicit. For example, your pattern library would look like:

1: A
1: AB
1: ABC
2: AD
2: C
2: D

If you want to cover patterns for a set of resources, it could look like this contrived example:

3: X
4: Y
3,4: Z

Now, you can separate out the log records pertaining to each resource from the interleaved log file (assuming that the log file does have the resource identifier). You can then apply the pattern matching to uncover the longest pattern.

In essence, separate your concerns and apply the solution for each sub-problem.
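A minimal sketch of this separate-and-match idea, assuming each log record carries a resource identifier (the pattern library and log below are hypothetical):

```python
from collections import defaultdict

# Hypothetical pattern library, keyed by resource as suggested above
patterns = {1: ["A", "AB", "ABC"], 2: ["AD", "C", "D"]}

# Interleaved log: (resource_id, activity) pairs
log = [(1, "A"), (2, "A"), (1, "B"), (2, "D"), (1, "C")]

# 1. Demultiplex: build one activity sequence per resource
per_resource = defaultdict(str)
for resource, activity in log:
    per_resource[resource] += activity

# 2. Longest library pattern contained in each resource's sequence
longest = {
    r: max((p for p in patterns[r] if p in seq), key=len, default=None)
    for r, seq in per_resource.items()
}
print(longest)  # {1: 'ABC', 2: 'AD'}
```

Each sub-problem (demultiplexing, then matching) can then be refined independently, e.g. by swapping the substring test for a fuzzy comparison.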

vvg

If we have the log file and the pattern library, we can solve the problem with stacks. We start reading from the log file. If the new log entry, combined with an existing stack, still forms a pattern in the pattern library, we push it onto that stack. Otherwise, we put it in a new stack. Please send your comments to help complete the details of this answer.
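One possible reading of this stack idea, sketched with a hypothetical pattern library and event stream: an event extends the first stack whose contents plus the event are still a prefix of some library pattern; otherwise it starts a new stack. Note that this greedy assignment can mis-attribute events, which is consistent with the NP-hardness concern raised in the comments.

```python
patterns = ["A", "AB", "ABC", "AD", "C", "D"]     # hypothetical library
prefixes = {p[:i] for p in patterns for i in range(1, len(p) + 1)}

log = "ABDAC"   # hypothetical interleaved event stream
stacks = []     # each stack accumulates one candidate pattern

for event in log:
    for stack in stacks:
        if "".join(stack) + event in prefixes:
            stack.append(event)   # event extends this candidate pattern
            break
    else:
        stacks.append([event])    # no stack fits: start a new candidate

print(["".join(s) for s in stacks])  # ['ABC', 'D', 'A']
```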


You have a problem that is easy to describe, and it would be good if we knew your constraints. How fast does this need to run?

In Python, you would have a single iterator over your resources, pushing to a separate generator per resource to do the pattern matching. That is, the iterator yields (resource 1, A), which is pushed into the generator for resource 1 to see whether it matches a pattern yet. The generator occasionally kicks out the matched pattern.
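The generator-per-resource idea can be sketched like this (the pattern library and log are hypothetical, and `matcher` here reports the longest match seen so far rather than "occasionally kicking out" completed ones):

```python
def matcher(patterns):
    """Per-resource generator: receives events, yields the longest match so far."""
    seq, best = "", None
    while True:
        event = yield best
        seq += event
        # Longest library pattern contained in this resource's event sequence
        best = max((p for p in patterns if p in seq), key=len, default=None)

patterns = ["A", "AB", "ABC", "AD"]   # hypothetical pattern library
log = [(1, "A"), (2, "A"), (1, "B"), (2, "D"), (1, "C")]

gens, result = {}, {}
for resource, event in log:           # single iterator over the interleaved log
    if resource not in gens:
        gens[resource] = matcher(patterns)
        next(gens[resource])          # prime the generator
    result[resource] = gens[resource].send(event)

print(result)  # {1: 'ABC', 2: 'AD'}
```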

In practice, you probably just want a Splunk plug-in, or to throw everything into a database. This type of analysis is used for common tasks like "Find all customers who have had three sessions in the past two weeks but abandoned their carts, with one common item accounting for over 75% of the cart total. Send this on-the-fence customer a 5% discount good for 24 hours."

Charles Merriam