
I am working on a problem of finding similar content in a log file. Let's say I have a log file which looks like this:

 show version
 Operating System (OS) Software

 Software
 BIOS:      version 1.0.10
 loader:    version N/A
 kickstart: version 4.2(7b)
 system:    version 4.2(7b)
 BIOS compile time:       01/08/09
 kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
 kickstart compile time:  8/16/2010 13:00:00 [09/29/2010 23:10:48]
 system image file is:    bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
 system compile time:     8/16/2010 13:00:00 [09/30/2010 00:46:36]

 Hardware
 xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
 xxxxxxx, xxxx with 1033100 kB of memory.
 Processor Board ID xxxx

 Device name: xxx-xxx-1 
 bootflash:    1000440 kB 
 slot0:              0 kB (expansion flash)

To a human reader it is obvious that "Software" and the lines below it form one section, and that "Hardware" and the lines below it form another. Is there a way to use machine learning, or some other technique, to cluster similar sections based on their pattern? I have shown two sections with a similar pattern, but patterns can vary between sections, and sections with different patterns should be identified as different sections. I tried cosine similarity, but it doesn't help much because the words aren't similar; only the pattern is.

1 Answer

I actually see two separate machine learning problems:

1) If I understood you correctly, the first problem is splitting each log into distinct sections: one for Hardware, one for Software, and so on.

One approach is to extract the headings that mark the beginning of each new section. To do so, you could take a set of different logs and manually label each row as heading=true or heading=false.

Now you could train a classifier on the labeled data; the resulting model predicts whether a given row is a heading.
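As a rough illustration of the labeling-and-training idea, here is a minimal sketch in plain Python. The feature set (colon, digit, line length, preceding blank line), the perceptron as the classifier, and the helper names `features`, `train_perceptron`, and `find_headings` are all assumptions made for this example; a real setup would need far more labeled rows and probably a library classifier.

```python
import re

def features(lines, i):
    """Binary features for line i. The feature choice is an assumption."""
    line = lines[i]
    prev_blank = i == 0 or not lines[i - 1].strip()
    return [
        1.0,                                      # bias term
        1.0 if ":" in line else 0.0,              # "key: value" rows contain a colon
        1.0 if re.search(r"\d", line) else 0.0,   # data rows often contain digits
        1.0 if len(line.split()) <= 2 else 0.0,   # headings tend to be short
        1.0 if prev_blank else 0.0,               # headings often follow a blank line
    ]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(X, y, epochs=20):
    """Plain perceptron: nudge the weights whenever a row is misclassified."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if score(w, x) > 0 else 0
            if pred != target:
                w = [wi + (target - pred) * xi for wi, xi in zip(w, x)]
    return w

def find_headings(w, lines):
    """Return the non-blank lines that the model classifies as headings."""
    return [ln.strip() for i, ln in enumerate(lines)
            if ln.strip() and score(w, features(lines, i)) > 0]

# Hand-labeled rows (heading=True/False); None marks a blank line.
train_lines = [
    "",
    "Software",
    "BIOS:      version 1.0.10",
    "loader:    version N/A",
    "BIOS compile time:       01/08/09",
]
train_labels = [None, True, False, False, False]

X = [features(train_lines, i) for i, lab in enumerate(train_labels) if lab is not None]
y = [int(lab) for lab in train_labels if lab is not None]
w = train_perceptron(X, y)

# Rows the model has not seen during training.
test_lines = [
    "",
    "Hardware",
    'xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")',
    "Processor Board ID xxxx",
    "",
    "Device name: xxx-xxx-1",
]
headings = find_headings(w, test_lines)
print(headings)
```

The features describe the shape of a row rather than its words, which matches the observation in the question that the pattern, not the vocabulary, is what sections have in common.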

2) Once you have these sections, you can split each log into them and treat each section as a separate document.

I would first try straightforward document clustering using a standard NLP pipeline:

  1. Tokenize each document to get its tokens
  2. Normalize the tokens (stemming may not be the best idea for logs)
  3. Create a tf-idf vector for each document
  4. Start with a simple clustering algorithm such as k-means to cluster the sections

After clustering, sections that are similar to each other should end up in the same cluster.
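The four steps can be sketched in plain Python as follows. This is only an illustration under several assumptions: the tokenization rule, the `<num>` placeholder for any token containing a digit, the unsmoothed idf, the deterministic farthest-first seeding (k-means is normally seeded randomly), and the four log excerpts treated as "sections" are all choices made for the example. In practice a library such as scikit-learn (`TfidfVectorizer`, `KMeans`) would replace the hand-written parts.

```python
import math
from collections import Counter

def tokenize(text):
    """Steps 1 and 2: split on whitespace, lowercase, strip punctuation,
    and replace every token containing a digit with a placeholder."""
    tokens = []
    for raw in text.lower().split():
        tok = raw.strip(":,.()[]\"'")
        if tok:
            tokens.append("<num>" if any(c.isdigit() for c in tok) else tok)
    return tokens

def tfidf_vectors(docs):
    """Step 3: one L2-normalized tf-idf vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    df = Counter(term for c in counts for term in c)   # document frequency
    vocab = sorted(df)
    idf = {t: math.log(len(docs) / df[t]) for t in vocab}
    vectors = []
    for c in counts:
        vec = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
        vectors.append([v / norm for v in vec])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Step 4: basic k-means with deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Excerpts from the log in the question, treated as four separate sections.
sections = [
    "BIOS: version 1.0.10\nloader: version N/A\nkickstart: version 4.2(7b)",
    "system: version 4.2(7b)\nBIOS compile time: 01/08/09",
    "bootflash: 1000440 kB\nslot0: 0 kB (expansion flash)",
    "xxxxxxx, xxxx with 1033100 kB of memory.",
]
labels = kmeans(tfidf_vectors(sections), k=2)
print(labels)
```

Replacing numeric tokens with a placeholder is one way to make sections compare by pattern rather than by exact values. Because the placeholder then occurs in every section here, its idf is zero and it does not influence the clustering.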

I hope this helps. I think the first task in particular is quite hard, and hand-tailored patterns may perform better.
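For comparison, a hand-tailored pattern for the log in the question might look like the sketch below. The rule itself is an assumption made for this example: a heading is a line that follows a blank line (or starts the log) and consists only of letters and spaces. The name `split_sections` is hypothetical, and other log formats would need different rules.

```python
import re

# Assumed heading rule: letters and spaces only (no colon, no digits).
HEADING = re.compile(r"^[A-Za-z][A-Za-z ]*$")

def split_sections(log_text):
    """Split a log into (heading, body_lines) pairs using the rule above."""
    sections = []
    prev_blank = True          # treat the start of the log like a blank line
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            prev_blank = True
            continue
        if prev_blank and HEADING.match(line):
            sections.append((line, []))        # start a new section
        elif sections:
            sections[-1][1].append(line)       # continue the current section
        else:
            sections.append(("", [line]))      # content before any heading
        prev_blank = False
    return sections

LOG = """\
show version
Operating System (OS) Software

Software
BIOS:      version 1.0.10
loader:    version N/A

Hardware
Processor Board ID xxxx

Device name: xxx-xxx-1
bootflash:    1000440 kB
"""

sections = split_sections(LOG)
for heading, body in sections:
    print(heading, len(body))
```

A rule like this is brittle: a data row without a colon or digit that happens to follow a blank line would be taken for a heading, which is where the learned classifier from the first step could help.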

dice89