
I want to implement a similarity function that can accurately identify similar log files. So far, I have been unable to find a suitable similarity metric for my problem.

I have log files generated from several PCs (around 300), where each file contains the IP addresses visited on a daily basis. I want to compute similarity by comparing the visited IP addresses day by day; that is, I want to compare day 1 of PC1 with day 1 of PC2, and so on.

For example (assume each log file contains only 4 days of data; if nothing was visited on a particular day, that row is left blank):

PC1:
day1: 155.69.23.11, 155.34.45.5
day2: 165.34.5.67
day3:             //blank - nothing visited
day4: 155.35.45.55

PC2: 
day1: 155.34.45.5, 155.34.45.6
day2: 165.34.5.67
day3: 155.35.45.55
day4:              //blank - nothing visited

My similarity score between PC1 and PC2 would be:

Total similarity = similarity(day1) + similarity(day2) + similarity(day3) + similarity(day4)

For this problem, I can use the Jaccard similarity index (treating each day as a set of IP addresses). But I am not sure whether that is a suitable metric, or whether there are any technical flaws (or conditions that need to be satisfied) in applying the Jaccard index to this problem.

In finding similar documents, I have seen people apply the Jaccard index to the whole document, but that is not what I am looking for. In my case, I want to apply the Jaccard index to each day and sum the results to get the final similarity value. Is this approach technically sound?
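A minimal sketch of this per-day Jaccard sum (the function names and the handling of blank days are my assumptions; here two blank days count as perfectly similar, which you may want to change):

```python
def jaccard(a, b):
    """Jaccard index of two sets of IPs.
    Empty vs. empty is defined as 1.0 here (both PCs idle)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def total_similarity(pc1_days, pc2_days):
    """Sum of per-day Jaccard scores; each argument is a list of sets,
    one set of visited IPs per day, aligned by day index."""
    return sum(jaccard(d1, d2) for d1, d2 in zip(pc1_days, pc2_days))

# The example data from the question:
pc1 = [{"155.69.23.11", "155.34.45.5"}, {"165.34.5.67"}, set(), {"155.35.45.55"}]
pc2 = [{"155.34.45.5", "155.34.45.6"}, {"165.34.5.67"}, {"155.35.45.55"}, set()]
```

With these sets, day 1 scores 1/3 (one shared IP out of three distinct), day 2 scores 1, and days 3 and 4 each score 0 (a blank day against a non-blank one).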

Thank you.

Update:

Objective of this study: we have around 1000 IP addresses, and we want to monitor the pattern of browsing to these 1000 IP addresses, where each PC is used by the same person. The study runs for 5 working days, and we log the visited IP addresses. If any of these IP addresses is visited on Monday, that visit gets the highest weight, while a visit on Friday gets the lowest weight; the weights for Tuesday, Wednesday and Thursday are normalized accordingly. This is why I am interested in day-wise similarity, while my ultimate objective is to find the people who have similar browsing patterns (considering all 5 days). The study is somewhat unusual, but I am doing it for a project.
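The day weighting described above can be folded into the same sum. A sketch, assuming a simple linear weighting from Monday (highest) to Friday (lowest), normalized to sum to 1; the exact normalization is my assumption, not something specified in the question:

```python
def jaccard(a, b):
    """Jaccard index; empty vs. empty defined as 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Illustrative weights: Monday highest, Friday lowest, normalized to sum to 1.
raw = [5, 4, 3, 2, 1]
weights = [w / sum(raw) for w in raw]

def weighted_similarity(pc1_days, pc2_days):
    """Weighted sum of per-day Jaccard scores over the 5 working days."""
    return sum(w * jaccard(d1, d2)
               for w, d1, d2 in zip(weights, pc1_days, pc2_days))
```

Because the weights sum to 1, the result stays in [0, 1]: two PCs with identical logs score exactly 1.0, which makes scores comparable across pairs.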

andand
Maggie

1 Answer


Well, mathematically (and thus from a programming point of view), you can do it this way.

However, the results may or may not be what you are interested in.

But we cannot help you much with that, because we don't know your objectives (what do you want to discover? People accessing Facebook and Google? That will likely dominate your results...), nor do we have insight into your data.

Using the raw IP addresses also neglects the fact that certain addresses are essentially equivalent (e.g. 173.194.70.113, 173.194.70.139 and 173.194.70.102 are all google.com, even in the same datacenter). At the same time, one address can serve millions of completely different web pages (e.g. http://www.websitelooker.com/ip/81.169.145.160, one IP of a large-scale hoster in Germany).

So maybe you first need to work out what you actually want to achieve. Then do feature extraction to capture what you need, and then define an appropriate similarity function.
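As a tiny illustration of such feature extraction, you could collapse IPs belonging to the same service into one feature before computing set similarity. The `SERVICE_OF` mapping below is entirely hypothetical; in practice it would come from reverse DNS, WHOIS data, or a hand-maintained list for the study's 1000 addresses:

```python
# Hypothetical mapping from raw IPs to the service they belong to.
SERVICE_OF = {
    "173.194.70.113": "google.com",
    "173.194.70.139": "google.com",
    "173.194.70.102": "google.com",
}

def to_features(ips):
    """Map each visited IP to its service name, falling back to the raw IP
    for addresses not covered by the mapping."""
    return {SERVICE_OF.get(ip, ip) for ip in ips}

day = {"173.194.70.113", "173.194.70.139", "81.169.145.160"}
# The three Google front-end IPs collapse into a single "google.com" feature,
# so this day reduces to two features instead of three raw addresses.
```

Any set-based similarity (Jaccard included) can then be applied to the feature sets instead of the raw IP sets.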

Has QUIT--Anony-Mousse
  • Thank you for the answer. I have updated the question with the objective of my study and I hope it gives better understanding of the question. – Maggie Sep 20 '12 at 16:00
  • The problem is that you need to have some idea of what kind of *patterns* you are looking for, and what *bias* you have in the data that you need to remove. Just throwing the IPs into an algorithm will likely yield only the obvious results - everybody goes to Google at least once a day. – Has QUIT--Anony-Mousse Sep 20 '12 at 17:18
  • Google is not part of our study. I have a list of 1000 IP addresses that are related to my study, and it does not contain Google, Facebook or any other common IP addresses. I am interested in the browsing pattern over the 1000 IP addresses that we have selected for our study. – Maggie Sep 20 '12 at 17:22