I want to implement a similarity function that can accurately identify similar log files. So far, I have been unable to find a suitable similarity metric for my problem.
I have log files generated from several PCs (around 300), where each file lists the IP addresses visited on each day. I want to measure similarity by comparing the visited IP addresses day by day. That is, I want to compare day1 of PC1 with day1 of PC2, and so on.
For example (assume each log file contains only 4 days of data; if nothing was visited on a particular day, that row is left blank):
PC1:
day1: 155.69.23.11, 155.34.45.5
day2: 165.34.5.67
day3: //blank - nothing visited
day4: 155.35.45.55
PC2:
day1: 155.34.45.5, 155.34.45.6
day2: 165.34.5.67
day3: 155.35.45.55
day4: //blank - nothing visited
My similarity score between PC1 and PC2 would be:
Total similarity = similarity(day1) + similarity(day2) + similarity(day3) + similarity(day4)
For this problem, I can use the Jaccard similarity index (treating each day as a set of IP addresses). But I am not sure whether that is a suitable metric, or whether there are any technical flaws (or conditions that need to be satisfied) in applying the Jaccard index to this problem.
For finding similar documents, I have seen people apply the Jaccard index to the whole document, but that is not what I am looking for. In my case, I want to apply the Jaccard index to each day separately and sum the results to get the final similarity value. Is this approach technically sound?
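To make the idea concrete, here is a minimal Python sketch of the per-day Jaccard sum using the example data above. The function names are my own placeholders, and scoring a day where both PCs visited nothing as 0 is an assumption I am unsure about (it could arguably be 1, or the day could be skipped).

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets.

    Assumption: if both sets are empty, return 0.0 instead of dividing by zero.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def total_similarity(pc_a, pc_b):
    """Sum of per-day Jaccard scores.

    pc_a and pc_b are lists of daily IP sets, aligned by day (index 0 = day1).
    """
    return sum(jaccard(day_a, day_b) for day_a, day_b in zip(pc_a, pc_b))


# The 4-day example from the question; a blank day is an empty set.
pc1 = [{"155.69.23.11", "155.34.45.5"}, {"165.34.5.67"}, set(), {"155.35.45.55"}]
pc2 = [{"155.34.45.5", "155.34.45.6"}, {"165.34.5.67"}, {"155.35.45.55"}, set()]

print(total_similarity(pc1, pc2))
```

Note that 155.35.45.55 is visited by both PCs but on different days, so it contributes nothing to the score here; that is the behaviour I want, since I care about day-wise matches.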
Thank you.
Update:
Objective of this study
- We have around 1000 IP addresses, and we want to monitor the browsing pattern (visits to these 1000 IP addresses), where each PC is always used by the same person. The study runs for 5 working days, and we log the visited IP addresses. An IP address visited on Monday carries the highest weight, while one visited on Friday carries the lowest; the weights for Tuesday, Wednesday and Thursday are normalized accordingly. This is why I am more interested in day-wise similarity, while my ultimate objective is to find the people who have similar browsing patterns (considering all 5 days). This study is kind of weird, but I am doing it for a project.
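With the day weights, the sum would become a weighted sum of per-day Jaccard scores. Below is a rough sketch of that. The linearly decreasing weights (5, 4, 3, 2, 1, normalized to sum to 1) are only an assumed scheme for illustration, the function names are hypothetical, and the two sample weeks are made-up data that reuse IPs from the example above.

```python
def jaccard(a, b):
    """Jaccard index of two sets; two empty sets score 0.0 (an assumption)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def day_weights(n_days):
    """Assumed scheme: linearly decreasing weights, normalized to sum to 1.

    The first day (Monday) gets the largest weight, the last day the smallest.
    """
    raw = [n_days - i for i in range(n_days)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_similarity(pc_a, pc_b, weights):
    """Weighted sum of per-day Jaccard scores for two day-aligned logs."""
    return sum(w * jaccard(a, b) for w, a, b in zip(weights, pc_a, pc_b))


# Made-up 5-day logs (Monday..Friday), for illustration only.
week_a = [{"155.69.23.11", "155.34.45.5"}, {"165.34.5.67"}, set(),
          {"155.35.45.55"}, {"155.34.45.6"}]
week_b = [{"155.34.45.5", "155.34.45.6"}, {"165.34.5.67"}, {"155.35.45.55"},
          set(), {"155.34.45.6"}]

weights = day_weights(5)
print(weights)
print(weighted_similarity(week_a, week_b, weights))
```

Because the weights sum to 1, the score stays in the range 0 to 1, which makes it easier to compare across all pairs of the 300 PCs.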