I have an IP network, which is basically a list of sequential IP addresses. From this list I want to cluster ranges of IP addresses into independent entities. I want to give each IP in the range a set of properties such as time to live (TTL), nameservers, and the domain names associated with it.

I then want to compute the distance between each IP address and its neighbors and cluster them based on the shortest distance.

My question is about the distance function. TTL is a number, so that should not be a problem. Nameservers and domain names are strings, however. How would you represent those as numbers in a vector?

Basically, if two IP addresses have the same nameserver or very similar domain names (the same second-level domain, 2LD), they should have a smaller distance. I've looked into something like word2vec but couldn't find a useful implementation.

mBo

1 Answer

I would try difflib from the Python standard library, like this:

from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0.0, 1.0]: 1.0 for identical strings, 0.0 when nothing matches.
    return SequenceMatcher(None, a, b).ratio()

Then you can call the function on each pair of names to get a similarity score and group them based on that.

similarity("server1","server1")
1.0

similarity("Server1","Server2")
0.8571428571428571

similarity("foo","bar")
0.0
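
To go from a similarity score to the distance the question asks for, one option is to use 1 - similarity for each string field and average it with a scaled TTL difference. The following is only a rough sketch under my own assumptions: the record keys (ttl, nameserver, domain), the equal weights, the max_ttl scale of one day and the 0.3 threshold are all hypothetical and would need tuning on real data. The grouping step simply walks the sequential list and starts a new cluster whenever the distance to the previous IP is too large.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def record_distance(r1, r2, weights=(1.0, 1.0, 1.0), max_ttl=86400):
    """Weighted average of per-field distances, each scaled to [0, 1].

    The field names, weights and max_ttl are illustrative assumptions.
    """
    w_ttl, w_ns, w_dom = weights
    # Numeric field: absolute difference, capped at 1.0 after scaling.
    d_ttl = min(abs(r1["ttl"] - r2["ttl"]) / max_ttl, 1.0)
    # String fields: turn similarity into a distance.
    d_ns = 1.0 - similarity(r1["nameserver"], r2["nameserver"])
    d_dom = 1.0 - similarity(r1["domain"], r2["domain"])
    return (w_ttl * d_ttl + w_ns * d_ns + w_dom * d_dom) / (w_ttl + w_ns + w_dom)

def cluster_neighbors(records, threshold=0.3):
    """Group a sequential list of records into runs of similar neighbors."""
    clusters = []
    for rec in records:
        # Compare with the last record of the current cluster only.
        if clusters and record_distance(clusters[-1][-1], rec) <= threshold:
            clusters[-1].append(rec)
        else:
            clusters.append([rec])
    return clusters

# Made-up example data, ordered by IP address.
records = [
    {"ip": "10.0.0.1", "ttl": 300, "nameserver": "ns1.example.com", "domain": "www.example.com"},
    {"ip": "10.0.0.2", "ttl": 300, "nameserver": "ns1.example.com", "domain": "mail.example.com"},
    {"ip": "10.0.0.3", "ttl": 3600, "nameserver": "dns.other.net", "domain": "shop.other.net"},
]

for cluster in cluster_neighbors(records):
    print([rec["ip"] for rec in cluster])
```

Note that a character-level ratio gives some credit to unrelated names that merely share a suffix such as ".com", so if the 2LD matters most you may want to compare only that part of each name, or give the domain field a larger weight.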
Seth Wahle