
Suppose I have been given a dataset consisting of:

  1. Source IP address (e.g. 10.200.32.150)
  2. Source port (e.g. 443)
  3. Destination IP address (e.g. 10.220.32.210)
  4. Destination port (e.g. 80)

(IP addresses or Port numbers can be repeated in the dataset)

Now I want to apply k-means clustering to the dataset. What is the best way to pre-process or normalize the data?

So far, I have split each IP address on "." so that each IP gives 4 integers. Together with the two port numbers, that makes 10 integers per record.

For the example data, splitting gives: 10 200 32 150 443 10 220 32 210 80

I then feed records of this form to my k-means algorithm and look at the resulting clusters. (There can be "M" such records in the input.)

I also tried normalizing the values (scaling them to between 0 and 1) and ran k-means on the normalized data as well.
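For readers who want to reproduce the preprocessing described above, here is a minimal sketch. The function names `split_ip` and `to_features` are hypothetical, and scaling by the fixed maxima 255 (octets) and 65535 (ports) is an assumption, since the exact scaling method is not stated here.

```python
def split_ip(ip):
    """Split a dotted-quad IPv4 string into four integer octets."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {ip!r}")
    return octets


def to_features(src_ip, src_port, dst_ip, dst_port, scale=False):
    """Build the 10-value record: 4 octets + port, twice.

    With scale=True each value is divided by its largest possible
    value (assumed: 255 for octets, 65535 for ports), giving [0, 1].
    """
    raw = split_ip(src_ip) + [src_port] + split_ip(dst_ip) + [dst_port]
    if not scale:
        return raw
    maxima = [255] * 4 + [65535] + [255] * 4 + [65535]
    return [value / top for value, top in zip(raw, maxima)]


row = to_features("10.200.32.150", 443, "10.220.32.210", 80)
scaled_row = to_features("10.200.32.150", 443, "10.220.32.210", 80, scale=True)
print(row)
print(scaled_row)
```

Scaling by fixed maxima rather than by the observed minimum and maximum keeps the transformation independent of the particular dataset; whether that is preferable here is a judgment call.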

Is this approach reasonable, or should I use a different preprocessing / normalization approach? My end goal is to detect outliers/anomalies using an unsupervised machine learning algorithm, since the dataset is unlabeled.

Thanks.

1 Answer


Your solution is straightforward, but it treats every octet as equally important. Consider IP1: 10.200.32.150, IP2: 10.200.32.151 and IP3: 11.200.32.151. IP1 and IP2 differ by one bit, and so do IP2 and IP3, but the first two are much closer in network terms because they share the first three octets. With unweighted features, k-means sees both pairs as equally far apart, which skews the clusters it learns.

What I suggest is weighting the octets by position: the first octet (10 in the example) gets the highest weight and the last (151) gets the lowest.
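As a rough illustration of the idea, the sketch below compares Euclidean distances between the three example addresses with and without positional weights. The weights `(8, 4, 2, 1)` and the helper names are assumptions chosen for illustration only; suitable values would depend on the network layout.

```python
import math

# Hypothetical positional weights: first octet heaviest, last lightest.
WEIGHTS = (8.0, 4.0, 2.0, 1.0)


def octets(ip):
    """Split a dotted-quad IPv4 string into four integer octets."""
    return [int(part) for part in ip.split(".")]


def weighted_octets(ip, weights=WEIGHTS):
    """Multiply each octet by the weight for its position."""
    return [w * o for w, o in zip(weights, octets(ip))]


ip1, ip2, ip3 = "10.200.32.150", "10.200.32.151", "11.200.32.151"

# Unweighted: both pairs differ by 1 in a single octet.
plain_12 = math.dist(octets(ip1), octets(ip2))
plain_23 = math.dist(octets(ip2), octets(ip3))

# Weighted: a change in the first octet now counts for more.
weighted_12 = math.dist(weighted_octets(ip1), weighted_octets(ip2))
weighted_23 = math.dist(weighted_octets(ip2), weighted_octets(ip3))

print(plain_12, plain_23)
print(weighted_12, weighted_23)
```

Without weights the two pairs are equally far apart; with these weights the IP2 to IP3 distance is 8 times the IP1 to IP2 distance, so k-means is pushed to keep addresses with a shared prefix together.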

Alex