Suppose I have been given a dataset consisting of:
- Source IP Address (e.g. 10.200.32.150)
- Source Port (e.g. 443)
- Destination IP Address (e.g. 10.220.32.210)
- Destination Port (e.g. 80)
(IP addresses and port numbers can be repeated in the dataset.)
Now, I want to apply K-Means clustering to this dataset. What is the best way to pre-process or normalize the data?
So far, I have split each IP address on "." to get 4 integers per IP. Together with the two port numbers, that gives 10 integers per record.
For the example data, I get the following after splitting: 10 200 32 150 443 10 220 32 210 80
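To make the splitting step concrete, here is a minimal sketch in Python. The function name `flow_to_features` and the argument order are only for illustration; they are not part of any library.

```python
def flow_to_features(src_ip, src_port, dst_ip, dst_port):
    """Turn one record into 10 integers: 4 octets plus the port, for each side."""
    src_octets = [int(part) for part in src_ip.split(".")]
    dst_octets = [int(part) for part in dst_ip.split(".")]
    return src_octets + [int(src_port)] + dst_octets + [int(dst_port)]


# The example record from above.
print(flow_to_features("10.200.32.150", 443, "10.220.32.210", 80))
# [10, 200, 32, 150, 443, 10, 220, 32, 210, 80]
```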
I then use this data as input to my K-Means algorithm to find the clusters. (There can be "M" such records in the input.)
I also normalized the values (scaled to between 0 and 1) and applied K-Means to the normalized data as well.
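The normalization I mean is roughly per-column min-max scaling. The sketch below assumes that the minimum and maximum are taken over the dataset itself (rather than fixed ranges such as 0 to 255 for octets), and that a column with a single value is mapped to 0.0. The helper name `min_max_scale` and the second row of sample data are made up for illustration.

```python
def min_max_scale(rows):
    """Scale each column of `rows` to [0, 1] using that column's own min and max."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


rows = [
    [10, 200, 32, 150, 443, 10, 220, 32, 210, 80],    # the example record
    [10, 200, 32, 151, 8080, 10, 220, 32, 211, 443],  # hypothetical second record
]
for scaled_row in min_max_scale(rows):
    print(scaled_row)
```

With this kind of scaling, a difference of 1 in the last octet and a difference of several thousand in the port number can both end up as the full 0 to 1 range, which is part of why I am unsure the approach is sound.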
Now I want to know whether my approach is okay, or whether I should follow a different preprocessing / normalization approach. My end goal is to detect outliers/anomalies using an unsupervised machine learning algorithm, since the dataset is unlabeled.
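One way I could imagine getting from clusters to outliers is to score each record by its distance to the nearest centroid. The sketch below is only an assumption about how that might look: it is a bare-bones Lloyd's-algorithm K-Means in plain Python (in practice a library implementation such as scikit-learn's `KMeans` would normally be used), it initializes the centroids with the first `k` points for reproducibility, and the 2-D sample points are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    # Simplification: the first k points are the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


def outlier_scores(points, centroids):
    """Distance from each point to its nearest centroid; larger means more unusual."""
    return [min(euclidean(p, c) for c in centroids) for p in points]


# Made-up, already-scaled 2-D points: two tight groups and one stray point (last).
points = [
    [0.10, 0.10], [0.90, 0.90], [0.12, 0.10], [0.10, 0.12],
    [0.88, 0.90], [0.90, 0.88], [0.45, 0.10],
]
centroids, labels = kmeans(points, k=2)
scores = outlier_scores(points, centroids)
print(labels)
print(scores.index(max(scores)))  # index of the most unusual point
```

Records whose score is above some threshold would then be flagged as anomalies. How to choose `k` and the threshold is left open here.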
Thanks.