
I have two simple datasets, each containing 10k IP addresses encoded as integers (so the data is discrete and can take any value between 1 and roughly 4 billion).
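
For concreteness, the integers are the usual 32-bit form of an IPv4 address; here is a quick sketch of the mapping using the standard-library ipaddress module (the 2130706433 example reappears further down):

import ipaddress

#integer form of an IPv4 address, e.g. 127.0.0.1 <-> 2130706433
print(int(ipaddress.IPv4Address("127.0.0.1")))  #2130706433
print(str(ipaddress.IPv4Address(2130706433)))   #127.0.0.1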

FYI: one dataset is a real one captured on a network, while the other is synthetic (generated via AI/ML). At the end of the day, I want to see how closely the synthetic dataset matches the real one, but I am pretty stuck at the beginning :D

Since the datasets' distribution is unknown and does not follow any well-known distribution, I want to calculate their PDFs (and later compare how similar they are).

My two datasets are termed p and q, both arrays of IP addresses (as integers).

I am not an expert in probability theory, so please, bear with me :)

Since I eventually want to compare the two probability distributions, I calculate their PDFs over a common sample space made up of all possible events (i.e., IP addresses) present in p and q. For this, I do the following in Python using numpy:

import numpy as np
import pandas as pd

q=np.array(real_data_1m.srcip) #real dataset
p=np.array(syn_data_1m.srcip)  #synthetic dataset

#get all possible discrete events from p and q
px=np.array(list(set(p))) #use set here to remove duplicates
qx=np.array(list(set(q))) #use set here to remove duplicates

#concatenate px and qx
mx=np.concatenate([px,qx])

mx.sort() #sort them, as they are anyway integers
mx=np.array(list(set(mx))) #remove duplicates by creating a set
#mx.reshape((len(mx),1)) #reshape from 1D to nD, where n=len(mx)

Then, to calculate the PDF, I created a simple function create_prob_dist():

def create_prob_dist(data: np.array, sample_space: np.array):
  #number of all events
  sum_of_events = sample_space.size
  #get the counts of each event via pandas.crosstab()
  data_counts = pd.crosstab(index='counts', columns=data)
  
  #create probabilities for each event
  prob_dist=dict()

  for i in sample_space:
    if i in data_counts:
      prob_dist[i]=(data_counts[i]['counts'])/sum_of_events
    else: 
      prob_dist[i]=0

  return prob_dist

This function does not return the PDF itself; at this stage, it returns a Python dictionary whose keys are the possible IP addresses represented in p and q combined, i.e., in mx, and whose values are the corresponding probabilities. Something like dict[2130706433]=0.05, meaning the probability of IP address 127.0.0.1 in the dataset is 0.05.
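
For reference, the same dictionary can also be built without the Python-level loop; here is a sketch using np.unique (the name create_prob_dist_vec is just illustrative, and it keeps the same normalization as above, dividing by the size of the sample space):

import numpy as np

def create_prob_dist_vec(data: np.ndarray, sample_space: np.ndarray) -> dict:
  #count each unique event in data in one vectorized call
  values, counts = np.unique(data, return_counts=True)
  count_map = dict(zip(values, counts))
  #mirrors create_prob_dist(): divide by the size of the sample space
  #(dividing by data.size instead would make the probabilities sum to 1)
  sum_of_events = sample_space.size
  return {event: count_map.get(event, 0)/sum_of_events for event in sample_space}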

After I have this dictionary of probabilities, I try to plot it, and this is where my problems start:

import matplotlib.pyplot as plt

#create true PDFs of p and q using mx
p_pdf=create_prob_dist(p, mx)
q_pdf=create_prob_dist(q, mx)

#get the probability values only from the dictionary
p_pdf=np.array(list(p_pdf.values())) #already sorted according to mx
q_pdf=np.array(list(q_pdf.values())) #already sorted according to mx

plt.figure()
plt.plot(mx, q_pdf, 'g', label="Q")
plt.plot(mx, p_pdf, 'r', label="P")
plt.legend(loc="upper right")
plt.show()

The PDF plot does not look good

I suspect the problem is something around the scales, but I could not get my head around it.

What am I doing wrong? Is it a wrong Python call or is the calculation of the PDF wrong?

Btw, the plain histograms of p and q look like this:

# plot a histogram of the two datasets to have a quick look at them
plt.hist(np.array(syn_data_1m.srcip), bins=100)
plt.hist(np.array(real_data_1m.srcip), bins=100, alpha=.5)
plt.show()

Histogram of the two datasets
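
As a side note, giving both hist() calls the same bin edges and density=True would put the two histograms on one comparable scale; a minimal sketch (not what I originally ran):

import numpy as np
import matplotlib.pyplot as plt

#shared bin edges over the combined range, normalized to a density
bins = np.histogram_bin_edges(np.concatenate([p, q]), bins=100)
plt.hist(p, bins=bins, density=True, label="P (synthetic)")
plt.hist(q, bins=bins, density=True, alpha=.5, label="Q (real)")
plt.legend(loc="upper right")
plt.show()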

cs.lev
  • Note that you do `mx.sort()`, but in the next line you use `mx=np.array(list(set(mx)))`. A set is an unordered collection, so your `mx` is no longer in sorted order. – slothrop Jul 25 '23 at 08:47
  • Also, `.reshape()` doesn't work in-place, it returns a new array, so the result of your reshape operation is never used. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.reshape.html#numpy.ndarray.reshape – slothrop Jul 25 '23 at 08:50
  • Mother of GOD :) Thanks for pointing this out! How could I not see this, haha? So, I changed the order of the sort() and set() parts in my code. And eventually, I did not need to reshape my `mx` variable as it threw an error. But now it works and I amend my question with your suggestions as a working solution – cs.lev Jul 25 '23 at 09:21
  • @slothrop, maybe you can check my related question here: https://stackoverflow.com/questions/76761210/kl-and-js-divergence-analysis-of-pdfs-of-numbers – cs.lev Jul 25 '23 at 09:26
  • Please don't edit your question to show solved in Title, update the question with the answer in the appropriate Answer section and mark it as answered, in this way we can all benefit from being able to search for valid answers. – itprorh66 Jul 25 '23 at 14:04
  • okay, did it. Sorry about the previous attempt – cs.lev Jul 25 '23 at 23:49
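
To make the two points from the comments concrete, here is a minimal sketch with toy values (not the original data):

import numpy as np

a = np.array([30, 10, 20])
a.sort()                    #a is now [10, 20, 30]
b = np.array(list(set(a)))  #set() is unordered, so b is not guaranteed to stay sorted
c = b.reshape((len(b), 1))  #reshape() returns a new array; b itself keeps its 1D shape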

1 Answer


Thanks to slothrop, the solution is as follows:

import numpy as np
import pandas as pd

q=np.array(real_data_1m.srcip) #real dataset
p=np.array(syn_data_1m.srcip)  #synthetic dataset

#get all possible discrete events from p and q
px=np.array(list(set(p))) #use set here to remove duplicates
qx=np.array(list(set(q))) #use set here to remove duplicates

#concatenate px and qx
mx=np.concatenate([px,qx])


mx=np.array(list(set(mx))) #remove duplicates by creating a set
# call sort() last, after deduplication, because set() does not preserve order
mx.sort() #sort them, as they are anyway integers
#mx.reshape((len(mx),1)) #not needed; reshape() returns a new array rather than modifying in place
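
As an aside, the deduplicate-and-sort steps can be collapsed into a single call, since np.union1d already returns the sorted union of the unique values in both arrays (a sketch, equivalent to the block above):

import numpy as np

#np.union1d deduplicates and sorts in one step, so no separate set()/sort() is needed
mx = np.union1d(p, q)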

The PDFs look good now (plot: the correct PDFs after repairing the code).

cs.lev