I have two simple datasets, each containing 10k IP addresses encoded as integers (so the data is discrete and can take any value between 1 and roughly 4 billion).
FYI: One dataset is a real dataset captured on a network, while the other is synthetic (generated via AI/ML). At the end of the day, I want to see how good the synthetic one is compared to the real one. But I am pretty stuck at the beginning :D
Since the datasets' distributions are unknown and do not follow any well-known distribution, I want to calculate their PDFs (and later compare how similar they are).
My two datasets are termed p and q, both arrays of IP addresses (as integers).
I am not an expert in probability theory, so please, bear with me :)
Since I eventually want to compare the two distributions, to calculate their PDFs I take all possible events (i.e., IP addresses) present in p and q. For this, I do the following in Python using numpy:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
q=np.array(real_data_1m.srcip) #real dataset
p=np.array(syn_data_1m.srcip) #synthetic dataset
#get all possible discrete events from p and q
px=np.array(list(set(p))) #use set here to remove duplicates
qx=np.array(list(set(q))) #use set here to remove duplicates
#concatenate px and qx
mx=np.concatenate([px,qx])
mx.sort() #sort them, as they are anyway integers
mx=np.array(list(set(mx))) #remove duplicates by creating a set
#mx.reshape((len(mx),1)) #reshape from 1D to nD, where n=len(mx)
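For comparison, the same kind of sample space can be built in a single step. The sketch below assumes np.union1d is acceptable here; it returns the sorted, de-duplicated values found in either input. The arrays p_demo and q_demo are made-up stand-ins, not values from the real datasets. One thing worth keeping in mind: a Python set does not guarantee any ordering, so converting to a set after sorting may not leave the array sorted.

```python
import numpy as np

# Hypothetical stand-ins for the synthetic (p) and real (q) IP arrays
p_demo = np.array([167772161, 2130706433, 167772161, 3232235777])
q_demo = np.array([3232235777, 167772162, 2130706433])

# np.union1d returns the sorted, de-duplicated union of both arrays
mx_demo = np.union1d(p_demo, q_demo)
print(mx_demo)
```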
Then, to calculate the PDF, I created a simple function, create_prob_dist(), to help with this.
def create_prob_dist(data: np.array, sample_space: np.array):
    #number of all events
    sum_of_events = sample_space.size
    #get the counts of each event via pandas.crosstab()
    data_counts = pd.crosstab(index='counts', columns=data)
    #create probabilities for each event
    prob_dist=dict()
    for i in sample_space:
        if i in data_counts:
            prob_dist[i]=(data_counts[i]['counts'])/sum_of_events
        else:
            prob_dist[i]=0
    return prob_dist
This function does not return the PDF itself. At this stage, it returns a Python dictionary whose keys are the possible IP addresses that appear in either p or q, i.e., in mx. The corresponding values are the probabilities of each of them. Something like: dict[2130706433]=0.05, meaning the probability of IP address 127.0.0.1 in the dataset is 0.05.
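As a point of comparison, here is a minimal sketch of an empirical probability mass function over a shared sample space. Unlike create_prob_dist(), which divides by sample_space.size (the number of distinct events), this version divides each count by the number of observations, so the values sum to 1. The name empirical_pmf and the small demo arrays are hypothetical, and the sketch assumes the sample space is sorted, unique, and covers every value in the data.

```python
import numpy as np

def empirical_pmf(data, sample_space):
    """Relative frequency of each value of sample_space within data.

    Assumes sample_space is sorted, unique, and contains every value
    that occurs in data.
    """
    values, counts = np.unique(data, return_counts=True)
    pmf = np.zeros(sample_space.size, dtype=float)
    # Locate each observed value inside the sorted sample space
    idx = np.searchsorted(sample_space, values)
    # Divide by the number of observations so the probabilities sum to 1
    pmf[idx] = counts / data.size
    return pmf

# Made-up data, not taken from the real or synthetic datasets
p_demo = np.array([10, 10, 20, 40])
q_demo = np.array([20, 30, 30, 30, 40])
mx_demo = np.union1d(p_demo, q_demo)  # sorted union of both supports

p_pmf_demo = empirical_pmf(p_demo, mx_demo)
q_pmf_demo = empirical_pmf(q_demo, mx_demo)
print(p_pmf_demo, p_pmf_demo.sum())
print(q_pmf_demo, q_pmf_demo.sum())
```

Because both arrays are aligned to the same mx_demo, they can be compared element by element or plotted against the same x values.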
Once I have this dictionary of probabilities, I try to plot it, and this is where my problems start:
#create true PDFs of p and q using mx
p_pdf=create_prob_dist(p, mx)
q_pdf=create_prob_dist(q, mx)
#get the probability values only from the dictionary
p_pdf=np.array(list(p_pdf.values())) #already sorted according to mx
q_pdf=np.array(list(q_pdf.values())) #already sorted according to mx
plt.figure()
plt.plot(mx, q_pdf, 'g', label="Q")
plt.plot(mx, p_pdf, 'r', label="P")
plt.legend(loc="upper right")
plt.show()
I suspect there is a problem with the scales or something similar, but I cannot get my head around it.
What am I doing wrong? Is it a wrong Python call or is the calculation of the PDF wrong?
Btw., the pure histogram of p and q looks like this:
# plot a histogram of the two datasets to have a quick look at them
plt.hist(np.array(syn_data_1m.srcip), bins=100)
plt.hist(np.array(real_data_1m.srcip),bins=100, alpha=.5)
plt.show()
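Each plt.hist call above derives its own 100 bins from the range of the data it is given, so the two histograms are not necessarily binned identically. A minimal sketch of one way to put both datasets on common bins, assuming np.histogram_bin_edges on the pooled data is a reasonable choice; p_demo and q_demo are randomly generated stand-ins, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-ins for the synthetic and real IP arrays
p_demo = rng.integers(1, 2**32, size=1000)
q_demo = rng.integers(1, 2**31, size=1000)

# One set of 100 bins computed from the pooled data
pooled = np.concatenate([p_demo, q_demo])
edges = np.histogram_bin_edges(pooled, bins=100)

# Count both datasets against the same edges
p_counts, _ = np.histogram(p_demo, bins=edges)
q_counts, _ = np.histogram(q_demo, bins=edges)

# Convert counts to fractions so each histogram sums to 1
p_frac = p_counts / p_counts.sum()
q_frac = q_counts / q_counts.sum()
```

The same edges array can also be passed as bins=edges to both plt.hist calls so that the bars line up.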