I'm trying to augment the number of records in my dataset in order to build a semi-supervised learning algorithm. The starting dataset has around 495 records with 9 features and 2 targets.
So far I've plotted the distribution of each feature to get a quick idea of its possible shape. Moreover, I've tried to fit the distributions with both parametric and non-parametric approaches.
The question is: how can I get numerical results for the error estimation?
EDIT: Thanks to juanpa.arrivillaga, I've added a summary of the question: how can I get numerical results (e.g. MSE) that measure the quality of the density estimations with respect to the real distributions? With the code below I can judge the fit quality only from the plots.
Below is my current function. I've already looked for this on Stack Overflow, but I only ended up with plots, and the functions I've used. Thanks in advance for any help!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, rayleigh, gaussian_kde


def learnTheGaussianEstimation(title, x_data, y_data, mean, variance, parametric_kde):
    """
    Function that, given the curve, tries to fit it with a Gaussian. The curve represents a
    single parameter's behaviour over all the monitored dates.
    :param title: string to be used as the plot title
    :param x_data: list of x-points
    :param y_data: list of y-points
    :param mean: mean of the values
    :param variance: variance of the values
    :param parametric_kde: boolean that selects parametric (True) or non-parametric (False) fitting
    :return: None
    """
    if parametric_kde:
        # Density given by the samples
        pd.DataFrame(y_data).plot(kind="density",
                                  figsize=(9, 9), title=title, label='data')
        x = np.linspace(min(y_data), max(y_data), len(x_data))
        # Gaussian fitting - parametric; mean/variance are passed as initial
        # guesses (fit() returns the estimated (loc, scale))
        loc, scale = norm.fit(y_data, loc=mean, scale=np.sqrt(variance))
        norm_fitted = norm.pdf(x, loc=loc, scale=scale)
        plt.plot(x, norm_fitted, 'r', label='gaussian')
        # Rayleigh fitting - parametric
        loc, scale = rayleigh.fit(y_data)
        rayleigh_fitted = rayleigh.pdf(x, loc=loc, scale=scale)
        plt.plot(x, rayleigh_fitted, 'g', label='rayleigh')
        plt.legend()
        plt.show()
    else:
        bandwidth = 0.25
        bins = 6
        x = np.linspace(min(y_data), max(y_data), 2000)
        kde = gaussian_kde(y_data)
        kde.covariance_factor = lambda: bandwidth
        kde._compute_covariance()
        plt.plot(x, kde(x), 'r')                   # estimated density
        plt.hist(y_data, bins=bins, density=True)  # histogram (normed= is deprecated)
        plt.show()
    return
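One way I've considered to turn the visual comparison into numbers (a sketch only, not yet part of the function above, and using synthetic stand-in data since my real features can't be posted): treat the KDE of the sample as the reference density, compute the MSE of each fitted pdf against it on a common grid, and additionally run a Kolmogorov-Smirnov test on each candidate distribution:

```python
import numpy as np
from scipy.stats import norm, rayleigh, gaussian_kde, kstest

rng = np.random.default_rng(0)
y_data = rng.normal(loc=5.0, scale=2.0, size=495)  # stand-in for one feature

# Non-parametric estimate used as the reference density
grid = np.linspace(min(y_data), max(y_data), 500)
reference = gaussian_kde(y_data)(grid)

results = {}
for name, dist in [("norm", norm), ("rayleigh", rayleigh)]:
    params = dist.fit(y_data)                 # estimated (loc, scale)
    fitted = dist.pdf(grid, *params)
    mse = np.mean((fitted - reference) ** 2)  # pointwise density error vs KDE
    ks_stat, p_value = kstest(y_data, name, args=params)
    results[name] = (mse, ks_stat, p_value)

for name, (mse, ks, p) in results.items():
    print(f"{name}: MSE={mse:.6f}  KS={ks:.4f}  p={p:.4f}")
```

With Gaussian data the Gaussian fit should score a lower MSE and a higher KS p-value than the Rayleigh fit. One caveat I'm aware of: the KS p-values are optimistic when the parameters are estimated from the same sample, so they are better used for ranking candidates than as strict hypothesis tests.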