
I'm trying to augment the number of records in my dataset in order to build a semi-supervised learning algorithm. The starting dataset has around 495 records with 9 features and 2 targets.

So far I've plotted the distribution of the features to get a quick idea of their possible distributions. I've also tried to fit them with both parametric and non-parametric approaches.

The question is: how can I get numerical results for the error estimation?

EDIT: Thanks to juanpa.arrivillaga, I added a summary of the question: how can I get numerical results (e.g. MSE) that give an idea of the fit quality of the density estimations with respect to the real ones? With the code below I can only judge the fit from plots.

Below is my current function. I've already searched for this on Stack Overflow, but I only ended up with plots from the functions I've used. Thanks in advance for any help!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, rayleigh, gaussian_kde

def learnTheGaussianEstimation(title, x_data, y_data, mean, variance, parametric_kde):
    """
    Try to fit the given curve with a Gaussian. The curve represents a single
    parameter's behaviour over all the monitored dates.
    :param title: string used as the plot title
    :param x_data: list of x-points
    :param y_data: list of y-points
    :param mean: mean of the values (currently unused)
    :param variance: variance of the values (currently unused)
    :param parametric_kde: boolean selecting parametric or non-parametric fit
    :return: None
    """
    if parametric_kde:
        # Density estimated directly from the samples
        pd.DataFrame(y_data).plot(kind="density",
                                  figsize=(9, 9), title=title, label='data')

        x = np.linspace(min(y_data), max(y_data), len(x_data))

        # Gaussian fitting - parametric; norm.fit returns (loc, scale)
        loc, scale = norm.fit(y_data)
        norm_fitted = norm.pdf(x, loc=loc, scale=scale)
        plt.plot(x, norm_fitted, 'r', label='gaussian')

        # Rayleigh fitting - parametric
        loc, scale = rayleigh.fit(y_data)
        rayleigh_fitted = rayleigh.pdf(x, loc=loc, scale=scale)
        plt.plot(x, rayleigh_fitted, 'g', label='rayleigh')

        plt.legend()
        plt.show()
    else:
        bandwidth = 0.25
        bins = 6

        x = np.linspace(min(y_data), max(y_data), 2000)

        # Non-parametric fit: Gaussian KDE with a fixed bandwidth
        kde = gaussian_kde(y_data)
        kde.covariance_factor = lambda: bandwidth
        kde._compute_covariance()

        plt.plot(x, kde(x), 'r')                    # estimated density
        plt.hist(y_data, bins=bins, density=True)   # histogram ('normed' is deprecated)

        plt.show()
    return
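One way to turn the visual comparison into a number, as a minimal sketch: evaluate the fitted pdf at the centers of a density-normalised histogram and compute the MSE against the empirical bin heights. The helper name `density_mse` and the choice of 20 bins are my own assumptions, not from the original code; the Gaussian parameters are the MLE fit (which is what `scipy.stats.norm.fit` returns for a normal distribution).

```python
import numpy as np

def density_mse(y_data, bins=20):
    """MSE between a fitted Gaussian pdf and the empirical (histogram) density.

    Note: `density_mse` and bins=20 are illustrative choices, not part of the
    original question's code.
    """
    y = np.asarray(y_data, dtype=float)
    # Empirical density: histogram normalised so the total area is 1
    hist, edges = np.histogram(y, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Gaussian MLE fit (same result as scipy.stats.norm.fit(y))
    mu, sigma = y.mean(), y.std()
    pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    # Mean squared error between the fitted pdf and the empirical density
    return float(np.mean((pdf - hist) ** 2))
```

A lower value means the fitted density tracks the data more closely; the same idea works for the Rayleigh fit (or the KDE) by swapping in the corresponding pdf, so the candidate distributions can be ranked numerically instead of by eye.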
  • I'm new to python so, for any advice even if not strictly related with the question, I would appreciate! – LNRD.CLL Aug 09 '16 at 19:09
  • Not related to the issue, but it is bad style to use camelCase for function definitions... [embrace_pep8](https://www.python.org/dev/peps/pep-0008/) – juanpa.arrivillaga Aug 09 '16 at 19:36
  • Also, your question is not really clear. It is usually better to give an example output and an example input. It's good that you posted your code, although, it'd be better to narrow it down some more. But what is the problem with your code? – juanpa.arrivillaga Aug 09 '16 at 19:37
  • Thanks to both of you! Anyway, the problem is that with the previous code I was only able to see that neither the Gaussian nor the Rayleigh densities fit the samples well. What I'm not able to do is translate these graphical results into numerical ones. In particular, I would like to get an estimate of the error between the real density and the estimated ones. – LNRD.CLL Aug 09 '16 at 19:54
  • That's what I looked for! http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.normaltest.html#scipy.stats.mstats.normaltest Thanks to everybody! – LNRD.CLL Aug 10 '16 at 23:06
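Following the `scipy.stats.mstats.normaltest` pointer in the comment above, a hedged sketch of goodness-of-fit as numbers rather than plots: the D'Agostino-Pearson normality test plus a Kolmogorov-Smirnov test against the fitted Gaussian. The sample array here is synthetic stand-in data, not the question's actual dataset; note that feeding parameters estimated from the same data into `kstest` makes its p-value optimistic (the Lilliefors caveat), so the KS statistic is best read as a distance measure for comparing candidate fits.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one feature column (the real data has ~495 records)
rng = np.random.default_rng(42)
samples = rng.normal(loc=5.0, scale=2.0, size=495)

# D'Agostino-Pearson test: small p-value -> reject normality
stat, p_value = stats.normaltest(samples)

# Kolmogorov-Smirnov distance between the sample and the *fitted* Gaussian;
# the p-value is biased because loc/scale were estimated from the same data
loc, scale = stats.norm.fit(samples)
ks_stat, ks_p = stats.kstest(samples, 'norm', args=(loc, scale))
```

The same `kstest` call with `'rayleigh'` and `rayleigh.fit` parameters gives a comparable number for the Rayleigh hypothesis, so the two candidate distributions can be ranked by their KS statistics.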
