I am new to data science and doing a project about Kernel Density Estimation, specifically about finding the best bandwidth and kernel function to use. I want to use Scikit Learn's KernelDensity which allows choosing the bandwidth and the kernel. I need to use some datasets to create a KDE model which will evaluate the probability density function and somehow evaluate its performance. The problem is, I don't have the actual PDF to compare to, so I'm not sure how to evaluate the performance. I guess I could split the data to train and test sets, but then I'm not sure how to evaluate the model's performance on the test set.

Can anyone suggest good methods to evaluate how well the estimated PDF matches the underlying distribution?

Also, I found Scikit Learn's GridSearchCV, which tries to find the best hyperparameters (such as bandwidth or kernel) using cross-validation. From what I've seen, it can be used on KernelDensity with a given dataset without providing the actual values to compare against. Something like this:

import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20
from sklearn.neighbors import KernelDensity

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20)  # 20-fold cross-validation
grid.fit(x[:, None])
print(grid.best_params_)

So my question is, how does this grid search know to evaluate the performance of KDE in order to select the best hyperparameters if it doesn't have the actual underlying PDF to compare to? Maybe if I understand what method is used for this, I could get an idea of how to evaluate my models.

Liel

1 Answer


Regarding how the performance is evaluated without knowing the true PDF: GridSearchCV splits the data into training and validation folds and uses cross-validation to estimate performance on unseen data. For KernelDensity, the scorer is the estimator's own score method, which returns the total log-likelihood of the held-out fold under the fitted density. GridSearchCV then selects the hyperparameters with the highest cross-validated log-likelihood, so no ground-truth PDF is needed.
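The same procedure can be done by hand, which makes the mechanism explicit. A minimal sketch (the sample, seed, and candidate bandwidths are arbitrary illustration values, not anything from the question):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KernelDensity

# Hypothetical 1-D sample standing in for your dataset
rng = np.random.default_rng(0)
x = rng.normal(size=200).reshape(-1, 1)

results = {}
for bw in [0.1, 0.5, 1.0]:
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5).split(x):
        kde = KernelDensity(bandwidth=bw).fit(x[train_idx])
        # score() returns the total log-likelihood of the held-out fold
        fold_scores.append(kde.score(x[test_idx]))
    results[bw] = np.mean(fold_scores)

print(results)
```

The bandwidth with the highest mean held-out log-likelihood is the one GridSearchCV would pick; an over-smoothed density (too large a bandwidth) and an under-smoothed one (too small) both assign lower likelihood to unseen points.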

You can use GridSearchCV to find the optimal bandwidth and kernel for your KDE model:

param_grid = {
    'bandwidth': np.linspace(1e-3, 1, 30),
    'kernel': ['gaussian', 'tophat', 'exponential']
}

grid = GridSearchCV(KernelDensity(), param_grid, cv=5)
grid.fit(data)
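After the search, the best model can be evaluated on data that was never seen during the grid search, which answers the question about scoring on a test set. A self-contained sketch (the synthetic `data` and split are illustration assumptions; kernels with bounded support such as 'tophat' are omitted here because they can produce non-finite scores):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KernelDensity

# Hypothetical data standing in for the `data` variable above
rng = np.random.default_rng(42)
data = rng.standard_normal((300, 1))

train, test = train_test_split(data, random_state=0)

param_grid = {
    'bandwidth': np.linspace(0.1, 1.0, 10),
    'kernel': ['gaussian', 'exponential'],
}
grid = GridSearchCV(KernelDensity(), param_grid, cv=5)
grid.fit(train)

# best_estimator_ is refit on all of `train`; score() gives the
# total log-likelihood of the held-out test set under that density
print(grid.best_params_)
print(grid.best_estimator_.score(test))
```

Comparing this held-out log-likelihood across models is a reasonable proxy for closeness to the underlying distribution when the true PDF is unknown.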
Abdulmajeed
  • so basically the method "fit" splits X_train into train + test sets? Does this require any assumption about the distribution of the data (such as that it comes from a normal distribution with unknown parameters), or can it be from any distribution? – Liel Feb 19 '23 at 17:37
  • The fit method in scikit-learn's KDE estimator does not split the data into train and test sets. Instead, it uses the entire dataset (X_train) to estimate the underlying probability density function (PDF). There is no assumption about the distribution of the data, except that the data points are independent and identically distributed (iid). The KDE method can work for any distribution, not just normal distributions; it's a non-parametric method. – Abdulmajeed Feb 19 '23 at 17:43
  • thanks! So the fit method uses the entire dataset to estimate the PDF, but in order to evaluate its performance on each set of hyperparameters it uses a test subset from the dataset, right? – Liel Feb 19 '23 at 18:46
  • also when I try to use the code you wrote, using both kernel and bandwidth in param_grid, I get some errors: "UserWarning: One or more of the test scores are non-finite", "RuntimeWarning: invalid value encountered in subtract", any idea what could cause that? – Liel Feb 19 '23 at 18:48
  • Yes, that's correct. The fit method estimates the PDF using the entire dataset, but to evaluate the performance of different hyperparameters, you need to hold out some data as a test set. – Abdulmajeed Feb 19 '23 at 19:13
  • "One or more of the test scores are non-finite" typically means that one or more test scores are invalid, possibly due to very low values. The warning "RuntimeWarning: invalid value encountered in subtract" is usually caused by an attempt to subtract an infinite or NaN value. Try specifying a different scoring metric for the GridSearchCV, such as 'neg_mean_squared_error' or 'neg_log_loss', to see if this helps to avoid the error. – Abdulmajeed Feb 19 '23 at 19:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251990/discussion-between-magedo-and-liel). – Abdulmajeed Feb 19 '23 at 21:12
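A likely cause of the non-finite scores discussed in the comments (this is an illustration, not confirmed as the asker's exact situation): kernels with bounded support, such as 'tophat', assign exactly zero density to any test point farther than the bandwidth from every training point, so the log-likelihood of that fold is -inf. A minimal reproduction:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# 'tophat' has bounded support: a point farther than `bandwidth` from
# every training point gets density 0, hence log-density of -inf,
# which makes the cross-validation fold score non-finite.
train = np.array([[0.0], [0.1], [0.2]])
kde = KernelDensity(kernel='tophat', bandwidth=0.05).fit(train)

log_dens = kde.score_samples(np.array([[1.0]]))  # far from all training points
print(log_dens)
```

With very small bandwidths in the grid (such as 1e-3 in the answer's param_grid) this happens easily, which would explain both warnings; restricting the grid to unbounded-support kernels ('gaussian', 'exponential') or larger bandwidths avoids it.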