I am new to data science and am doing a project about Kernel Density Estimation, specifically about finding the best bandwidth and kernel function to use.
I want to use Scikit Learn's KernelDensity, which allows choosing both the bandwidth and the kernel.
I need to use some datasets to build a KDE model that estimates the probability density function, and then somehow evaluate its performance.
The problem is that I don't have the actual PDF to compare to, so I'm not sure how to evaluate the performance. I guess I could split the data into train and test sets, but then I'm still not sure how to evaluate the model's performance on the test set.
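For example, I imagine I could do something like this (assuming that KernelDensity.score returns the total log-likelihood of the test samples and that a higher log-likelihood means a better estimate, which I'm not sure is the right way to measure it):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

# x is my 1-D array of samples
x_train, x_test = train_test_split(x, test_size=0.3, random_state=0)

# fit a KDE on the training set with some fixed kernel and bandwidth
kde = KernelDensity(kernel='gaussian', bandwidth=0.5)
kde.fit(x_train[:, None])

# score() returns the total log-likelihood of the test samples under the fitted density
print(kde.score(x_test[:, None]))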
Can anyone suggest good methods to evaluate how close the estimated PDF is to the underlying distribution?
Also, I found Scikit Learn's GridSearchCV, which tries to find the best hyperparameters (such as the bandwidth or the kernel) using cross-validation.
As far as I can tell, this can be used on KernelDensity with just a dataset, without giving it any actual values to compare to. Something like this:
import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is deprecated
from sklearn.neighbors import KernelDensity

# search 30 candidate bandwidths with 20-fold cross-validation
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20)
grid.fit(x[:, None])  # x is my 1-D array of samples
print(grid.best_params_)
So my question is: how does this grid search evaluate the performance of the KDE in order to select the best hyperparameters, if it doesn't have the actual underlying PDF to compare to? Maybe if I understood what scoring method is used here, I could get an idea of how to evaluate my own models.