I am modelling the distribution of repair costs with the KernelDensity estimator from the scikit-learn package in Python. I have fitted the density function to my observations, but when I draw random samples from this distribution, negative values occur. Since the observations are costs, which are always positive, the sampled values should be non-negative.
I have read that this can be achieved by transforming the data. These sources use a log transformation to truncate the distribution at 0 (Log-transform kernel density estimation of income distribution; Kernel Density Estimation for Random Variables with Bounded Support — The Transformation Trick). The problem is that I don't know how to combine this log transformation of my observations with the scikit-learn KernelDensity estimator.
The code for the KDE without transformation is as follows:
import numpy as np
from sklearn.neighbors import KernelDensity
import math

# DataFrame column with costs
x = costs
maxVal = x.max()
minVal = x.min()
upperBound = math.ceil(maxVal / 1000) * 1000
x_grid = np.linspace(0, upperBound, 1000)

# Create pdf with Kernel Density, fitted on the observations (not on the grid)
kde = KernelDensity(kernel='gaussian', bandwidth=612).fit(x.to_numpy()[:, np.newaxis])
log_pdf = kde.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)
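To illustrate the problem: sampling directly from this Gaussian-kernel fit can return values below 0, because each Gaussian kernel has unbounded support. A minimal self-contained sketch (the uniform synthetic costs are a made-up stand-in for my actual data):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical positive cost data, standing in for my observations
rng = np.random.default_rng(0)
costs = rng.uniform(100, 5000, size=500)

# Fit the KDE on the observations with the same kernel and bandwidth as above
kde = KernelDensity(kernel='gaussian', bandwidth=612).fit(costs[:, np.newaxis])

# Draw samples: the Gaussian kernel places probability mass below 0,
# so negative "costs" show up in the sample
samples = kde.sample(10000, random_state=0)
print((samples < 0).any())  # True: some draws are negative
```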
My code including transformation:
# Log transformation and creation of pdf
x_pseudo = x.apply(np.log)
kde_pseudo = KernelDensity(kernel='gaussian', bandwidth=612).fit(x_pseudo.to_numpy()[:, np.newaxis])
log_pdf_pseudo = kde_pseudo.score_samples(x_pseudo.to_numpy()[:, np.newaxis])
pdf_pseudo = np.exp(log_pdf_pseudo)
x_grid_log = np.linspace(minVal, maxVal, 1000)
density = np.zeros(len(x_grid_log))
for i in range(len(x_grid_log)):
    xx = x_grid_log[i]
    density[i] = pdf_pseudo[xx.apply(np.log)/xx]
output = list(x=x_grid_log, y=density)
This code is based on the example in the second source, which is written in R. I know the code is wrong, but I don't know how to fix it. Any help would be greatly appreciated!
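For reference, my current understanding of the transformation trick is: fit the KDE to log(costs); recover the density on the original scale by evaluating the log-space KDE at log(x) and dividing by x (the Jacobian of the change of variables, f_X(x) = f_Y(log x) / x); and draw non-negative samples by exponentiating samples from the log-space KDE. A self-contained sketch of that understanding (the synthetic data and the log-space bandwidth of 0.25 are placeholders I made up; my bandwidth of 612 only makes sense on the original scale):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical positive cost data, standing in for my observations
rng = np.random.default_rng(0)
costs = rng.uniform(100, 5000, size=500)

# Fit the KDE in log space (bandwidth here is a guess for the log scale)
log_costs = np.log(costs)
kde_log = KernelDensity(kernel='gaussian', bandwidth=0.25).fit(log_costs[:, np.newaxis])

# Density on the original scale: f_X(x) = f_Y(log x) / x
grid = np.linspace(1, 6000, 1000)
density = np.exp(kde_log.score_samples(np.log(grid)[:, np.newaxis])) / grid

# Sampling: exponentiate draws from the log-space KDE, so they are always positive
samples = np.exp(kde_log.sample(1000, random_state=0))
print(samples.min() > 0)  # True: no negative costs
```

Is this the right way to map the trick onto scikit-learn's KernelDensity, or am I misreading the sources?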