I am modelling the distribution of repair costs with the `KernelDensity` estimator from the scikit-learn package in Python. I have fitted the density to my observations, but when I draw a random sample from this distribution, negative values occur. Since the observations are costs, which are always positive, the sampled values should be non-negative as well.

I have read that this can be achieved by transforming the data. The sources I found use a log transformation to truncate the distribution at 0 (Log-transform kernel density estimation of income distribution, Kernel Density Estimation for Random Variables with Bounded Support — The Transformation Trick). The problem is that I don't know how to combine this log transformation of my observations with the scikit-learn `KernelDensity` class.
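For reference, the change of variables behind this trick can be stated briefly. The notation here is an assumption for illustration, not taken from either source: $X$ is the cost, $Y = \log X$, and $\hat f_Y$ is the kernel density estimate fitted to the log costs.

$$
\hat f_X(x) = \hat f_Y(\log x)\,\left|\frac{d}{dx}\log x\right| = \frac{\hat f_Y(\log x)}{x}, \qquad x > 0.
$$

The estimate on the original scale therefore puts no mass below 0, and a draw $y$ from $\hat f_Y$ maps to a positive draw $e^{y}$.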

The code for the KDE without transformation is as follows:

import numpy as np
from sklearn.neighbors import KernelDensity
import math

# Dataframe with costs
x = costs

maxVal = x.max()
minVal = x.min()
upperBound = math.ceil(maxVal/1000)*1000

x_grid = np.linspace(0, upperBound, 1000)

# Create pdf with Kernel Density: fit on the observations, then evaluate on the grid
kde = KernelDensity(kernel='gaussian', bandwidth=612).fit(np.asarray(x)[:, np.newaxis])
log_pdf = kde.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)

My code including transformation:

# Log transformation and creation of pdf

x_pseudo = x.apply(np.log)

kde_psuedo = KernelDensity(kernel='gaussian', bandwidth=612).fit(x_pseudo[:, np.newaxis])
log_pdf_pseudo = kde_psuedo.score_samples(x_pseudo[:, np.newaxis])
pdf_pseudo=np.exp(log_pdf_pseudo)

x_grid_log = np.linspace(minVal, maxVal, 1000)

density = np.zeros(len(x_grid_log))

for i in range(len(x_grid_log)):
    xx=x_grid_log[i]
    density[i]=pdf_pseudo[xx.apply(np.log)/xx]

output = list(x=x_grid_log, y=density)  

This code is based on the example in source 2, which is written in R. I know the code is wrong, but I don't know how to fix it. Any help would be greatly appreciated!

Machavity
  • Division by xx needs to happen outside the pdf_pseudo per equation 5 in the U Ottawa paper you referenced. – wrkyle Aug 04 '23 at 08:13
  • What exactly are you asking about? Are you trying to get the distribution of the original (non-log transformed) data under the constraint that costs can only be positive? Or would it be ok to have a distribution of the log costs? What exactly is going wrong in your code? Why do you think it's wrong? – Ingo Aug 09 '23 at 15:19
  • The bounty attracted a [ChatGPT](https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned) plagiariser. – Peter Mortensen Aug 21 '23 at 08:57
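
Following the comment about dividing outside the density lookup, a minimal sketch of the log-transform approach with `KernelDensity` might look like this. The synthetic `costs` array, the bandwidth of 0.3, and the names `kde_log`, `density`, and `samples` are assumptions for illustration only; the bandwidth in particular lives on the log scale and would need tuning for real data.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical stand-in for the real `costs` data: strictly positive, right-skewed.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=7.0, sigma=0.8, size=500)

# Fit the KDE on the log scale. The bandwidth is now in log units, so a value
# like 612 no longer applies; 0.3 is an arbitrary illustrative choice.
log_costs = np.log(costs)
kde_log = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(log_costs[:, np.newaxis])

# Density on the original scale: f_X(x) = f_Y(log x) / x.
# The division is applied to the density value, not used as an index.
x_grid = np.linspace(costs.min(), costs.max(), 1000)
log_density_y = kde_log.score_samples(np.log(x_grid)[:, np.newaxis])
density = np.exp(log_density_y) / x_grid

# Sampling: draw on the log scale and exponentiate, so every draw is positive.
samples = np.exp(kde_log.sample(n_samples=1000, random_state=0)).ravel()
```

Because the grid here starts at the smallest observation rather than 0, `np.log(x_grid)` is always defined; a grid that includes 0 would need that point handled separately.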

0 Answers