
I have some data that I'm assuming comes from a distribution, and I'm trying to estimate that distribution. Right now I'm using the KernSmooth package in R with a Gaussian kernel, and I'm using the package's dpik() function to automatically select my bandwidth. (I assume it uses AMISE or something of the sort; please let me know if there is a better automatic bandwidth selection process.)

What I'm interested in, though, is finding the x-value that corresponds to the highest peak in the distribution. This seems like a very simple thing, and something I put off as trivial early on, but to my frustration I'm hitting some snags. The bkde() function in KernSmooth returns a set of (x, y) coordinates that trace out the estimated density. I know I could simply do a linear search through the data to find the max y-value and grab the corresponding x-value, but since I'm writing a function that may be called frequently in an automated process, that feels inefficient, especially since bkde() returns a lot of values.
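For concreteness, here is a minimal sketch of my current setup (faithful$waiting stands in as a placeholder for my actual data):

library(KernSmooth)

x <- faithful$waiting           # placeholder for my actual data vector
h <- dpik(x)                    # automatic bandwidth selection
est <- bkde(x, bandwidth = h)   # est$x and est$y trace out the estimated density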

My other idea was to fit a curve to the output, take the derivative, and set it equal to zero, but that sounds like it may be inefficient as well. Maybe density() would be a better function to use here?
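In case it clarifies the curve-fitting idea, here is a rough sketch using base R's splinefun() and optimize() (est is from the sketch above; note that for a multimodal density optimize() may only find a local maximum):

f <- splinefun(est$x, est$y)                      # interpolate the KDE grid
peak <- optimize(f, range(est$x), maximum = TRUE)
peak$maximum                                      # x-value of the (local) peak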

Please let me know if there is an efficient way to do this. I actually plan to do a little bit of inference on the distributions I find, such as finding the cutoff points that chop off a certain percentage of the tail on either side (i.e., confidence intervals) and finding the expected value. My vague plan right now is to use some Monte Carlo techniques, or to attempt to draw from the distribution to estimate areas with bootstrapping techniques. Any help on methods for any of these would be greatly appreciated.
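For the drawing-from-the-distribution part, what I have in mind is something like a smoothed bootstrap: resample the data and add Gaussian noise with standard deviation equal to the kernel bandwidth (x and h are from the sketch above):

n <- 10000
draws <- sample(x, n, replace = TRUE) + rnorm(n, mean = 0, sd = h)
quantile(draws, c(0.025, 0.975))  # rough cutoffs for 2.5% in each tail
mean(draws)                       # rough expected value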

msabin
  • I would recommend Wand and Jones' 1995 book "Kernel Smoothing" (Chapman & Hall), the book on which the KernSmooth package is based, to get a more complete understanding of the bandwidth selection processes being implemented. – Nan Sep 19 '12 at 18:47

1 Answer


Using:

> require(KernSmooth)
Loading required package: KernSmooth
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> mod <- bkde(faithful$waiting)
> str(mod)
List of 2
 $ x: num [1:401] 22.7 23 23.2 23.4 23.7 ...
 $ y: num [1:401] 3.46e-08 1.17e-07 1.40e-07 1.68e-07 2.00e-07 ...

is this not efficient enough?

> which(mod$y == max(mod$y))
[1] 245
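To get the x-value at the peak rather than its index, index back into the grid (which.max() is an equivalent, slightly more direct way to find that index):

mod$x[which.max(mod$y)]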

density() does something similar, but it returns 512 values of the density evaluated at 512 regular intervals of x.

In both functions the number of points returned can be controlled; see argument gridsize in bkde() and n in density(). Of course, the precision of the approach depends on the density of points at which the KDE is evaluated, so you won't want to set this too low.
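For example (the grid sizes here are just illustrative):

mod2  <- bkde(faithful$waiting, gridsize = 1001L)  # finer grid from bkde()
dens2 <- density(faithful$waiting, n = 1024)       # finer grid from density()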

My gut tells me you may spend an awful lot more time thinking up and implementing a more efficient approach than you would spend just going with the above simple solution.

Gavin Simpson
  • Thanks. The linear search probably is fine... I was most likely just overthinking it and expecting some magic for it. Any tips for how I should attempt to sample from it... actually, I most likely don't even need to do that: for the expected value I could take the sum over x*y and multiply by dx, since y represents p(x) here and the 401 grid points are equally spaced dx apart. Decent enough? As for computing areas of probability... any built-in function, or do I have to do Riemann sums by hand or the sort? – msabin Jun 05 '12 at 20:34
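A minimal sketch of the grid-based approximations mentioned in this comment, assuming the equally spaced grid that bkde() returns (mod is from the answer above):

dx <- diff(mod$x[1:2])             # grid spacing
sum(mod$x * mod$y) * dx            # approximate expected value (Riemann sum)
cdf <- cumsum(mod$y) * dx          # approximate CDF on the grid
approx(cdf, mod$x, xout = c(0.025, 0.975))$y  # approximate tail cutoff points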