
I am trying to use Julia to estimate a continuous univariate distribution from N observed data points (stored as an array of Float64 values), and then to sample from this estimated distribution. I have no prior knowledge restricting the distribution to any particular family.

I was thinking of using the KernelDensity package to estimate the distribution, but I'm not sure how to sample from the resulting output.

Any help/tips would be much appreciated.

Chai

1 Answer


Without any restriction on the estimated distribution, a natural candidate is the empirical distribution function (see Wikipedia). For this distribution there are very nice theorems about its convergence to the actual distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality).
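As a side note, the empirical CDF itself can be built and evaluated with the ecdf function from the StatsBase package. A minimal sketch, assuming dataset holds the observations from the question:

using StatsBase

F = ecdf(dataset)   # callable empirical CDF built from the data
F(0.5)              # estimated P(X <= 0.5)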

With this choice, sampling is especially simple. If dataset is the array of observed samples, then dataset[rand(1:length(dataset), sample_size)] is a set of sample_size new draws from the empirical distribution. With the Distributions package, it can be written more readably, like so:

using Distributions

new_sample = sample(dataset, sample_size)   # draws with replacement by default

Finally, kernel density estimation is also good, but it requires choosing parameters (the kernel and its bandwidth), which expresses a preference for a certain family of distributions. Sampling from a kernel density estimate is surprisingly similar to sampling from the empirical distribution: 1. draw a sample from the empirical distribution; 2. perturb each point with a draw from the kernel function.

For example, if the kernel is a Normal distribution with bandwidth w, then the perturbed sample can be computed as:

new_sample = dataset[rand(1:length(dataset), sample_size)] + w*randn(sample_size)   # resample, then add Gaussian noise of scale w
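Putting both steps into one function, here is a self-contained sketch; the name sample_kde and the Silverman rule-of-thumb bandwidth are illustrative choices, not something prescribed by the KernelDensity package:

using Statistics   # for std

# Sample from a Gaussian-kernel density estimate of data.
# Bandwidth w follows Silverman's rule of thumb; any other
# bandwidth selector could be substituted.
function sample_kde(data::Vector{Float64}, sample_size::Int)
    w = 1.06 * std(data) * length(data)^(-1/5)
    idx = rand(1:length(data), sample_size)       # step 1: empirical draw
    return data[idx] .+ w .* randn(sample_size)   # step 2: kernel perturbation
end

new_sample = sample_kde(dataset, 1000)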
Dan Getz
  • "surprisingly similar" -> "directly equivalent" – Simon Byrne Oct 19 '16 at 19:44
  • Thank you, @DanGetz! Such a simple, neat solution! – Chai Oct 19 '16 at 21:43
  • @Chai I just thought I'd add that what DanGetz has proposed here in this very good answer is, for all practical purposes, an *iid* bootstrap. Note that if your data is not *iid* (e.g. if it is time-series data), then you may need to use a more robust sampling scheme, e.g. a dependent bootstrap (see the sketch after these comments), if you want the resampled data to exhibit the same statistical properties as the original data. – Colin T Bowers Oct 19 '16 at 23:03
  • Thanks, @ColinTBowers, good point! Fortunately, I think my data is IID, at least to a first approximation. – Chai Oct 20 '16 at 03:10
  • Re the case that @ColinTBowers mentioned: I did the quick-and-dirty version of this recently. I have time-series data non-uniformly sampled on the time axis, and needed the estimated distribution as a function of time. KernelDensity and normalizing the kde over the time axis worked pretty well. – ARM Oct 20 '16 at 17:35
  • BTW, @DanGetz, I think the code snippet new_sample = dataset[rand(1:length(dataset),sample_size)+w*randn(sample_size) is missing a closing square bracket before the noise term, and should be: new_sample = dataset[rand(1:length(dataset),sample_size)]+w*randn(sample_size) I can't make this edit myself (edit too short...), but it would be nice to have for anyone looking at this in the future. Are you able to make this edit? Thanks! – Chai Oct 20 '16 at 19:05
  • @Chai. Thanks. Fixed. – Dan Getz Oct 20 '16 at 19:18
  • @ARM If you ever need to do a dependent bootstrap I've written a package for it [here](https://github.com/colintbowers/DependentBootstrap.jl). I haven't added it to metadata yet, but am hoping to get it updated for v0.5 and added sometime in the next month or two. – Colin T Bowers Oct 21 '16 at 00:09
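For anyone with time-series data following the thread above, here is a minimal sketch of a moving-block bootstrap in plain Julia. It is not the DependentBootstrap.jl API, and the block length blocklen is an illustrative tuning parameter you would need to choose (or estimate) yourself:

# Moving-block bootstrap: resample overlapping blocks of consecutive
# observations so that short-range dependence is preserved.
# Assumes blocklen <= length(data).
function block_bootstrap(data::Vector{Float64}, sample_size::Int, blocklen::Int)
    starts = 1:(length(data) - blocklen + 1)   # valid block start indices
    out = Float64[]
    while length(out) < sample_size
        s = rand(starts)                       # pick a random block start
        append!(out, data[s:s+blocklen-1])
    end
    return out[1:sample_size]                  # trim to requested length
end

new_sample = block_bootstrap(dataset, 1000, 20)   # blocklen = 20 is illustrative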