0

This is a very simplified example but hopefully it gives everyone an idea what I'm talking about:

real.length = c(10,11,12,13,13,13,13,14,15,50)

random.length = vector() 
for (i in 1:length(real.length)){
    random.length[i] = sample(min(real.length):max(real.length),1)
}

(NB: I know I could just say random.length=sample(min:max,10) but I need the loop in my real code.)

I would like my random lengths to have a similar range to my real lengths, but also a similar distribution. I've tried rnorm but my real data doesn't have a normal distribution so I don't think that will work unless there's some options I've missed.

Is it possible to set the sample function's prob using my real data? So in this case give a higher weight/probability of a number between 10-15 and a lower weight/probability of a high number like 50.

EDIT: Using James's solution:

samples = length(real.length) 
d = density(real.length)
random.length = d$x[findInterval(runif(samples+100),cumsum(d$y)/sum(d$y))]
random.length = subset(random.length, random.length>0)
random.length = random.length[1:samples]
Jessica B
  • 321
  • 1
  • 4
  • 15
  • Are you expecting to be able to return values other than [10 11 12 13 14 15 50}? In other words - are you trying to find a smooth PDF that approximates your observed frequencies (although other values might be possible, e.g. 45), or are the observed frequencies the _only_ ones you want to sample? The latter is much easier than the former... – Floris Mar 19 '13 at 16:08
  • Unfortunately the former sounds more like it - so I'd be happy to generate a random.length that looked a bit like [10 10 11 14 14 15 15 16 48 49] – Jessica B Mar 19 '13 at 16:14

2 Answers2

0

You can create a density estimate and sample from that:

d <- density(real.length)
d$x[findInterval(runif(6),cumsum(d$y)/sum(d$y))]
[1] 13.066019 49.591973  9.636352 15.209561 11.951377 12.808794

Note that this assumes that your variable is continuous, so round as you see fit.

James
  • 65,548
  • 14
  • 155
  • 193
  • This is definitely along the right lines - thank you. However, when I try it out of my proper data min = -441268 and lots of my lengths are now negative, which obviously isn't right. The distribution looks much better though. – Jessica B Mar 19 '13 at 16:38
  • @JessicaB Just subset away the negative numbers (presumably they are erroneous) – James Mar 19 '13 at 16:53
0

While I can read R, I can't write it (I don't have it installed, so can't test). I will give you a simple example in Matlab that will do something like what you asked - I hope this inspires you:

obs = sort([10 11 12 13 13 13 13 14 15 50]); % have to make sure they are sorted...
uo = unique(obs);
hh = hist(obs, uo); % find frequencies of each value
cpdf = cumsum(obs);
cpdfn = cpdf / max(cpdf); % normalized cumulative pdf
r = rand(1, 100); % 100 random numbers from 0 to 1
rv = round(interp1(cpdfn, uo, r)); % randomly pick values in the cpdfn; find corresponding "observation"
hr = hist(rv, 1:50);
hrc = cumsum(hr);
figure
plot(uo, cpdfn);
hold all;
plot(1:50, hhc/max(hhc))

figure; hist(rv, 1:50);

This produces the following plots: enter image description here

enter image description here

Note - this works better as you have more observations; with the current example, because you have relatively few samples, the space between 15 and 50 gets sampled about 10% of the time.

Floris
  • 45,857
  • 6
  • 70
  • 122
  • Thank you! At first glance this looks like a cool solution, I've never used Matlab but this will be interesting to look over when I have the time - many thanks! – Jessica B Mar 20 '13 at 09:23