
I have a set of more than 2000 numbers gathered from measurement. I want to sample from this data set, about 10 values per test, while preserving the probability distribution both overall and, as far as possible, within each test. For example, in each test I want some small values, some mid-range values, and some large values, with a mean and variance close to those of the original distribution. Combining all the tests, I also want the mean and variance of all the samples to be close to those of the original distribution.

As my data set follows a long-tailed distribution, the amount of data at each quantile is not the same:

Fig 1. Probability density plot of the ~2k data elements.

I am using Java. Right now I draw a uniformly distributed random index into the data set and return the element at that position:

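// Needs: import java.util.Random;
private final Random r = new Random();  // created once, not on every call

public int getRandomData() {
    int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432, ...};
    // Uniform random index in [0, data.length)
    int randomInt = r.nextInt(data.length);
    return data[randomInt];
}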

I don't know whether this works as I want, because the data are stored in the order they were measured, which has a large amount of serial correlation.
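One way to check this empirically is to compare the mean and variance of many uniform draws against those of the full data set. The sketch below is only an illustration: the class name `SamplingCheck`, its helper methods, and the short stand-in array are assumptions, not part of the original code.

```java
import java.util.Random;

// Hypothetical helper for comparing sample statistics with the full data set.
class SamplingCheck {
    /** Arithmetic mean of the values. */
    static double mean(int[] values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population variance (divides by N, not N - 1). */
    static double variance(int[] values) {
        double m = mean(values);
        double sumSq = 0;
        for (int v : values) {
            sumSq += (v - m) * (v - m);
        }
        return sumSq / values.length;
    }

    /** Draws count elements uniformly at random, with replacement. */
    static int[] draw(int[] data, int count, Random random) {
        int[] sample = new int[count];
        for (int i = 0; i < count; i++) {
            sample[i] = data[random.nextInt(data.length)];
        }
        return sample;
    }

    public static void main(String[] args) {
        // Stand-in values only; the real data set has about 2000 elements.
        int[] data = {1231, 414, 222, 4211, 41, 203, 123, 432};
        // Pool 10,000 tests of 10 draws each.
        int[] pooled = draw(data, 10 * 10_000, new Random());
        System.out.printf("data:   mean=%.1f var=%.1f%n", mean(data), variance(data));
        System.out.printf("pooled: mean=%.1f var=%.1f%n", mean(pooled), variance(pooled));
    }
}
```

The pooled statistics should settle near those of the data set as the number of tests grows, whatever order the array is stored in. A single test of 10 draws can still differ noticeably.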

Ho1

2 Answers


It works as you want. The order of the data is irrelevant.

Rex D
  • You made me feel better. :-) But how can I prove this? And I am still worried that I won't get enough small and big values in each test. – Ho1 Sep 12 '15 at 14:18
  • @Ho1 The mean and standard deviation are unchanged by the order. If you want the same distribution, you need to sort the values and randomly select from different portions of the samples. Of course, this is not entirely random, as you are constraining the results you want. – Peter Lawrey Sep 12 '15 at 14:24
  • @PeterLawrey: What you say is FALSE. – Rex D Sep 12 '15 at 17:19
  • @RexD can you be more specific? – Peter Lawrey Sep 12 '15 at 17:23
  • @PeterLawrey Specifically, your first statement is TRUE. Your second and third statements are FALSE. The order does not matter since you are drawing randomly from whatever order they are in. If you rearranged the numbers on a die, you'd still get the same distribution of probabilities in any dice game. – Rex D Sep 12 '15 at 22:33
  • @RexD If you randomly select results, you have no control over the distribution of any individual selection. You could have values of 1,1,1,2,2,3,3,4,5,6 but randomly select 5,6, which has a higher mean, or 1,1, which has a lower mean. However, if you need to control the random-ish selection process, as the OP does, and for example randomly select one value from the first half and one from the second, the highest pair you could get is 2,6, or 4 on average. The lowest pair would be 1,3, or 2 on average. – Peter Lawrey Sep 13 '15 at 06:47
  • @PeterLawrey To clarify the need for control over sampling, I have asked a new question: http://stackoverflow.com/questions/32550059 In short, I partition the data into quantiles and sample from each quantile. – Ho1 Sep 13 '15 at 13:38
  • @Ho1 this is broadly what I had in mind. This means that your distribution for an individual sub-sample should be similar to the original distribution. If you don't do this, you need to take many sub-samples to get a similar distribution between them. – Peter Lawrey Sep 13 '15 at 15:48
  • If you roll a die 1000 times, then randomly select 100 rolls from the 1000, the result has the same distribution as if you had rolled the die 100 times. If you sort the 1000 rolls into 1, 2, 3, 4, 5, and 6, and sample a certain number from each group, it does not have the same distribution as 100 rolls. The question asks for two things that are incompatible: that the distribution be the same as the original, and that the sample always have relatively equal numbers of small, medium, and large numbers. The solution given meets the first criterion only. The second is a "stratified sample." – Rex D Sep 13 '15 at 18:00
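The stratified approach discussed in these comments (sort the values, split them into equal-count portions, draw one value from each) could be sketched roughly as follows. The class name `StratifiedSampler` and its `sample` method are hypothetical names chosen for illustration, and splitting by rank into `k` equal-count strata is an assumption about what "sample from each quantile" means here.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of a stratified sampler: one draw per equal-count stratum.
class StratifiedSampler {
    private final int[] sorted;   // private copy of the data, sorted ascending
    private final Random random;

    StratifiedSampler(int[] data, Random random) {
        this.sorted = data.clone();
        Arrays.sort(this.sorted);
        this.random = random;
    }

    /**
     * Splits the sorted data into k contiguous strata of (nearly) equal size
     * and returns one uniformly chosen value from each, smallest stratum first.
     */
    int[] sample(int k) {
        int n = sorted.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and " + n);
        }
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            int start = (int) ((long) i * n / k);        // inclusive
            int end = (int) ((long) (i + 1) * n / k);    // exclusive
            result[i] = sorted[start + random.nextInt(end - start)];
        }
        return result;
    }
}
```

With `k = 10`, each test gets exactly one value from every tenth of the sorted data, so small, middle, and large values always appear. As noted in the comments above, such a sample is constrained rather than fully random: the values within a test are no longer distributed like 10 independent draws.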

Random sampling preserves the probability distribution.

Raedwald
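A short derivation of why the storage order cannot matter for uniform index sampling may help here. The notation is introduced only for this sketch: the data are x_1, ..., x_N, the permutation π is the order in which they happen to be stored, I is a uniformly chosen position, and X is the value drawn.

```latex
\begin{align*}
P\bigl(x_{\pi(I)} = v\bigr)
  &= \frac{1}{N}\,\#\{\, i : x_{\pi(i)} = v \,\}
   = \frac{1}{N}\,\#\{\, j : x_j = v \,\}, \\
E[X] &= \frac{1}{N}\sum_{j=1}^{N} x_j = \bar{x}, \qquad
\operatorname{Var}(X) = \frac{1}{N}\sum_{j=1}^{N} (x_j - \bar{x})^2 = \sigma^2, \\
E[\bar{X}_n] &= \bar{x}, \qquad
\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}
\quad \text{for } n \text{ independent draws.}
\end{align*}
```

The first line does not depend on π, so serial correlation in the stored order has no effect on the drawn values. The last line shows the limit of this guarantee: with n = 10 the mean of a single test has standard deviation σ/√10, which can be large for long-tailed data.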