
Following my previous question, "Random sampling from a dataset, while preserving original probability distribution", I want to sample from a set of >2000 numbers gathered from measurement. I want to perform several tests (taking at most 10 samples in each test) while preserving the probability distribution over the whole testing process and, as far as possible, within each test. Now, instead of completely random sampling, I partition the data into 5 quantiles and, in each of the 10 tests, sample 2 data elements from each quantile, drawing uniformly at random from the array of data in each quantile.

The problem with completely random sampling was that, because the data distribution is long-tailed, I was getting almost the same values in each test. I want some small values, some mid-range values, and some large values in each test, so I sampled as described.


Fig 1. Density plot of ~2k elements of data.

This is the R code for calculating quantiles:

q=quantile(data, probs = seq(0, 1, by= 0.1))

And then I partition data into 5 quantiles (each one as an array) and sample from each partition. For example, I do this in Java:

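// Requires: import java.util.Random;
public int getRandomData(int quantile) {
    // Each row holds the values that fall into one quantile (toy data).
    int[][] data = {
        {1, 2, 3, 4, 5},
        {6, 7, 8, 9, 10},
        {11, 12, 13, 14, 15},
        {16, 17, 18, 19, 20},
        {21, 22, 23, 24, 25}
    };
    int length = data[quantile].length;
    Random r = new Random();
    // Pick an index uniformly at random within the chosen quantile.
    int randomInt = r.nextInt(length);
    return data[quantile][randomInt];
}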
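
The procedure described above (sort the data, split it into equal-count groups, draw a fixed number of values from each group) can be sketched more generally as follows. This is only an illustration under assumptions: the class name `StratifiedSampler`, the method `sample`, and its parameters are hypothetical, and the groups are formed by rank in the sorted array rather than from the output of R's `quantile`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class StratifiedSampler {
    /**
     * Sorts a copy of the data, splits it into `strata` contiguous groups of
     * near-equal size, and draws up to `perStratum` distinct elements from
     * each group. Results are returned group by group, smallest group first.
     */
    static List<Double> sample(double[] data, int strata, int perStratum, Random rng) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        List<Double> result = new ArrayList<>();
        for (int s = 0; s < strata; s++) {
            // Index range [from, to) of the s-th group in the sorted array.
            int from = (int) ((long) s * sorted.length / strata);
            int to = (int) ((long) (s + 1) * sorted.length / strata);
            List<Double> group = new ArrayList<>();
            for (int i = from; i < to; i++) {
                group.add(sorted[i]);
            }
            // Shuffling and taking a prefix samples without replacement.
            Collections.shuffle(group, rng);
            result.addAll(group.subList(0, Math.min(perStratum, group.size())));
        }
        return result;
    }
}
```

Calling `StratifiedSampler.sample(data, 5, 2, new Random())` would give the 10 values for one test. When the groups have equal size and the same number of draws, every data element has the same chance of being selected, so the sample mean is an unbiased estimate of the data mean; how well the variance is reproduced in a single test of 10 values is a separate matter.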

So, do the samples in each test, and in all tests overall, preserve the characteristics of the original distribution, for example the mean and variance? If not, how should I arrange the sampling to achieve this goal?

Ho1
  • Have a look at http://topepo.github.io/caret/splitting.html – Steven Beaupré Sep 13 '15 at 13:31
  • I had to deal with a similar problem recently. Yes, you will preserve the distribution. But, I don't think you will get what you want if you are sampling quantiles since the quantile containing the long tail will have a huge support (width, breadth). – Rorschach Sep 13 '15 at 14:50
  • quantile means to divide into quarters. I suspect you meant quintiles. https://en.wiktionary.org/wiki/quintile – Peter Lawrey Sep 13 '15 at 15:50
  • @bunk quartile comes from the same word as quarters. It specifically means to divide into four, not just any equal splitting. This is why you have a word like quintile, which means to divide into 5. – Peter Lawrey Sep 13 '15 at 16:13
  • @bunk it's the Romans' fault and they are all dead so you can't blame them. ;) – Peter Lawrey Sep 13 '15 at 16:16
  • @PeterLawrey xD, to make it all more confusing, the R function `quantile` doesn't even require splits to be equal – Rorschach Sep 13 '15 at 16:18

1 Answer


preserve the characteristics of the original distribution, for example mean and variance?

This will have a similar distribution. You might want to have an additional check to ensure it meets your requirement, and perhaps try again, but this will get you close.
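
A minimal sketch of such a check-and-retry loop, assuming the requirement is that the sample mean lies within a relative tolerance of the data mean. The names `CheckedSampling`, `drawUntilClose`, `relTol`, and `maxTries` are hypothetical, and the `sampler` argument stands for whatever sampling procedure is in use. Note that discarding samples that fail the check makes the accepted samples less random.

```java
import java.util.function.Supplier;

final class CheckedSampling {
    /** Arithmetic mean of a non-empty array. */
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /**
     * Draws samples until one has a mean within relTol (relative) of
     * targetMean, or until maxTries draws have been made.
     * Returns the accepted sample, or null if none was accepted.
     */
    static double[] drawUntilClose(Supplier<double[]> sampler, double targetMean,
                                   double relTol, int maxTries) {
        for (int t = 0; t < maxTries; t++) {
            double[] s = sampler.get();
            if (Math.abs(mean(s) - targetMean) <= relTol * Math.abs(targetMean)) {
                return s;
            }
        }
        return null;
    }
}
```

The same pattern would work for the variance or any other statistic; only the acceptance condition changes.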

If not, how to arrange sampling to achieve this goal?

Unless all of the data is duplicated (i.e. every value appears twice), you need to take exactly one of every value. This is the only way to get exactly the same distribution.

Peter Lawrey
  • Thanks. I can consume all data elements in the sampling process, but I am concerned about the individual tests. If I draw a fixed number of samples (for example exactly 10) from different quantiles, I can make sure that the result is ok. But if I draw at most 10 samples (<=10) each time, there will be problems. I think if I change quantiles randomly, this will be ok overall. Btw, should I have asked this question on **stats.SE**? – Ho1 Sep 13 '15 at 16:42
  • @Ho1 You say there will be problems; what problems did you have in mind? Changing the quintiles randomly is the same as simple random selection. – Peter Lawrey Sep 13 '15 at 16:58
  • Ok, let me explain. If the test is only done 3 times, and I have 10 quantiles, there will be a mismatch between the characteristics of the samples and the original distribution. Won't there? – Ho1 Sep 13 '15 at 17:03
  • @Ho1 if you select randomly, there will always be a mismatch. If it was exactly the same it wouldn't be random (unless you happen to randomly select every value). The question is how closely it should match while still being as random as possible. – Peter Lawrey Sep 13 '15 at 17:05
  • Ok. I'll choose from random quantiles, so it may conform to the original distribution. Although one should think about the distribution of the number of samples drawn in a test. Do you think it would be easy to prove that the latter distribution is not important? – Ho1 Sep 13 '15 at 17:08
  • @Ho1 Only you can say whether the difference is important or not. These are your requirements, are they not? – Peter Lawrey Sep 13 '15 at 17:09
  • It is important for me. But it is also hard for me to keep the number of samples constant, because it is not a simulation. So, I'm trying to prove that even if the number of samples taken in each test is not constant, it is ok. – Ho1 Sep 13 '15 at 17:24
  • @Ho1 it's ok for me. Define "ok"? – Peter Lawrey Sep 13 '15 at 17:25
  • In my view, "ok" means they will have nearly the same probability distribution, with a difference smaller than a desired value, once the total number of tests exceeds some threshold. Remember https://en.wikipedia.org/wiki/Central_limit_theorem? – Ho1 Sep 13 '15 at 17:31