I have two data sets, a and b, with distinct distributions. Set b has more data points and more variation. I need to subsample set b so that the subsample best approximates the distribution of set a. Although set b has a substantially larger mean, set a also contains some large values, and these need to remain in set a.

I could trim the lower and upper ranges of set b to get a similar mean, but then the standard deviations are not comparable. The next thing I considered was repeated random subsampling: draw random subsamples of set b until I find one whose distribution is not significantly different from that of set a (as assessed by ks.test in R's stats package). Is there a package or function that does this robustly, or is there a more appropriate approach?

An example dataset:

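# a: 100 draws near 0 plus three large values that must stay in a
a = c(rnorm(n = 100, mean = 0, sd = sqrt(0.1)), 4, 7, 10)
# b: larger, more variable set centered at 3
b = rnorm(n = 1000, mean = 3, sd = sqrt(4))
# drop values of b below the minimum of a
b = b[b >= min(a)]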

range(a)
[1] -0.6215744 10.0000000
range(b)
[1] -0.5520407 8.7371966

sd(a)
[1] 1.287062
sd(b)
[1] 1.834108
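
To make the repeated-subsampling idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration under some assumptions: the seed, the number of tries (n_tries), the subsample size (k, defaulting to length(a)), and the helper name best_subsample() are all made up for this example, and the data are regenerated so the block runs on its own. Instead of stopping at the first non-significant test, it keeps the subsample with the smallest two-sample KS statistic D.

```r
# Sketch of the repeated random-subsampling idea. The seed, n_tries, k, and
# best_subsample() are illustrative assumptions, not an established method.
set.seed(1)

a <- c(rnorm(n = 100, mean = 0, sd = sqrt(0.1)), 4, 7, 10)
b <- rnorm(n = 1000, mean = 3, sd = sqrt(4))
b <- b[b >= min(a)]

# Draw n_tries random subsamples of b (size k, without replacement) and
# keep the one with the smallest two-sample KS statistic D against a.
best_subsample <- function(a, b, k = length(a), n_tries = 1000) {
  stopifnot(k <= length(b))
  best_idx <- NULL
  best_d <- Inf
  for (i in seq_len(n_tries)) {
    idx <- sample.int(length(b), size = k)
    d <- unname(ks.test(a, b[idx])$statistic)
    if (d < best_d) {
      best_d <- d
      best_idx <- idx
    }
  }
  list(index = best_idx, values = b[best_idx], D = best_d)
}

res <- best_subsample(a, b)

# Compare the chosen subsample with a
print(ks.test(a, res$values))
print(c(mean = mean(res$values), sd = sd(res$values)))
```

I chose the smallest D rather than the first p-value above a threshold because, with enough tries, some subsample may pass the test by chance. My concern with this approach is that uniform random draws from b will tend to look like b itself, so the best D found this way may still be large for data like the example above.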

Jay
Have you considered using something like rank()? Or this: `b[rank(b) %in% rank(a)]` – Bryan Wammack Aug 11 '20 at 20:03

I'm not exactly sure what you mean by "subsample set b in a way that best approximates the distribution of set a". That doesn't seem precise enough to turn into code. If you need some statistical advice, a better place to ask for a general strategy would probably be [stats.se]. Why are you trying to force data with two different distributions to look like the same distribution? – MrFlick Aug 11 '20 at 20:15

0 Answers