Is there an R package or function to subsample a dataset to approximate a certain distribution?

Question

I have two sets of data (a, b) with distinct distributions. Set b has more data points and more variation. I need to subsample set b so that the subsample best approximates the distribution of set a. Although set b has a substantially larger mean, some values in set a are also large, and those need to remain in set a.

I could just trim the lower and upper ranges of set b to get a similar mean, but then the standard deviations would not be comparable. The next thing I considered was repeatedly drawing random subsamples of set b until I find one whose distribution is not significantly different from that of set a (as assessed by `ks.test` in R's stats package). Is there a package or function that can do this robustly (or perhaps something more appropriate)?

An example dataset:

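# Set a: a tight cluster around 0 plus three large values
a = c(rnorm(n = 100, mean = 0, sd = sqrt(.1)), 4, 7, 10)
# Set b: more points, larger mean and variance
b = rnorm(n = 1000, mean = 3, sd = sqrt(4))
# Drop values of b that fall below the minimum of a
b = b[b >= min(a)]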

range(a)
[1] -0.6215744 10.0000000
range(b)
[1] -0.5520407 8.7371966

sd(a)
[1] 1.287062
sd(b)
[1] 1.834108
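
The repeated-subsampling idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `best_ks_subsample`, the subsample size `k`, the number of tries `n_try`, and the fixed seed are all made up for the example, and the loop keeps the draw with the smallest KS statistic rather than stopping at the first non-significant one.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Example data, generated the same way as in the question
a <- c(rnorm(n = 100, mean = 0, sd = sqrt(0.1)), 4, 7, 10)
b <- rnorm(n = 1000, mean = 3, sd = sqrt(4))
b <- b[b >= min(a)]

# Hypothetical helper: draw `n_try` uniform random subsamples of `b`
# (each of size `k`) and keep the one closest to `a` by KS statistic.
best_ks_subsample <- function(a, b, k = length(a), n_try = 500) {
  best_idx <- NULL
  best_stat <- Inf
  for (i in seq_len(n_try)) {
    idx <- sample.int(length(b), size = k)
    stat <- unname(ks.test(b[idx], a)$statistic)
    if (stat < best_stat) {
      best_stat <- stat
      best_idx <- idx
    }
  }
  list(index = best_idx, statistic = best_stat)
}

res <- best_ks_subsample(a, b, k = 50, n_try = 500)
sub_b <- b[res$index]
ks.test(sub_b, a)
```

Two caveats apply. Uniform random subsamples tend to resemble b itself, so when a and b differ as much as they do here, this search may never get close to a. And because the subsample is selected to minimize the KS statistic, the p-value printed at the end should not be read as a valid test result.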

Have you considered using something like rank()? or this: `b[rank(b) %in% rank(a)]` — Bryan Wammack, Aug 11 '20 at 20:03
I'm not exactly sure what you mean by "subsample set b in a way that best approximates the distribution of set a". That doesn't seem precise enough to turn into code. If you need some statistical advice, a better place to ask for a general strategy would probably be [stats.se]. Why are you trying to force data with two different distributions to look like the same distribution? — MrFlick, Aug 11 '20 at 20:15
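
One possible way to make "approximates the distribution of set a" precise enough to code is to weight each point of b by how over- or under-represented it is relative to a, and then sample with those weights. The sketch below does this with a ratio of kernel density estimates from `density()`. It is a swapped-in technique, not something proposed in the question, and the subsample size `k`, the grid size, the default bandwidths, and the seed are all assumptions.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Example data, generated the same way as in the question
a <- c(rnorm(n = 100, mean = 0, sd = sqrt(0.1)), 4, 7, 10)
b <- rnorm(n = 1000, mean = 3, sd = sqrt(4))
b <- b[b >= min(a)]

# Kernel density estimates of both sets on a shared grid
rng <- range(c(a, b))
dens_a <- density(a, from = rng[1], to = rng[2], n = 1024)
dens_b <- density(b, from = rng[1], to = rng[2], n = 1024)

# Evaluate both estimated densities at every point of b
fa <- approx(dens_a$x, dens_a$y, xout = b, rule = 2)$y
fb <- approx(dens_b$x, dens_b$y, xout = b, rule = 2)$y

# Weight = estimated density under a / estimated density under b
w <- fa / pmax(fb, .Machine$double.eps)

# Draw k points of b without replacement, favoring high-weight points
k <- 50  # assumption: must not exceed the number of positive weights
idx <- sample.int(length(b), size = k, replace = FALSE, prob = w)
sub_b_w <- b[idx]

ks.test(sub_b_w, a)
```

The result depends heavily on the bandwidths, and weighted sampling without replacement is only approximately proportional to the weights, so it is worth checking the subsample against a (for example with a histogram or `ks.test`) before relying on it.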
