I have two sets of data (a, b) with distinct distributions. Set b has more data points and more variation. I need to subsample set b so that the subsample best approximates the distribution of set a. Although set b has a substantially larger mean, some values in set a are also large and need to remain in set a.
I could just trim the lower and upper ranges of set b to get a similar mean, but then the standard deviations are not comparable. The next thing I considered was repeatedly drawing random subsamples of set b until I find one whose distribution is not significantly different from that of set a (as assessed by ks.test in R's stats package). Is there a package or function that can do this robustly (or perhaps a more appropriate approach)?
An example dataset:
a = c(rnorm(n = 100, mean = 0, sd = sqrt(.1)), 4, 7, 10)
b = rnorm(n = 1000, mean = 3, sd = sqrt(4))
b = b[which(b >= min(a))]
range(a)
[1] -0.6215744 10.0000000
range(b)
[1] -0.5520407 8.7371966
sd(a)
[1] 1.287062
sd(b)
[1] 1.834108
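For reference, here is a minimal sketch of the random-subsampling idea described above. It is a variant, not exactly the stop-at-first-success loop: instead of stopping at the first draw that ks.test does not reject, it keeps the draw with the smallest KS statistic, since with distributions this different a purely random subsample may rarely come close. The helper name subsample_by_ks, the seed, the subsample size (length(a)), and the number of draws are all assumptions made for illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Example data in the same shape as in the question
a <- c(rnorm(n = 100, mean = 0, sd = sqrt(0.1)), 4, 7, 10)
b <- rnorm(n = 1000, mean = 3, sd = sqrt(4))
b <- b[b >= min(a)]

# Hypothetical helper: draw `n_draws` random subsamples of `b` (without
# replacement) and keep the one with the smallest two-sample KS statistic
# against `a`.
subsample_by_ks <- function(a, b, size = length(a), n_draws = 500) {
  stopifnot(size <= length(b))
  best_idx <- NULL
  best_stat <- Inf
  for (i in seq_len(n_draws)) {
    idx <- sample.int(length(b), size)
    stat <- suppressWarnings(ks.test(a, b[idx])$statistic)
    if (stat < best_stat) {
      best_stat <- stat
      best_idx <- idx
    }
  }
  list(index = best_idx, values = b[best_idx], statistic = unname(best_stat))
}

res <- subsample_by_ks(a, b)
res$statistic                    # KS distance of the best draw
ks.test(a, res$values)$p.value   # descriptive only, see note below
```

Because the subsample is selected to minimise the KS statistic, the p-value from the final ks.test call is no longer a valid test result; it is only a rough indicator of how close the two samples are.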