How to repeat downSample in R?

Question

I'm not sure what approach to take to this problem (I'm new to both R and statistical analysis). I have a highly imbalanced class in my data set:


  PCL_Sum     n
*     <dbl> <int>
1         0   300
2         1    25

I realise that I could use `downSample` on this data to get a balanced set with 25 randomly selected 0s and my existing 25 1s. However, I would like to repeat this process 12 times so that all of my '0' data is used (300 / 25 = 12), leaving me with 12 sets of data.

I realise that I could do this 12 times by hand, but I'd like to automate the process. Could someone give me a general idea of how they would approach the problem? I realise that there is likely an answer out there but I'm having trouble understanding the documentation I've found. Thank you!

My general idea would be: (1) split into one tibble of `PCL_Sum==0` and another for `PCL_Sum==1`. (2) Re-order the rows of the `PCL_Sum==0` tibble using `sample()`. (3) Get the first data set by taking rows `1:25`, the second by taking rows `26:50`, etc., and combine each with the 25 rows of `PCL_Sum==1`. Does that help? — rcst, Apr 03 '21 at 17:17

Carey Caginalp · Accepted Answer · 2021-04-04T14:30:24.060

Is there something undesirable about downSample? It seems like you could just apply it 12 times and go from there for your samples. Here's an example.

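library(caret)   # provides downSample() and the oil data set
data(oil)        # loads fattyAcids (predictors) and oilType (class labels)
table(oilType)   # class counts before downsampling
downSample(fattyAcids, oilType)   # one balanced random sample
# Repeat the random downsampling 12 times; each list element is one balanced set
mysamples <- lapply(1:12, function(x) downSample(fattyAcids, oilType))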

Then you can call mysamples[[1]] for the first set and so on.

> mysamples[[1]]
   Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
1      11.5     5.1  27.8     54.5       0.2        0.4        0.1     A
2      11.4     5.8  34.5     48.3       1.0        0.1        0.1     A
3       6.1     4.1  24.0     64.3       0.1        0.3        0.1     B
4       6.1     4.1  26.7     61.0       0.6        0.3        0.2     B
5       9.7     3.4  59.3     20.5       0.1        1.5        1.2     C
6       9.6     3.3  57.7     20.7       0.2        1.5        1.8     C
7       9.3     2.8  65.0     17.0       3.9        0.5        0.7     D
8      10.9     2.7  76.7      7.9       0.8        0.1        0.1     D
9      10.9     3.6  26.0     52.6       5.5        0.4        0.2     E
10     10.5     4.2  24.4     52.1       7.5        0.4        0.1     E
11      5.4     2.0  53.2     28.9       7.3        0.6        1.3     F
12      5.1     2.3  55.9     27.4       6.8        0.5        0.5     F
13     10.0     2.3  36.9     47.1       2.2        0.5        0.5     G
14     10.7     1.8  30.2     55.5       0.9        0.5        0.3     G
> mysamples[[2]]
   Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
1      13.0     6.2  25.8     55.0       0.8        0.1        0.1     A
2      13.1     5.7  31.7     49.5       0.6        0.1        0.1     A
3       5.6     4.2  25.7     58.9       1.7        2.8        0.9     B
4       6.1     4.1  24.0     64.3       0.1        0.3        0.1     B
5       9.6     3.3  57.7     20.7       0.2        1.5        1.8     C
6      10.0     3.3  60.0     21.3       0.2        1.5        1.3     C
7       9.3     2.8  65.0     17.0       3.9        0.5        0.7     D
8      14.9     2.6  68.2     12.8       0.6        0.4        0.3     D
9      10.9     3.6  26.0     52.6       5.5        0.4        0.2     E
10      9.7     3.9  25.1     54.2       5.9        0.1        0.1     E
11      5.1     2.3  55.9     27.4       6.8        0.5        0.5     F
12      5.5     1.7  59.0     21.3       9.3        0.6        1.5     F
13     10.7     1.8  30.2     55.5       0.9        0.5        0.3     G
14     10.0     2.3  36.9     47.1       2.2        0.5        0.5     G

Edit for unique samples:

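# Rows 1:25 are the minority class "A"; rows 26:325 are the majority class "B"
df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                 value = 1:325)
# Sample x keeps all minority rows plus majority rows (x*25 + 1):((x + 1)*25)
mysamples <- lapply(1:12, function(x){df[c(1:25, (x * 25 + 1) : ((x+1) * 25)), ]})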

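This will take the first 25 rows of the majority class in sample 1, the next 25 in sample 2, and so on up to the 12th sample, so each majority observation is used exactly once. Every sample also keeps all 25 minority rows. The slices follow the existing row order, so shuffle the majority rows first if you want the assignment to be random.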
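If the majority rows should be assigned to the 12 sets at random rather than in row order, one possible variation is to shuffle the majority row indices and then split them into chunks. This is only a sketch in base R on the same toy data frame; the seed value and the names `minority_rows`, `majority_rows`, `shuffled` and `chunks` are made up for illustration.

```r
set.seed(123)  # arbitrary seed, only so the shuffle is repeatable

# Toy data with the same layout as above: 25 minority rows, then 300 majority rows
df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                 value = 1:325)

minority_rows <- which(df$class == "A")
majority_rows <- which(df$class == "B")

# Shuffle the majority row indices, then cut them into 12 disjoint chunks of 25
shuffled <- sample(majority_rows)
chunks   <- split(shuffled, rep(1:12, each = 25))

# Each set = all minority rows + one chunk of majority rows
mysamples <- lapply(chunks, function(idx) df[c(minority_rows, idx), ])

# Sanity checks: 12 sets, 50 rows each, every majority row used exactly once
length(mysamples)
sapply(mysamples, nrow)
length(unique(unlist(chunks)))
```

As before, `mysamples[[1]]` gives the first set. With real data you would replace the toy `df` and the `class == "A"` / `class == "B"` conditions with your own data frame and class column.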

Thank you so much @Carey Caginalp! Just for clarification: what exactly does the 1:12 mean? I'd like to make sure that each group of 25 from the majority class is unique, i.e. that each of those 300 observations is used only once. — Socsi2, Apr 04 '21 at 13:16
It's generating random downsamples of the majority class 12 times. I did not realize you wanted unique observations so I'm editing my answer for that case. Essentially you can just order and pull out a different slice of the dataframe each time. — Carey Caginalp, Apr 04 '21 at 14:29

score 0 · Answer 2 · answered Apr 03 '21 at 18:22

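We could also use `replicate`, which evaluates the expression 12 times. With `simplify = FALSE` the result is a list of 12 data frames:

library(caret)
data(oil)   # loads fattyAcids and oilType
out <- replicate(12, downSample(fattyAcids, oilType), simplify = FALSE)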
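Note that repeated random draws like this are independent of each other, so they will generally reuse some majority rows and miss others. The base R sketch below illustrates this with plain row indices instead of `downSample`; the 300 majority rows and 25-row draws mirror the question, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so the draws are repeatable

# Hypothetical stand-in for the 300 majority-class row numbers
majority_ids <- 1:300

# 12 independent draws of 25 rows, analogous to calling downSample 12 times
draws <- replicate(12, sample(majority_ids, 25), simplify = FALSE)

length(draws)                            # one list element per repetition
n_used <- length(unique(unlist(draws)))  # distinct majority rows drawn overall
n_used
```

If `n_used` comes out below 300, some majority rows were never selected, which is why a disjoint split is needed when every observation must be used exactly once.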


akrun
