I'm not sure what approach to take to this problem (I'm new to both R and statistical analysis). I have a highly imbalanced class in my data set:


  PCL_Sum     n
*     <dbl> <int>
1         0   300
2         1    25

I realise that I could use downSample for this data to get a balanced set with 25 randomly selected 0s and my existing 25 1s. But, I would like to repeat this process 12 times so that all of my '0' data is used, leaving me with 12 sets of data.

I realise that I could do this 12 times by hand, but I'd like to automate the process. Could someone give me a general idea of how they would approach the problem? I realise that there is likely an answer out there but I'm having trouble understanding the documentation I've found. Thank you!

Socsi2
  • My general idea would be: (1) split the data into one tibble with `PCL_Sum == 0` and another with `PCL_Sum == 1`. (2) Re-order the rows of the `PCL_Sum == 0` tibble using `sample()`. (3) Get the first data set by taking rows `1:25`, the second by taking rows `26:50`, and so on (300 rows / 12 sets = 25 rows per set), adding the 25 `PCL_Sum == 1` rows to each. Does that help? – rcst Apr 03 '21 at 17:17

2 Answers

Is there something undesirable about downSample? It seems like you could just apply it 12 times and go from there for your samples. Here's an example.

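library(caret)   # provides downSample() and the oil example data
data(oil)        # loads fattyAcids (predictors) and oilType (class labels)
table(oilType)   # class counts before downsampling
downSample(fattyAcids, oilType)   # one random, balanced downsample
# 12 independent random downsamples, stored in a list
mysamples <- lapply(1:12, function(x){downSample(fattyAcids, oilType)})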

Then you can call mysamples[[1]] for the first set and so on.

> mysamples[[1]]
   Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
1      11.5     5.1  27.8     54.5       0.2        0.4        0.1     A
2      11.4     5.8  34.5     48.3       1.0        0.1        0.1     A
3       6.1     4.1  24.0     64.3       0.1        0.3        0.1     B
4       6.1     4.1  26.7     61.0       0.6        0.3        0.2     B
5       9.7     3.4  59.3     20.5       0.1        1.5        1.2     C
6       9.6     3.3  57.7     20.7       0.2        1.5        1.8     C
7       9.3     2.8  65.0     17.0       3.9        0.5        0.7     D
8      10.9     2.7  76.7      7.9       0.8        0.1        0.1     D
9      10.9     3.6  26.0     52.6       5.5        0.4        0.2     E
10     10.5     4.2  24.4     52.1       7.5        0.4        0.1     E
11      5.4     2.0  53.2     28.9       7.3        0.6        1.3     F
12      5.1     2.3  55.9     27.4       6.8        0.5        0.5     F
13     10.0     2.3  36.9     47.1       2.2        0.5        0.5     G
14     10.7     1.8  30.2     55.5       0.9        0.5        0.3     G
> mysamples[[2]]
   Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
1      13.0     6.2  25.8     55.0       0.8        0.1        0.1     A
2      13.1     5.7  31.7     49.5       0.6        0.1        0.1     A
3       5.6     4.2  25.7     58.9       1.7        2.8        0.9     B
4       6.1     4.1  24.0     64.3       0.1        0.3        0.1     B
5       9.6     3.3  57.7     20.7       0.2        1.5        1.8     C
6      10.0     3.3  60.0     21.3       0.2        1.5        1.3     C
7       9.3     2.8  65.0     17.0       3.9        0.5        0.7     D
8      14.9     2.6  68.2     12.8       0.6        0.4        0.3     D
9      10.9     3.6  26.0     52.6       5.5        0.4        0.2     E
10      9.7     3.9  25.1     54.2       5.9        0.1        0.1     E
11      5.1     2.3  55.9     27.4       6.8        0.5        0.5     F
12      5.5     1.7  59.0     21.3       9.3        0.6        1.5     F
13     10.7     1.8  30.2     55.5       0.9        0.5        0.3     G
14     10.0     2.3  36.9     47.1       2.2        0.5        0.5     G

Edit for unique samples:

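# Toy data: 25 minority rows ("A") followed by 300 majority rows ("B")
df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                 value = 1:325)
# Sample x = all 25 minority rows + the x-th block of 25 majority rows
mysamples <- lapply(1:12, function(x){df[c(1:25, (x * 25 + 1) : ((x+1) * 25)), ]})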

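This takes all 25 minority rows plus the first 25 rows of the majority class for sample 1, the next 25 majority rows for sample 2, and so on up to the 12th sample, so each of the 300 majority rows is used exactly once.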
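
If the majority rows are not already in random order, the slices follow the original row order. A minimal sketch of a shuffled variant, assuming the same toy `df` layout (class "A" as the minority, "B" as the majority) and an arbitrary seed chosen only for reproducibility, might look like this:

```r
# Toy data mirroring the example: 25 minority rows ("A"), 300 majority rows ("B")
df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                 value = 1:325)

set.seed(42)  # arbitrary seed (an assumption), only so the shuffle is reproducible

minority_idx <- which(df$class == "A")
majority_idx <- sample(which(df$class == "B"))  # shuffle the majority row indices

# Assign the shuffled majority rows to 12 disjoint groups of 25 each
groups <- split(majority_idx, rep(1:12, each = 25))

# Each data set = all minority rows + one disjoint group of majority rows
mysamples <- lapply(groups, function(idx) df[c(minority_idx, idx), ])

table(mysamples[[1]]$class)  # class counts in the first balanced set
```

Because the majority indices are shuffled once and then partitioned, no majority row can land in two sets. With real data, the class column and labels would need to be swapped for your own (for example `PCL_Sum` with values 0 and 1).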

Carey Caginalp
  • Thank you so much @Carey Caginalp! Just for clarification: what exactly does the 1:12 mean? I'd like to make sure that each group of 25 from the majority class is unique, i.e. that each of those 300 observations is used only once. – Socsi2 Apr 04 '21 at 13:16
  • It's generating random downsamples of the majority class 12 times. I did not realize you wanted unique observations so I'm editing my answer for that case. Essentially you can just order and pull out a different slice of the dataframe each time. – Carey Caginalp Apr 04 '21 at 14:29
  • Thank you so much! It seems to be working perfectly. – Socsi2 Apr 05 '21 at 11:26
We could also use `replicate`:

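library(caret)
data(oil)  # loads fattyAcids and oilType
# simplify = FALSE keeps the 12 downsampled data frames as a list
out <- replicate(12, downSample(fattyAcids, oilType), simplify = FALSE)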
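
Note that each repetition is an independent random draw, so the same majority row can appear in more than one of the 12 sets while other rows are never picked. The sketch below illustrates this with base R only, using `sample()` on hypothetical row ids as a stand-in for the random selection step (the ids, the seed, and the variable names are assumptions for illustration):

```r
set.seed(1)            # arbitrary seed, only for reproducibility
majority_ids <- 1:300  # hypothetical row ids for the 300 majority observations

# 12 independent draws of 25 ids; no repeats within a draw, but possible across draws
draws <- replicate(12, sample(majority_ids, 25), simplify = FALSE)

# Number of distinct majority ids used across all 12 draws
n_distinct <- length(unique(unlist(draws)))
n_distinct
```

If `n_distinct` is below 300, some observations were reused and others left out, which is why a shuffle-and-split approach is the better fit when every majority row must be used exactly once.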
akrun