I want to repeat specific rows of the minority class in my training set. I know this is not a very sophisticated approach, but I just want to try it out.

Suppose I have this data frame:

> df

   group   type number
1 class1    one      4
2 class1  three     10
3 class1   nine      3
4 class4  seven      9
5 class1  eight      4
6 class1    ten      2
7 class1    two     22
8 class4 eleven      8
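
To make the example reproducible, the data frame shown above can be recreated roughly as follows. This is a sketch: the original column types are not stated, so it assumes `group` and `type` are character columns and `number` is numeric.

```r
# Recreate the example data (column types are an assumption)
df <- data.frame(
  group  = c("class1", "class1", "class1", "class4",
             "class1", "class1", "class1", "class4"),
  type   = c("one", "three", "nine", "seven",
             "eight", "ten", "two", "eleven"),
  number = c(4, 10, 3, 9, 4, 2, 22, 8),
  stringsAsFactors = FALSE
)

# Rows per class: class1 is the majority, class4 the minority
table(df$group)
```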

Now I want to repeat the rows of my minority class (class4) enough times that a new data frame contains 50% class1 and 50% class4.

I know there is the function rep, but I could only find solutions that repeat the whole data frame.

How can I do this?

smci
pineapple
  • What if you have more than 2 groups? Then you want them to be divided into 33% each? – Ronak Shah Oct 22 '18 at 07:01
  • No, I will just have these two classes – pineapple Oct 22 '18 at 07:02
  • **You don't need to do this: if you only want to upweight the minority class in resampling, just set weights on each class inversely proportional to frequency.** Most classifiers (RF, tree, LR, NN etc.) allow weights. And if you want to get fancier with resampling the minority class by creating synthetic exemplars, use SMOTE. See [Dealing with the class imbalance in binary classification](https://stackoverflow.com/questions/26221312/dealing-with-the-class-imbalance-in-binary-classification) – smci Oct 22 '18 at 07:11
  • @smci Thanks for your comment! I already weighted my minority class in decision tree and used the SMOTE function, but results weren't that promising. – pineapple Oct 22 '18 at 07:16
  • @pineapple: hmm, please tell us more. What is your evaluation function for training? (raw accuracy? AUC? something else?) How large is the class imbalance? Please post the table. **As to SMOTE, please post your exact command line. Also post the before-and-after scores of your evaluation function.** – smci Oct 22 '18 at 07:19
  • See also e.g. [How to balance 1:1 with SMOTE in R](https://stackoverflow.com/questions/36651596/how-to-balance-11-with-smote-in-r) – smci Oct 22 '18 at 07:21
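
The weighting idea from the comments can be sketched in base R as follows. The data is the example from the question (column types assumed), and the name `w` is illustrative. Each row gets a weight of 1 divided by the size of its class, so both classes carry the same total weight.

```r
df <- data.frame(
  group  = c("class1", "class1", "class1", "class4",
             "class1", "class1", "class1", "class4"),
  number = c(4, 10, 3, 9, 4, 2, 22, 8),
  stringsAsFactors = FALSE
)

# Rows per class
tab <- table(df$group)

# Per-row weight, inversely proportional to the frequency of the row's class
w <- as.numeric(1 / tab[df$group])

# Total weight per class is now the same for both classes
tapply(w, df$group, sum)
```

A vector like `w` can then be passed to the `weights` argument of model functions that accept one, for example `glm()` or `rpart::rpart()`.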

3 Answers

Base R approach

# Count the rows in each group
tab <- table(df$group)

# Number of rows to add so that the smallest group matches the largest
no_of_rows <- max(tab) - min(tab)

# Row indices of the smallest group
existing_rows <- which(df$group %in% names(which.min(tab)))

# Append copies of those rows, cycling through them until enough are added
# (length.out also works when no_of_rows is not a multiple of the group size)
new_df <- rbind(df, df[rep(existing_rows, length.out = no_of_rows), ])

# Check the counts
table(new_df$group)

#class1 class4 
#     6      6 
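
The same idea can be wrapped in a small helper that draws the extra minority rows at random with replacement, which also covers the case where the majority count is not an exact multiple of the minority count. This is only a sketch: the helper name `oversample_minority` is hypothetical, it assumes exactly two classes, and the data frame is the example from the question with assumed column types.

```r
# Hypothetical helper: duplicate minority-class rows at random until both
# classes have the same number of rows (assumes exactly two classes).
oversample_minority <- function(data, class_col) {
  tab <- table(data[[class_col]])
  minority <- names(which.min(tab))
  n_extra <- max(tab) - min(tab)

  # Row indices of the minority class
  idx <- which(data[[class_col]] == minority)

  # Draw the extra rows with replacement; indexing into idx avoids the
  # sample() pitfall when idx has length 1
  extra <- idx[sample.int(length(idx), n_extra, replace = TRUE)]

  rbind(data, data[extra, , drop = FALSE])
}

df <- data.frame(
  group  = c("class1", "class1", "class1", "class4",
             "class1", "class1", "class1", "class4"),
  type   = c("one", "three", "nine", "seven",
             "eight", "ten", "two", "eleven"),
  number = c(4, 10, 3, 9, 4, 2, 22, 8),
  stringsAsFactors = FALSE
)

set.seed(1)  # for a reproducible draw
balanced <- oversample_minority(df, "group")
table(balanced$group)  # both classes now have 6 rows
```

Which minority rows get duplicated depends on the seed, but the class counts do not.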
Ronak Shah
  • @pineapple: you honestly don't need to do this, just set weights on your classifier. Pretty much all good classifier implementations support per-class or per-exemplar weights. – smci Oct 22 '18 at 07:29

Here is an option using tidyverse

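library(tidyverse)

# Size of the largest group
n1 <- df %>%
        count(group) %>%
        slice(which.max(n)) %>%
        pull(n)

# Repeat each class4 row n1 / (number of class4 rows) times,
# then append the untouched class1 rows
df %>%
   filter(group == "class4") %>%
   mutate(n = n1 / n()) %>%
   uncount(n) %>%
   bind_rows(filter(df, group == "class1"))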
#    group   type number
#1  class4  seven      9
#2  class4  seven      9
#3  class4  seven      9
#4  class4 eleven      8
#5  class4 eleven      8
#6  class4 eleven      8
#7  class1    one      4
#8  class1  three     10
#9  class1   nine      3
#10 class1  eight      4
#11 class1    ten      2
#12 class1    two     22
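
A variation that does not hard-code the class names is to draw the extra minority rows at random with replacement. This is a sketch, assuming dplyr 1.0.0 or later (for `slice_sample()`) and exactly two classes; the names `counts`, `minority`, `n_extra`, and `balanced` are illustrative, and the data frame is the example from the question with assumed column types.

```r
library(dplyr)

df <- data.frame(
  group  = c("class1", "class1", "class1", "class4",
             "class1", "class1", "class1", "class4"),
  type   = c("one", "three", "nine", "seven",
             "eight", "ten", "two", "eleven"),
  number = c(4, 10, 3, 9, 4, 2, 22, 8),
  stringsAsFactors = FALSE
)

counts   <- count(df, group)                     # rows per class
minority <- counts$group[which.min(counts$n)]    # name of the smaller class
n_extra  <- max(counts$n) - min(counts$n)        # rows needed to balance

set.seed(1)  # for a reproducible draw
balanced <- bind_rows(
  df,                                            # keep every original row
  df %>%
    filter(group == minority) %>%
    slice_sample(n = n_extra, replace = TRUE)    # random minority duplicates
)

count(balanced, group)
```

Unlike the `uncount()` version, the number of copies per minority row is random here, but every original row is kept and the two classes end up the same size.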
akrun

I would suggest using the "Synthetic minority over-sampling technique (SMOTE)" (Chawla et al. 2002) or "Random Over-Sampling Examples (ROSE)" (Menardi and Torelli, 2013).

1) You can adjust the sampling within each cross-validation fold by setting the sampling argument of caret's trainControl.

E.g.:

library(caret)

# sampling = "up" randomly duplicates minority-class rows inside each resample
trainControl(method = "repeatedcv",
             number = 10,
             repeats = 10,
             sampling = "up")

2) Or you can adjust the sampling before training by calling the SMOTE and ROSE functions directly.

library("DMwR") # for SMOTE
library("ROSE")

dat <- iris[1:70,]
dat$Species <- factor(dat$Species)

table(dat$Species) # class imbalance

#    setosa versicolor 
#        50         20 

set.seed(100)
smote_train <- SMOTE(Species ~ ., data = dat)
table(smote_train$Species)

#    setosa versicolor 
#        80         60 

set.seed(100)
rose_train <- ROSE(Species ~ ., data = dat)$data
table(rose_train$Species)

#    setosa versicolor 
#        37         33 
nadizan