1

Below is the downSample function of caret that I found here .

downSample <- function(x, y, list = FALSE, yname = "Class")
  {
    xc <- class(x)
    if(!is.data.frame(x)) x <- as.data.frame(x)
    if(!is.factor(y))
      {
        warning("Down-sampling requires a factor variable as the response. The original data was returned.")
        return(list(x = x, y = y))
      }

    minClass <- min(table(y))
    x$.outcome <- y
    
    x <- ddply(x, .(y),
               function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
               n = minClass)
    y <- x$.outcome
    x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
    if(list)
      {
        if(xc[1] == "matrix") x <- as.matrix(x)
        out <- list(x = x, y = y)
      } else {
        out <- cbind(x, y)
        colnames(out)[ncol(out)] <- yname
      }
    out
  }

suppose that my data set is iris :

data(iris) 
x <- iris[, -5]
y <- iris[, 5]

to make the response variable a hugely unbalanced binary one :

y[-c(130, 146)] <- "setosa"

There are now therefore two instances of "virginica" and 148 instances of "setosa". I would like to modify the function downSample so that, in the end, instead of returning a subsampled data set with 50% of minClass, it returns a subsampled data set with for instance 30% (k) of minor class and 70% of major class. Because using the downSample function for n instances in the minClass it selects n instances of the other class to get a fully balanced data set. But in my case I loose a lot of data so I just want to balance it a bit not fully. Let's suppose that k = 20 % i.e. in the end I want 20% of minClaas and 80% of the other class. I have already tried to modify this part of function :

x <- ddply(x, .(y), function(dat, n) 
             dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)

by changing n to 4*n but I did not achieve it. There is this error :

Error in size <= n/2 : comparison (4) is possible only for atomic and list types

Your help would be appreciated.

UseR10085
  • 7,120
  • 3
  • 24
  • 54
Basilique
  • 150
  • 1
  • 11

1 Answers1

2

A simple way to perform this is to change the n = minClass part of the ddply call.

downSample_custom <- function(x, y, list = FALSE, yname = "Class", frac = 1){ #add argument frac which is in the 0 - 1 range
  xc <- class(x)
  if(!is.data.frame(x)) x <- as.data.frame(x)
  if(!is.factor(y))
  {
    warning("Down-sampling requires a factor variable as the response. The original data was returned.")
    return(list(x = x, y = y))
  }
  
  minClass <- min(table(y))
  x$.outcome <- y
  
  x <- ddply(x, .(y),
             function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
             n = minClass*frac) #change the n to this
  y <- x$.outcome
  x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
  if(list)
  {
    if(xc[1] == "matrix") x <- as.matrix(x)
    out <- list(x = x, y = y)
  } else {
    out <- cbind(x, y)
    colnames(out)[ncol(out)] <- yname
  }
  out
}

Does it work:

library(plyr)

imbalanced y:

set.seed(1)
y <- as.factor(sample(c("M", "F"),
                      prob = c(0.1, 0.9),
                      size = 10000,
                      replace = TRUE))


x <- rnorm(10000)

table(downSample_custom(x, y)[,2])

output:

   F    M 
1044 1044 

table(downSample_custom(x, y, frac = 0.5)[,2])

output:

  F   M 
522 522 

table(downSample_custom(x, y, frac = 0.2)[,2])

output

  F   M 
208 208

using frac > 1 returns an error:

downSample_custom(x, y, frac = 2)

output

Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

EDIT: answer to the updated question.

This can be achieved for instance by sampling the indexes of each class separately. Here is an example that works only for two class problems:

downSample_custom <- function(x, y, yname = "Class", frac = 1){
  lev <- levels(y)
  minClass <- min(table(y))
  lev_min <- levels(y)[which.min(table(y))]
  inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac) #sample the indexes of the more abundant class according to minClass * frac
  inds_minClass <- which(y == lev[lev == lev_min]) #take all the indexes of the lesser abundant class
  out <- data.frame(x, y)
  out <- out[sort(c(inds_down, inds_minClass)),]
  colnames(out)[ncol(out)] <- yname
  return(out)
} 

how it looks in practice:

table(downSample_custom(x, y)[,2])

output:

   F    M 
1044 1044 

table(downSample_custom(x, y, frac = 5)[,2])

output:

   F    M 
5220 1044 

head(downSample_custom(x, y, frac = 5))

output:

           x Class
1 -1.5163733     F
2  0.6291412     F
4  1.1797811     M
5  1.1176545     F
6 -1.2377359     F
7 -1.2301645     M
UseR10085
  • 7,120
  • 3
  • 24
  • 54
missuse
  • 19,056
  • 3
  • 25
  • 47
  • Thank you very much for your answer. I think my question was not asked clearly. In fact, what I am looking for is this : if we have `n` instances of `minClass` I want to choose k*n instances of the other class instead so that in the end we obtain a down sampled data set with for instance 30% of `minClass` and 70% of the other class. what `downSample` does is always 50%-50%. I don't want to have a totally balanced class since I will loose a lot of my data. I will correct my question. – Basilique Oct 23 '19 at 08:32
  • I have updated the answer to reflect your comment. Another approach you could use instead of sampling is to use an algorithm that supports class weights. And set the weight of the minority class to be higher. – missuse Oct 23 '19 at 09:07
  • great ! It works perfectly ! I am grateful for your help. – Basilique Oct 23 '19 at 09:23