Balanced sample with defined n in R

Question

I have an imbalanced dataset for sentiment analysis with about 65000 observations (~60000 positive and ~5000 negatives). This dataset should be balanced so that I have the same number of positive and negative observations to train my machine learning algorithms.

The caret package's downSample function gives me ~5000 negative and ~5000 positive observations (downsampling to the minority class). However, I would like to have exactly 2500 randomly selected positive and 2500 randomly selected negative observations. Does anyone know how to do this?

Also check package `ROSE`, function `SMOTE` from package `DMwR`, or package `unbalanced` - Pierre Lapointe, Mar 17 '19 at 21:58

score 3 · Accepted Answer · answered Mar 17 '19 at 22:51

If you just want 2500 of each class, group by class and sample within each group:

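library(tidyverse)
set.seed(123)  # arbitrary seed so the sample is reproducible

# Toy data with the same class imbalance as in the question
df <- data.frame(class = c(rep('POS', 60000), rep('NEG', 5000)), random = runif(65000))

# Draw 2500 rows at random, without replacement, from each class
result <- df %>%
  group_by(class) %>%
  sample_n(2500)
table(result$class)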
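
For readers who prefer to avoid the tidyverse dependency, the same per-class draw can be sketched in base R. This is only an illustration under the toy data above; the names `n_per_class`, `idx_by_class`, `keep`, and `balanced` are made up for the example, and the seed is arbitrary. (In dplyr 1.0.0 and later, `slice_sample(n = 2500)` is the successor to `sample_n(2500)` and can be swapped into the pipeline above.)

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Toy data mirroring the question's imbalance (assumption: same as above)
df <- data.frame(
  class  = c(rep("POS", 60000), rep("NEG", 5000)),
  random = runif(65000)
)

n_per_class <- 2500  # hypothetical name for the requested size per class

# Row indices grouped by class label
idx_by_class <- split(seq_len(nrow(df)), df$class)

# Pick n_per_class row indices from each class, without replacement
keep <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), n_per_class)]
}))

balanced <- df[keep, ]
table(balanced$class)
```

Indexing with `sample.int()` rather than calling `sample(idx, ...)` directly sidesteps the case where `sample()` treats a length-one vector as a range. The call fails with an error if a class has fewer than `n_per_class` rows, which is usually what you want here.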

mr.joshuagordon

score 1 · Answer 2 · answered Mar 20 '19 at 21:08

Ideally, you should do the subsampling inside the resampling procedure. I suggest using the sampling argument of trainControl to supply a custom down-sampling function. Using the code from @mr.joshuagordon:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
require(tidyverse)
#> Loading required package: tidyverse
df <-
  data.frame(
    class = factor(c(rep('POS', 60000), rep('NEG', 5000))),
    random1 = runif(65000),
    random2 = runif(65000)
  )

sampler <- function(x, y) {
  if (!is.data.frame(x))
    x <- as.data.frame(x)
  dat <- 
    x %>% 
    mutate(.y = y) %>% 
    group_by(.y) %>% 
    sample_n(2500) %>% 
    ungroup() %>% 
    as.data.frame()
  list(x = dat[, names(dat) != ".y", drop = FALSE], y = dat$.y)
}

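# Custom sampling spec: `name` is only a label, `func` does the work,
# and `first = TRUE` runs the subsampling before any pre-processing
samp_info <- list(name = "down_2500", func = sampler, first = TRUE)

ctrl <- trainControl(method = "cv", sampling = samp_info)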

lr_mod <- train(class ~ ., data = df, method = "glm", trControl = ctrl)
length(lr_mod$finalModel$residuals)
#> [1] 5000

Created on 2019-03-20 by the reprex package (v0.2.1)
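
To illustrate what "subsampling inside resampling" means, here is a rough base-R sketch of the idea. It is not caret's internal implementation: the helper `down_sample_n`, the fold assignment, and the seed are all assumptions made for this example. Within each fold, only the training rows are down-sampled to 2500 per class, while the held-out rows keep the original class imbalance, so performance is measured on data that looks like what the model will actually see.

```r
set.seed(2019)  # arbitrary seed, for reproducibility only

df <- data.frame(
  class   = factor(c(rep("POS", 60000), rep("NEG", 5000))),
  random1 = runif(65000),
  random2 = runif(65000)
)

# Hypothetical helper: n random rows per class, without replacement
down_sample_n <- function(d, n) {
  idx <- split(seq_len(nrow(d)), d$class)
  rows <- unlist(lapply(idx, function(i) i[sample.int(length(i), n)]))
  d[rows, , drop = FALSE]
}

k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(df)))  # random fold labels

fold_info <- lapply(seq_len(k), function(i) {
  analysis   <- down_sample_n(df[fold != i, ], 2500)  # balanced training rows
  assessment <- df[fold == i, ]                       # untouched held-out rows
  fit <- glm(class ~ ., data = analysis, family = binomial)
  list(
    train    = table(analysis$class),
    held_out = table(assessment$class),
    n_fit    = length(fit$residuals)
  )
})

fold_info[[1]]$train     # balanced counts used for fitting
fold_info[[1]]$held_out  # original imbalance preserved for evaluation
```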
