Below is the downSample function of caret that I found here.
library(plyr)  # provides ddply() and .(), which the function below relies on

downSample <- function(x, y, list = FALSE, yname = "Class")
{
  xc <- class(x)
  if(!is.data.frame(x)) x <- as.data.frame(x)
  if(!is.factor(y))
  {
    warning("Down-sampling requires a factor variable as the response. The original data was returned.")
    return(list(x = x, y = y))
  }
  # size of the smallest class; every class is sampled down to this size
  minClass <- min(table(y))
  x$.outcome <- y
  x <- ddply(x, .(y),
             function(dat, n) dat[sample(seq(along = dat$.outcome), n),,drop = FALSE],
             n = minClass)
  y <- x$.outcome
  x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
  if(list)
  {
    if(xc[1] == "matrix") x <- as.matrix(x)
    out <- list(x = x, y = y)
  } else {
    out <- cbind(x, y)
    colnames(out)[ncol(out)] <- yname
  }
  out
}
Suppose that my data set is iris:
data(iris)
x <- iris[, -5]
y <- iris[, 5]
To make the response variable a heavily unbalanced binary one:
y[-c(130, 146)] <- "setosa"
y <- droplevels(y)  # drop the now-empty "versicolor" level so min(table(y)) is not 0
There are now two instances of "virginica" and 148 instances of "setosa". I would like to modify the downSample function so that, instead of returning a subsampled data set in which the minority class makes up 50%, it returns one with, for instance, 30% (k) minority class and 70% majority class. As written, for n instances in the minority class downSample selects n instances of the other class to get a fully balanced data set. In my case that loses a lot of data, so I just want to balance it somewhat, not fully.
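To make the target concrete: if all n minority rows are kept and they should make up a fraction k of the result, the majority class has to contribute about n * (1 - k) / k rows. A small sketch of that arithmetic (the helper name majority_size is hypothetical, not part of caret):

```r
# Hypothetical helper, not part of caret: number of majority-class rows
# needed so that n_minor minority rows make up a fraction k of the result.
majority_size <- function(n_minor, k) {
  stopifnot(k > 0, k < 1)
  round(n_minor * (1 - k) / k)
}

majority_size(2, 0.5)  # 2: fully balanced, which is what downSample does
majority_size(2, 0.3)  # 5: roughly 30% minority / 70% majority
```

Because of the rounding, the resulting proportion is only approximately k when n is small.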
Let's suppose that k = 20%, i.e. in the end I want 20% of minClass and 80% of the other class. I have already tried to modify this part of the function:
x <- ddply(x, .(y), function(dat, n)
dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)
by changing n to 4*n, but it did not work. I get this error:
Error in size <= n/2 : comparison (4) is possible only for atomic and list types
Your help would be appreciated.
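For reference, the behaviour I am after could look roughly like the sketch below. It is not caret's implementation: the function name partialDownSample and its argument k are made up for illustration, it uses base R only instead of ddply, and it assumes a two-class response where every minority row is kept.

```r
# Sketch only: partialDownSample and k are hypothetical names, not part of caret.
# Assumes a two-class response; all minority rows are kept.
partialDownSample <- function(x, y, k = 0.5, yname = "Class") {
  if (!is.data.frame(x)) x <- as.data.frame(x)
  y <- droplevels(as.factor(y))            # ignore empty factor levels
  stopifnot(nlevels(y) == 2, k > 0, k < 1)

  counts   <- table(y)
  minor    <- names(counts)[which.min(counts)]
  idxMinor <- which(y == minor)
  idxMajor <- which(y != minor)

  # Majority rows needed so the minority class is a fraction k of the result,
  # capped at the number of majority rows actually available.
  nMajor <- min(length(idxMajor),
                round(length(idxMinor) * (1 - k) / k))

  # Sample positions within idxMajor, so a length-one idxMajor is not
  # misread by sample() as "sample from 1:idxMajor".
  keep <- c(idxMinor, idxMajor[sample.int(length(idxMajor), nMajor)])

  out <- x[keep, , drop = FALSE]
  out[[yname]] <- y[keep]
  rownames(out) <- NULL
  out
}

data(iris)
x <- iris[, -5]
y <- iris[, 5]
y[-c(130, 146)] <- "setosa"

set.seed(1)
res <- partialDownSample(x, y, k = 0.2)
table(res$Class)  # 8 setosa, 2 virginica
```

With k = 0.2 and two minority rows this keeps 2 * 0.8 / 0.2 = 8 majority rows, which is the 20% / 80% split described above.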