
I have a large data set and would like to fit a separate logistic regression for each City, one of the columns in my data. The following 70/30 split works, but it does not take the City groups into account.

indexes <- sample(1:nrow(data), size = 0.7*nrow(data))

train <- data[indexes,]
test <- data[-indexes,]

But this does not guarantee a 70/30 split within each city.

Let's say I have City A and City B, where City A has 100 rows and City B has 900 rows, totaling 1000 rows. Splitting the data with the code above will give me 700 rows for the train data and 300 for the test data, but it does not guarantee that I will have 70 rows for City A and 630 rows for City B in the train data. How do I do that?

Once the training data is split 70/30 within each city, I will run a logistic regression for each city (I know how to do this once I have the train data).

user35577
  • You would need to assign the output of the lapply call to an object name. R is a functional language. Functions return values but they will be garbage collected if you don't save them. – IRTFM Dec 26 '13 at 05:38

5 Answers


Try createDataPartition from the caret package. Its documentation states: "By default, createDataPartition does a stratified random split of the data."

library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test  <- Data[-train.index,]
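
Applied to the question's setup, the stratification variable would be the City column. A minimal sketch, assuming the data frame and column names (data, City) from the question:

library(caret)
set.seed(123)                 # for reproducibility
train.index <- createDataPartition(data$City, p = 0.7, list = FALSE)
train <- data[ train.index, ]
test  <- data[-train.index, ]

# sanity check: share of each city's rows that ended up in train (roughly 0.7 per city)
table(train$City) / table(data$City)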

It can also be used for stratified k-fold cross-validation, like this:

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     ...)
# when calling train, pass this train control
train(...,
      trControl = ctrl,
      ...)

Check out the caret documentation for more details.
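
As a minimal sketch of such a call (assuming a hypothetical two-level factor outcome column named Y in the training data; with method = "glm", caret fits a binomial, i.e. logistic, model for a two-class factor):

library(caret)

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,   # 10-fold cross-validation ...
                     repeats = 3)   # ... repeated 3 times

fit <- train(Y ~ ., data = train,
             method = "glm",        # logistic regression for a two-class factor outcome
             trControl = ctrl)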

muon

The package splitstackshape has a nice function, stratified, which can do this as well. It is arguably a bit better than createDataPartition because it can stratify on multiple columns at once. With one column it can be used like this:

library(splitstackshape)
set.seed(42)  # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)

Or with multiple columns:

stratified(data, c('City', 'column2'), 0.7)
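
stratified() returns the sampled (here 70%) rows; if you also want the remaining 30% as a test set, it has a bothSets argument that returns both pieces. A sketch, using the same data/City names as above:

library(splitstackshape)
set.seed(42)
parts <- stratified(data, c('City'), 0.7, bothSets = TRUE)
train <- parts[[1]]   # the 70% sample, stratified within each City
test  <- parts[[2]]   # the remaining 30%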
wordsforthewise

The typical way is with split:

results <- lapply( split(dfrm, dfrm$City), function(dd) {
            indexes <- sample(1:nrow(dd), size = 0.7 * nrow(dd))
            train <- dd[indexes, ]    # Notice that you may want all columns
            test  <- dd[-indexes, ]
            # analysis goes here; return what you need for each city
            list(train = train, test = test)
            })

If you were to do it in steps as you attempted above it would be like this:

# split the row indices (not the data itself) by city, so the sampled
# indexes refer to rows of the full data frame
cities <- split(seq_len(nrow(data)), data$City)

idxs <- lapply(cities, function(rows) {
    sample(rows, size = 0.7 * length(rows))
})

train <- data[ idxs[[1]], ]                        # training rows for the first city
test  <- data[ setdiff(cities[[1]], idxs[[1]]), ]  # the rest of that city's rows

I happen to think this is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
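
To handle every city at once and go straight to the per-city models, the same idea fits in one lapply. This is only a sketch: Y ~ X is a placeholder formula, and family = binomial is what makes glm a logistic regression.

city_rows <- split(seq_len(nrow(data)), data$City)

fits <- lapply(city_rows, function(rows) {
    train_rows <- sample(rows, size = 0.7 * length(rows))
    glm(Y ~ X, data = data[train_rows, ], family = binomial)  # placeholder formula
})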

IRTFM
  • Thanks for your note, but I don't think this works; there is no data in the train and test data sets. – user35577 Dec 26 '13 at 04:14
  • Change the "data"s to "dd". – IRTFM Dec 26 '13 at 04:25
  • Right. It would create those objects inside the function call, but what gets returned depends on the analysis. If you just ran that function, then it might or might not return anything. Furthermore, the results were not assigned anything. You never said what analysis you wanted done, so I just put in a placeholder. – IRTFM Dec 26 '13 at 05:35
  • Only the value of the RHS of that assignment would get returned. There should not be any object named 'indexes'. My more recent code used `[[.]]` to pull a vector out of a list. – IRTFM Dec 26 '13 at 05:44
  • @IRTFM: once I get the train data, I will run logistic regression for each city, something like the following: city_2 <- split(train, train$City); lapply(city_2, function(d) glm(X ~ Y, data = d)) – user35577 Dec 26 '13 at 05:45
  • Fine. Do it in two steps if you want. What you just wrote should work. – IRTFM Dec 26 '13 at 05:48
  • train <- data[ idxs[[1]], ] does not give me the data for the first city. If I do head(train) after this, I see rows for the other cities as well. – user35577 Dec 26 '13 at 06:01
  • You should edit your question so we can see the full sequence of code. (... and you should probably delete most of your comments to my answer.) – IRTFM Dec 26 '13 at 06:03

Your code works just fine as is. If City is a column, you can simply refer to it in the training data as train[, 2]. You can then do this easily for each one with an anonymous function:

logReg <- function(ind) {
    # fit a model involving the ind-th column of train;
    # replace 'WHATEVER' with your actual predictors (family = binomial makes it logistic)
    reg <- glm(train[, ind] ~ WHATEVER, family = binomial)
    # .... any further processing goes here
    return(reg)
}

Then run sapply over the vector of city indexes.
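
For instance (city_idx being a hypothetical vector of those column positions):

city_idx <- c(2, 5, 9)                                # hypothetical column positions
models <- sapply(city_idx, logReg, simplify = FALSE)  # keep the fitted models in a list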

evolvedmicrobe

Another possible way, similar to IRTFM's answer (i.e., using only base R), is the following. Note that this approach returns a stratified index, which can be used like the index calculated in the question.

p <- 0.7
strats <- your_data$the_stratify_variable

rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))

train <- your_data[idx, ]
test <- your_data[-idx, ]

Example:

p <- 0.7
strats <- mtcars$cyl

rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))

train <- mtcars[idx, ]
test <- mtcars[-idx, ]

table(mtcars$cyl) / nrow(mtcars)
#>       4       6       8
#> 0.34375 0.21875 0.43750 

table(train$cyl) / nrow(train)
#>    4    6    8
#> 0.35 0.20 0.45 

table(test$cyl) / nrow(test)
#>         4         6         8 
#> 0.3333333 0.2500000 0.4166667 

We see that all three data sets (all of mtcars, train, and test) have roughly the same class distributions!

David