
I have a large data set and would like to fit a separate logistic regression for each City, one of the columns in my data. The following 70/30 split works, but it does not take the City groups into account.

indexes <- sample(1:nrow(data), size = 0.7*nrow(data))

train <- data[indexes,]
test <- data[-indexes,]

But this does not guarantee a 70/30 split within each city.

Let's say I have City A and City B, where City A has 100 rows and City B has 900 rows, totaling 1000 rows. Splitting the data with the code above will give me 700 rows for the train data and 300 for the test data, but it does not guarantee that I will have 70 rows for City A and 630 rows for City B in the train data. How do I do that?

Once the training data is split 70/30 within each city, I will run a logistic regression for each city (I know how to do this once I have the train data).

user35577
  • You would need to assign the output of the lapply call to an object name. R is a functional language. Functions return values but they will be garbage collected if you don't save them. – IRTFM Dec 26 '13 at 05:38

5 Answers


Try createDataPartition from the caret package. Its documentation states: "By default, createDataPartition does a stratified random split of the data."

library(caret)
train.index <- createDataPartition(Data$Class, p = .7, list = FALSE)
train <- Data[ train.index,]
test  <- Data[-train.index,]
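
Applied to the question's setup, the stratification variable would be the City column. A minimal sketch, assuming the data frame and column names (data, City) from the question:

library(caret)
set.seed(123)                 # for reproducibility
train.index <- createDataPartition(data$City, p = 0.7, list = FALSE)
train <- data[ train.index, ]
test  <- data[-train.index, ]

# sanity check: share of each city's rows that ended up in train (roughly 0.7 per city)
table(train$City) / table(data$City)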

It can also be used for stratified k-fold cross-validation, like this:

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     ...)
# when calling train, pass this train control
train(...,
      trControl = ctrl,
      ...)

Check out the caret documentation for more details.
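
As a minimal sketch of such a call (assuming a hypothetical two-level factor outcome column named Y in the training data; with method = "glm", caret fits a binomial, i.e. logistic, model for a two-class factor):

library(caret)

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,   # 10-fold cross-validation ...
                     repeats = 3)   # ... repeated 3 times

fit <- train(Y ~ ., data = train,
             method = "glm",        # logistic regression for a two-class factor outcome
             trControl = ctrl)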

muon

The package splitstackshape has a nice function, stratified, which can do this as well. It is arguably a bit better than createDataPartition because it can stratify on multiple columns at once. With one column it can be used like this:

library(splitstackshape)
set.seed(42)  # good idea to set the random seed for reproducibility
stratified(data, c('City'), 0.7)

Or with multiple columns:

stratified(data, c('City', 'column2'), 0.7)
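
stratified() returns the sampled (here 70%) rows; if you also want the remaining 30% as a test set, it has a bothSets argument that returns both pieces. A sketch, using the same data/City names as above:

library(splitstackshape)
set.seed(42)
parts <- stratified(data, c('City'), 0.7, bothSets = TRUE)
train <- parts[[1]]   # the 70% sample, stratified within each City
test  <- parts[[2]]   # the remaining 30%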
wordsforthewise

The typical way is with split:

results <- lapply( split(dfrm, dfrm$City), function(dd) {
            indexes <- sample(1:nrow(dd), size = 0.7 * nrow(dd))
            train <- dd[indexes, ]    # Notice that you may want all columns
            test  <- dd[-indexes, ]
            # analysis goes here; return what you need for each city
            list(train = train, test = test)
            })

If you were to do it in steps as you attempted above it would be like this:

# split the row indices (not the data itself) by city, so the sampled
# indexes refer to rows of the full data frame
cities <- split(seq_len(nrow(data)), data$City)

idxs <- lapply(cities, function(rows) {
    sample(rows, size = 0.7 * length(rows))
})

train <- data[ idxs[[1]], ]                        # training rows for the first city
test  <- data[ setdiff(cities[[1]], idxs[[1]]), ]  # the rest of that city's rows

I happen to think this is the clumsy way to do it, but perhaps breaking it down into small steps will let you examine the intermediate values.
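
To handle every city at once and go straight to the per-city models, the same idea fits in one lapply. This is only a sketch: Y ~ X is a placeholder formula, and family = binomial is what makes glm a logistic regression.

city_rows <- split(seq_len(nrow(data)), data$City)

fits <- lapply(city_rows, function(rows) {
    train_rows <- sample(rows, size = 0.7 * length(rows))
    glm(Y ~ X, data = data[train_rows, ], family = binomial)  # placeholder formula
})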

IRTFM
  • Thanks for your note, but I don't think this works; there is no data in the train and test data sets. – user35577 Dec 26 '13 at 04:14
  • Change the "data"s to "dd". – IRTFM Dec 26 '13 at 04:25
  • Right. It would create those objects inside the function call, but what gets returned depends on the analysis. If you just ran that function, then it might or might not return anything. Furthermore, the results were not assigned anything. You never said what analysis you wanted done, so I just put in a placeholder. – IRTFM Dec 26 '13 at 05:35
  • Only the value of the RHS of that assignment would get returned. There should not be any object named 'indexes'. My more recent code used `[[.]]` to pull a vector out of a list. – IRTFM Dec 26 '13 at 05:44
  • @IRTFM: once I get the train data, I will run logistic regression for each city, something like the following: city_2 <- split(train, train$City); lapply(city_2, function(d) glm(X ~ Y, data = d)) – user35577 Dec 26 '13 at 05:45
  • Fine. Do it in two steps if you want. What you just wrote should work. – IRTFM Dec 26 '13 at 05:48
  • train <- data[ idxs[[1]], ] does not give me the data for the first city. If I do head(train) after this, I see rows for the other cities as well. – user35577 Dec 26 '13 at 06:01
  • You should edit your question so we can see the full sequence of code. (... and you should probably delete most of your comments to my answer.) – IRTFM Dec 26 '13 at 06:03

Your code works just fine as is. If City is a column, you can simply refer to it in the training data as train[, 2]. You can then do this easily for each one with an anonymous function:

logReg <- function(ind) {
    # fit a model involving the ind-th column of train;
    # replace 'WHATEVER' with your actual predictors (family = binomial makes it logistic)
    reg <- glm(train[, ind] ~ WHATEVER, family = binomial)
    # .... any further processing goes here
    return(reg)
}

Then run sapply over the vector of city indexes.
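
For instance (city_idx being a hypothetical vector of those column positions):

city_idx <- c(2, 5, 9)                                # hypothetical column positions
models <- sapply(city_idx, logReg, simplify = FALSE)  # keep the fitted models in a list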

evolvedmicrobe

Another possible way, similar to IRTFM's answer (i.e., using only base R), is the following. Note that this approach returns a stratified index, which can be used like the index calculated in the question.

p <- 0.7
strats <- your_data$the_stratify_variable

rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))

train <- your_data[idx, ]
test <- your_data[-idx, ]

Example:

p <- 0.7
strats <- mtcars$cyl

rr <- split(1:length(strats), strats)
idx <- sort(as.numeric(unlist(sapply(rr, function(x) sample(x, length(x) * p)))))

train <- mtcars[idx, ]
test <- mtcars[-idx, ]

table(mtcars$cyl) / nrow(mtcars)
#>       4       6       8
#> 0.34375 0.21875 0.43750 

table(train$cyl) / nrow(train)
#>    4    6    8
#> 0.35 0.20 0.45 

table(test$cyl) / nrow(test)
#>         4         6         8 
#> 0.3333333 0.2500000 0.4166667 

We see that all three data sets (all of mtcars, train, and test) have roughly the same class distributions!

David