Unable to create exactly equal data partitions using createDataPartition in R- getting 1396 and 1398 observations each but need 1397

Question

I am quite familiar with R but never had this requirement where I need to create exactly equal data partition randomly using createDataPartition in R.

library(caret)
index = createDataPartition(final_ts$SAR, p = 0.5, list = FALSE)
final_test_data = final_ts[index,]
final_validation_data = final_ts[-index,]

This code creates two datasets with sizes 1396 and 1398 observations respectively.

I am surprised that p=0.5 doesn't do what it is supposed to do. Does it have something to do with the resulting datasets not having an odd number of observations by default? Thanks in advance!

score 2 · Accepted Answer · answered Jan 04 '19 at 10:49

It has to do with the number of cases in each class of the response variable (final_ts$SAR in your case).

For example:

y <- rep(c(0,1), 10)
table(y)
y
0  1 
10 10 
# even number of cases

Now we split:

idx <- caret::createDataPartition(y, p=0.5, list=FALSE) # draw the index once
train <- y[idx]
table(train) # we have 10 obs.
train
0 1 
5 5 

test <- y[-idx] # complement of the same index
table(test) # we have 10 obs.
test
0 1 
5 5

If we instead build an example with an odd number of cases per class:

y <- rep(c(0,1), 11)
table(y)
y
0  1 
11 11

We have:

idx <- caret::createDataPartition(y, p=0.5, list=FALSE) # draw the index once
train <- y[idx]
table(train) # we have 12 obs.
train
0 1 
6 6 

test <- y[-idx] # complement of the same index
table(test) # we have 10 obs.
test
0 1 
5 5

More info here.
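
To make the rounding visible without caret, here is a minimal base-R sketch of per-class sampling. It assumes, as the outputs above suggest, that the sample size within each class is rounded up; `stratified_index` is a hypothetical helper written for illustration and is not part of caret.

```r
# Hypothetical helper (not caret's code): sample within each class,
# rounding the per-class sample size up.
stratified_index <- function(y, p = 0.5) {
  groups <- split(seq_along(y), y)          # row indices grouped by class
  picked <- lapply(groups, function(idx) {
    n_take <- ceiling(length(idx) * p)      # round up within each class
    idx[sample.int(length(idx), n_take)]    # draw n_take indices without replacement
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
y_even <- rep(c(0, 1), 10)   # 10 cases per class
y_odd  <- rep(c(0, 1), 11)   # 11 cases per class

idx_even <- stratified_index(y_even)
idx_odd  <- stratified_index(y_odd)

length(idx_even)            # 10 of 20: 5 + 5
length(y_even[-idx_even])   # 10
length(idx_odd)             # 12 of 22: ceiling(5.5) = 6 per class
length(y_odd[-idx_odd])     # 10
```

With an even count per class the halves match; with an odd count per class, each class contributes one extra case to the sampled half.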

RLave

Thanks for your answer, but if you add up 1396 and 1398 it is an even number and not odd. That's the reason I mentioned why can't it split into 1397 each, like it did with 10 observations splitting into 5 each and not 4 and 6 each. – Bharat Ram Ammu Jan 04 '19 at 10:54
I meant the number of cases, not the number of rows in the data, like in my two examples. First we have 10-10 (even), then 11-11 (odd) – RLave Jan 04 '19 at 10:56
Oh now I see, it was an indirect explanation, but I got the reason why it splits into unequal halves. But can you please help with how I can split equally irrespective of the distribution of my response variable? Like in your example, how can I split with 6 0's and 4 1's in train and vice versa in test? – Bharat Ram Ammu Jan 04 '19 at 12:07
I don't think this can be done with `createDataPartition` because by default it tries to balance the class distribution of `y`. – RLave Jan 04 '19 at 12:18
I suggest you ask a different question where you show your data and expected output with a reproducible example. – RLave Jan 04 '19 at 12:19
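
If an exactly equal split matters more than preserving the class balance, one possible approach is to skip stratification and sample row numbers directly with base R. This is only a sketch: the `final_ts` built here is made-up stand-in data (2794 rows, matching 1396 + 1398), and the `id` column is added purely for illustration.

```r
set.seed(42)
# Stand-in for final_ts: 2794 rows with a made-up SAR column (hypothetical data)
final_ts <- data.frame(id = seq_len(2794), SAR = rbinom(2794, 1, 0.3))

n <- nrow(final_ts)
index <- sample.int(n, size = floor(n / 2))  # exactly half the row numbers, no stratification

final_test_data <- final_ts[index, ]
final_validation_data <- final_ts[-index, ]

nrow(final_test_data)        # 1397
nrow(final_validation_data)  # 1397
```

The trade-off is that the proportion of each SAR value can differ slightly between the two halves, which is what createDataPartition tries to avoid. With an odd number of rows, `floor(n / 2)` leaves the second half one row larger.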

score 0 · Answer 2 · answered Jun 01 '20 at 19:41

Here is another thread which explains why the number returned by createDataPartition might seem "off" to us but is consistent with what the function is trying to do. It depends on what you have in final_ts$SAR and how the data is distributed. Suppose it is a categorical value, e.g. T and F, with 100 rows in total, of which 55 are T and 45 are F. When you call the function as in your code, it returns 51 indices because 55*0.5=27.5 and 45*0.5=22.5; each result is rounded up, so 28+23=51.
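
The arithmetic can be checked in a couple of lines of base R. `expected_partition_size` is a hypothetical helper, and it assumes the per-class rounding up described in this answer.

```r
# Hypothetical helper: expected size of the sampled partition
# when each class's share is rounded up.
expected_partition_size <- function(class_counts, p = 0.5) {
  sum(ceiling(class_counts * p))
}

counts <- c("T" = 55, "F" = 45)
counts * 0.5                      # 27.5 and 22.5
expected_partition_size(counts)   # 28 + 23 = 51
```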

You can refer to the thread below, which has a great explanation of this for the case where the values you want to split are numeric.

R - caret createDataPartition returns more samples than expected
