R - caret createDataPartition returns more samples than expected

Question

I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:

library(caret)
createDataPartition(iris$Species, p=0.1)
# [1]  12  22  26  41  42  57  63  79  89  93 114 117 134 137 142

createDataPartition(iris$Sepal.Length, p=0.1)
# [1]   1  27  44  46  54  68  72  77  83  84  93  99 104 109 117 132 134

I understand the first result: I get a vector of 0.1*150 = 15 elements (150 is the number of samples in the dataset). I expected a vector of the same length from the second call, but I am getting 17 elements instead of 15.

Any ideas as to why I get these results?

I'm not sure why it does that. I'd solve it differently by `inds <- sample(1:nrow(iris), size=0.1*nrow(iris), replace=FALSE)` and then use `testdata <- iris[inds, ]` and `traindata <- iris[-inds, ]`. The `sample()` function does the same as `createDataPartition()` as far as I know, except that `createDataPartition()` takes into account that it creates a representative training dataset (some low values, some high value, etc.) — KenHBS, Oct 05 '17 at 09:50
You asked why you get these results, and you got a detailed explanation - why don't you accept the answer?? — desertnaut, Nov 28 '17 at 22:55

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

Sepal.Length is a numeric feature; from the online documentation:

For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.

groups: for numeric y, the number of breaks in the quantiles

with default value:

groups = min(5, length(y))

Here is what happens in your case:

Since you do not specify groups, it takes a value of min(5, 150) = 5 breaks; in that case, these breaks coincide with the quartile boundaries, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum - which you can see from the summary:

> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900

For numeric features, the function will take a percentage of p = 0.1 from each one of the (4) intervals defined by the above breaks (quantiles); let's see how many samples we have per such interval:

l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8))  # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4))  # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9))  # 35
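
The same counts can be reproduced without hard-coding the breaks. The sketch below assumes that the function bins a numeric y with cut() on the unique quantile breaks (my reading of the source, not something the documentation states verbatim); the names y, breaks, bins and counts are just illustrative:

```r
# Sketch only: mimics the binning that createDataPartition() is assumed
# to apply to a numeric y (cut() on quantile breaks); uses base R only.
y <- iris$Sepal.Length
groups <- min(5, length(y))   # documented default for the groups argument

# 5 break points (0%, 25%, 50%, 75%, 100%) define 4 intervals
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
bins <- cut(y, breaks, include.lowest = TRUE)

counts <- table(bins)
counts
# 41, 39, 35 and 35 samples in the four intervals
```

With include.lowest = TRUE the first interval is closed on both sides, which matches the >= 4.3 condition used for l1 above.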

Exactly how many samples will be returned from each interval? Here is the catch - according to line # 140 of the source code, it will be the ceiling of the product between the no. of samples and your p; let's see what this should be in your case for p = 0.1:

p = 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17

Bingo! :)
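
To check this reasoning for other inputs, here is a minimal sketch of a helper that predicts the length of the returned vector. expected_partition_size is a hypothetical name, not a caret function, and it assumes the per-group ceiling rule described above applies to factor levels as well as to the quantile bins of a numeric y:

```r
# Hypothetical helper (not part of caret): predicts how many indices
# createDataPartition() would return, assuming it samples
# ceiling(n_group * p) rows from every group.
expected_partition_size <- function(y, p, groups = min(5, length(y))) {
  if (is.numeric(y)) {
    # numeric y: bin on quantile breaks, as assumed above
    breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
    y <- cut(y, breaks, include.lowest = TRUE)
  }
  counts <- as.vector(table(y))
  counts <- counts[counts > 0]   # ignore empty groups
  sum(ceiling(counts * p))
}

expected_partition_size(iris$Species, p = 0.1)       # 15
expected_partition_size(iris$Sepal.Length, p = 0.1)  # 17
```

For Species each of the three classes has 50 rows, so 50 * 0.1 = 5 is already a whole number and nothing is rounded up; for Sepal.Length the bin products 4.1, 3.9, 3.5 and 3.5 are all rounded up, which accounts for the two extra samples.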
