Sepal.Length is a numeric feature; from the online documentation:
For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.
groups: for numeric y, the number of breaks in the quantiles
with default value:
groups = min(5, length(y))
Here is what happens in your case:
Since you do not specify groups, it takes a value of min(5, 150) = 5 breaks. In this case, these breaks coincide with the natural quantiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:
> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900
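To check the breaks directly rather than reading them off the summary, here is a minimal base R sketch. It assumes the breaks are plain quantiles at evenly spaced probabilities, which matches this case but is not guaranteed to mirror the package internals for other inputs; the variable names are only illustrative.

```r
y <- iris$Sepal.Length

# Default number of breaks when groups is not supplied
groups <- min(5, length(y))

# Evenly spaced probabilities: 0, 0.25, 0.5, 0.75, 1
probs <- seq(0, 1, length.out = groups)

# Assumption: breaks are ordinary sample quantiles at these probabilities
breaks <- quantile(y, probs = probs)
print(breaks)  # 4.3 5.1 5.8 6.4 7.9
```

The five values are the same ones reported by summary(), minus the mean.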
For numeric features, the function will take a fraction p = 0.1 of the samples from each one of the 4 intervals defined by the above breaks (quantiles); let's see how many samples we have in each interval:
l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8)) # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4)) # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9)) # 35
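The same counts can be obtained in one step with cut() and table(). This is just an equivalent sketch of the four length(which(...)) calls above, with the breaks typed in by hand from the summary; bins and counts are illustrative names.

```r
y <- iris$Sepal.Length

# Minimum, quartiles and maximum, as reported by summary()
breaks <- c(4.3, 5.1, 5.8, 6.4, 7.9)

# include.lowest = TRUE closes the first interval on the left,
# so the minimum value 4.3 is counted (same as the >= in l1)
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

counts <- table(bins)
print(counts)  # 41 39 35 35
```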
Exactly how many samples will be returned from each interval? Here is the catch: according to line #140 of the source code, it will be the ceiling of the product between the number of samples and your p; let's see what this should be in your case for p = 0.1:
p = 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17
Bingo! :)
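To make the effect of the rounding explicit, here is a small sketch that compares the per-interval ceiling with what you would get by rounding once over the whole sample. The counts are the ones computed above; per_interval, total and naive are illustrative names, not anything from the package.

```r
p <- 0.1
counts <- c(41, 39, 35, 35)  # samples per interval, as counted above

# Each interval is rounded up separately
per_interval <- ceiling(counts * p)  # 5 4 4 4
total <- sum(per_interval)           # 17

# Rounding once over all 150 samples would give the "expected" 15
naive <- round(sum(counts) * p)

print(c(total = total, naive = naive))
```

The two extra samples come from rounding up in every interval, so the gap between the returned size and p times the sample size can be as large as the number of intervals.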