I'd like to preface my question by stating that this appears to be a common issue:
- Incorrect splitting of data using sample.split in R and issue with logistic regression
- SplitRatio results with sample.split (caTools)
Yet, I cannot fix my problem using the solutions recommended in the first question, and the second was never answered.
In the following code, I would expect 100 observations for each of the four results, since iris has 150 rows and 150 * 2/3 = 100:
library(caTools)
set.seed(123)
isample <- sample.split(iris[,1], SplitRatio = 2/3, group = NULL)
iris2 <- iris[isample,]
isample2 <- sample.split(iris[,1], SplitRatio = 2/3, group = NULL)
iris3 <- subset(iris, isample2 == TRUE)
isample3 <- sample.split(iris$Sepal.Length, SplitRatio = 2/3, group = NULL)
sepal.length2 <- iris[isample3,1]
isample4 <- sample.split(iris$Sepal.Length, SplitRatio = 2/3, group = NULL)
sepal.length3 <- subset(iris[,1], isample4 == TRUE)
However, I get 104 observations in both iris2 and iris3, as well as in the vectors sepal.length2 and sepal.length3. I draw a new sample each time to rule out a one-off rounding quirk in the sampling. Using column 2 or 3 from iris returns 100 observations, but using column 5 returns 99. Why does changing the column change the number of observations selected? A common error with this function is to accidentally pass it the entire data frame, so that it splits based on the columns, but here I make sure to pass a vector each time. In the last two examples, I pass a vector and then subset a vector with the result, and it still does not return 100 observations.
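To make the per-column comparison easier to reproduce, here is a small sketch that counts how many rows sample.split selects for each column of iris. The names selected_per_column and distinct_per_column are just illustrative. I also print the number of distinct values per column, on the assumption (not something I have confirmed) that this might be what differs between the columns:

```r
library(caTools)
set.seed(123)

# Number of rows flagged TRUE by sample.split when each column is used as Y
selected_per_column <- sapply(iris, function(col) {
  sum(sample.split(col, SplitRatio = 2/3))
})
print(selected_per_column)

# Number of distinct values in each column, for comparison
distinct_per_column <- sapply(iris, function(col) length(unique(col)))
print(distinct_per_column)
```

The exact counts may depend on the seed and the caTools version, so I have not listed expected output here.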
If it helps, I'm running R 3.6.0 and caTools 1.18.0 on OS X. I would normally use the sample or sample.int function, so I am not all that familiar with caTools.
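For comparison, this is roughly what I would normally do in base R with sample.int. It is only a sketch, and the names n, train_idx, train and test are illustrative. It draws a fixed number of row indices without looking at the values in any column, so the split size does not depend on which column I use:

```r
set.seed(123)

n <- nrow(iris)                                     # 150 rows
train_idx <- sample.int(n, size = round(n * 2/3))   # 100 distinct row indices

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

nrow(train)  # 100
nrow(test)   # 50
```

This gives the 100/50 split I expected from sample.split, which is why the 104 and 99 results surprised me.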