6

With the caret package, to create a data partition of 75% training and 25% test data, we use:

inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)

Note: the dataset is named spam and the target variable is named type.

My question is: what is the purpose of including the y = spam$type argument?

Isn’t the purpose of creating data partitions simply to split the entire data set according to the proportion you require for training vs. testing? Why is there a need to include that argument in the code?

Imran Ali
Aiden
  • Not 100% sure, but I believe this just tells the command which variable to partition the data by. I'm not sure it is of much importance except to clarify how to partition; it is easier for the computer to understand – a.powell Jul 20 '16 at 20:13
  • Where did you get this function 'createDataPartition'? What does 'str(inTrain)' output? – aichao Jul 20 '16 at 20:17
  • @a.powell What do you mean by "to tell.. by what variable you are partitioning the data"? My understanding of partitioning is simply to split the entire data. Why should we bring up the fact that "type" is my target variable at this stage? Am I conceptually misunderstanding the idea of data partitioning? – Aiden Jul 20 '16 at 20:18
  • No you are correct, but the software may not understand as intuitively as you. This command just allows it to take the random sample from that class. – a.powell Jul 20 '16 at 20:25
  • @Zhenyuan Li Well, I did read the documentation, but it didn't provide me with clarity on this specific issue. Why would you assume that I haven't done so? The reason my question makes no sense to you is that we come from different learning paths, but that's okay; I have gotten my answer from Imran Ali below. Thank you anyway. – Aiden Jul 20 '16 at 20:43

2 Answers

10

I am assuming that the createDataPartition() in question is the one from the caret package.

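If the y argument (here spam$type) is a factor, which is generally the case for a classification target, the random sampling occurs within each class, so the split preserves the overall class distribution of the data.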

Some more explanation: suppose we partition the iris data set in the same proportion as in your question.

data(iris)
summary(iris)

Notice the counts listed against each species (50 rows each). Now use the following command:

library(caret)
inTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)

inTrain holds the row indices of approximately 75% of the rows of each species, which can be verified with the following command:

summary(iris[inTrain,])

There are 50 rows for each species, and 38 of them (approximately 75%) have been randomly selected for the training data set.
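To make "sampling within each class" concrete, here is a minimal base-R sketch of the idea. It is not caret's actual implementation: the helper name stratified_indices is made up for illustration, and rounding up with ceiling() is an assumption chosen so that 50 rows at p = 0.75 give the 38 rows seen above.

```r
# Hypothetical helper (not part of caret): draw a fraction p of the row
# indices separately within each level of the factor y.
stratified_indices <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)          # row indices grouped by class
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- ceiling(length(idx) * p)            # assumption: round up
    idx[sample.int(length(idx), n_take)]          # sample positions, map back to rows
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)
in_train <- stratified_indices(iris$Species, p = 0.75)

training <- iris[in_train, ]    # rows chosen for training
testing  <- iris[-in_train, ]   # all remaining rows

table(training$Species)         # per-class counts in the training set
table(testing$Species)          # per-class counts in the test set
```

Because the draw happens separately for every species, each class contributes the same share of its rows, which a single draw over all 150 rows would not guarantee.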

Imran Ali
  • Yes, I am referring to the caret package. spam$type is the target variable that I would like to predict later on after building a linear model. What do you mean by "random sampling occurs within each class."? – Aiden Jul 20 '16 at 20:25
  • I have added further explanation to the answer. You can see this easily by choosing a different value of `p`, e.g. 0.5, and examining how many rows are selected for the training set. – Imran Ali Jul 20 '16 at 20:40
0

df <- iris

Verify the class proportions of the dependent variable in the original dataset:

prop.table(table(iris$Species))

R output:

    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 

Create the split:

library(caret)
split <- createDataPartition(iris$Species, p = 0.30, list = FALSE)

Applying the split yields a stratified random sample, which keeps the class proportions of the original data.

To check:

prop.table(table(iris$Species[split]))

R output:

    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 
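
The iris classes are perfectly balanced, so the benefit is easy to miss. As a rough illustration of why the y argument matters, the base-R sketch below uses a made-up imbalanced target (90 "ham" and 10 "spam" labels; these data are invented and are not the spam dataset from the question) and compares a plain random sample of rows with sampling within each class. The within-class step mimics the idea only and is not caret's code; rounding up with ceiling() is an assumption.

```r
set.seed(42)

# Invented imbalanced target: 90% "ham", 10% "spam"
y <- factor(rep(c("ham", "spam"), times = c(90, 10)))
p <- 0.30

# Simple random sample: ignores y, so the class mix can drift
simple_idx <- sample.int(length(y), size = ceiling(length(y) * p))

# Sampling within each class: take a fraction p of every level separately
strat_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), ceiling(length(idx) * p))]
}), use.names = FALSE)

prop.table(table(y))              # original class proportions
prop.table(table(y[simple_idx]))  # varies from run to run
prop.table(table(y[strat_idx]))   # same proportions as the original here
```

With a rare class, the simple sample can end up with too few (or even zero) "spam" rows, while the within-class sample always keeps its share.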