My data set consists of information collected from inpatients about their satisfaction with the services they received at the hospital. The data looks as follows (only a subset of the variables is shown):
$ Advised : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
$ Overall_Rate_Discharge_Process : Factor w/ 5 levels "1","2","3","4",..: 3 4 5 5 4 4 4 4 4 5 ...
$ Rights_Responsibilities : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 1 2 ...
$ Overall_Care : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 4 4 4 3 5 5 ...
$ Recommend_Employees : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 2 1 1 2 ...
$ NPSVal3.1 : Factor w/ 3 levels "Detractor","Passive",..: 3 2 3 3 3 2 2 1 3 3 ...
My objective is to find the factors that affect the NPSVal3.1 of the patients (using ordinal logistic regression). The NPSVal3.1 column does not have an equal number of rows for each level:
Detractor Passive Promoter
981 12932 8560
Therefore, I'm trying the "downsampling" method to select the training set. Below is the code I used (downSample() is from the "caret" package):
train3.1 <- downSample(mydata3.1, mydata3.1$NPSVal3.1)
When I checked head() and tail() of the training set, it didn't look random (the row IDs are in order and the rows are grouped by class):
> head(train3.1)
Discharge_Instructions_Treatment_Plans Advised Overall_Rate_Discharge_Process Rights_Responsibilities Overall_Care
1 1 1 2 1 3
2 1 1 4 0 4
3 1 0 4 0 5
4 1 1 3 1 4
5 1 1 4 0 4
6 1 0 4 1 4
Recommend_Employees NPSVal3.1 Class
1 0 Detractor Detractor
2 0 Detractor Detractor
3 0 Detractor Detractor
4 0 Detractor Detractor
5 0 Detractor Detractor
6 1 Detractor Detractor
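To make the question concrete, here is a minimal sketch of what I think is going on, using a made-up data frame rather than my real data. It assumes that downsampling draws the rows at random within each class and then returns them grouped by class, which would explain the ordered output above. The names toy, toy_down and toy_shuffled are hypothetical, and the downsampling is done by hand in base R, not through caret.

```r
set.seed(42)  # make the random draws reproducible

# Toy stand-in for mydata3.1 (hypothetical data, not the real survey)
toy <- data.frame(
  id = 1:60,
  NPSVal3.1 = factor(rep(c("Detractor", "Passive", "Promoter"),
                         times = c(5, 35, 20)))
)

# Downsample by hand: draw as many rows from each level as the rarest level has
n_min  <- min(table(toy$NPSVal3.1))
picked <- unlist(lapply(split(toy$id, toy$NPSVal3.1),
                        function(ids) sample(ids, n_min)))

# Rows are selected at random, but they come back grouped by class
toy_down <- toy[toy$id %in% picked, ]

# Shuffle the row order; the class counts stay the same
toy_shuffled <- toy_down[sample(nrow(toy_down)), ]
rownames(toy_shuffled) <- NULL

table(toy_shuffled$NPSVal3.1)
head(toy_shuffled)
```

If that assumption holds, the selection itself is already random and only the row order is sorted, so reordering with sample(nrow(...)) would be enough when the order matters.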
Also, when I extracted the test set, it doesn't look random either. Below is the code I used.
test3.1 <- dplyr::anti_join(mydata3.1, train3.1)
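As far as I understand, anti_join() without a by argument matches on every column the two data frames share. It would therefore drop every row of mydata3.1 whose values equal those of any training row, even when it belongs to a different patient. An alternative I am considering, sketched below on made-up data, is to keep an explicit row ID and build the test set from the IDs that were not sampled. The column row_id and the objects toy, train and test are hypothetical, and the downsampling is again done by hand in base R, not with caret.

```r
set.seed(1)  # make the random draws reproducible

# Toy stand-in for mydata3.1 with an explicit row identifier (hypothetical data)
toy <- data.frame(
  row_id    = 1:12,
  Advised   = factor(c(1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0)),
  NPSVal3.1 = factor(rep(c("Detractor", "Passive", "Promoter"),
                         times = c(2, 6, 4)))
)

# Sample the same number of row IDs from every class (size of the rarest class)
n_min     <- min(table(toy$NPSVal3.1))
train_ids <- unlist(lapply(split(toy$row_id, toy$NPSVal3.1),
                           function(ids) sample(ids, n_min)))

# Split by ID, so rows with identical answers are not dropped by accident
train <- toy[toy$row_id %in% train_ids, ]
test  <- toy[!(toy$row_id %in% train_ids), ]

table(train$NPSVal3.1)
nrow(test)
```

Splitting by ID keeps every unsampled row in the test set, so the training and test sets together cover the original data exactly once.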
Are these data sets random? If yes, how can I know that? If not, how can I make both train and test sets random? Thank you for your support!