
My data set consists of information collected from inpatients about their satisfaction with the services they received at the hospital. The data look as below (only a subset of the variables is shown here):

 $ Advised                                : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
 $ Overall_Rate_Discharge_Process         : Factor w/ 5 levels "1","2","3","4",..: 3 4 5 5 4 4 4 4 4 5 ...
 $ Rights_Responsibilities                : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 1 2 ...
 $ Overall_Care                           : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 4 4 4 3 5 5 ...
 $ Recommend_Employees                    : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 2 1 1 2 ...
 $ NPSVal3.1                              : Factor w/ 3 levels "Detractor","Passive",..: 3 2 3 3 3 2 2 1 3 3 ...

My objective is to find the factors that affect the patients' NPSVal3.1 (using ordinal logistic regression). The NPSVal3.1 column does not have an equal number of rows for each level:

Detractor   Passive  Promoter 
      981     12932      8560 

Therefore, I'm trying the "downsampling" method to select the training set. Below is the code I used (from the "caret" library):

train3.1 <- downSample(mydata3.1, mydata3.1$NPSVal3.1)

When I checked the head() and tail() of the training set, it didn't look random (the row IDs are in order):

> head(train3.1)

  Discharge_Instructions_Treatment_Plans Advised Overall_Rate_Discharge_Process Rights_Responsibilities Overall_Care
1                                      1       1                              2                       1            3
2                                      1       1                              4                       0            4
3                                      1       0                              4                       0            5
4                                      1       1                              3                       1            4
5                                      1       1                              4                       0            4
6                                      1       0                              4                       1            4
  Recommend_Employees NPSVal3.1     Class
1                   0 Detractor Detractor
2                   0 Detractor Detractor
3                   0 Detractor Detractor
4                   0 Detractor Detractor
5                   0 Detractor Detractor
6                   1 Detractor Detractor

Also, when I extracted the test set, it didn't look random either. Below is the code I used:

test3.1 <- dplyr::anti_join(mydata3.1, train3.1)

Are these data sets random? If so, how can I verify that? If not, how can I make both the train and test sets random? Thank you for your support!

  • Not sure if the caret downsample function is the correct tool for the job. Why do you want an equal number of observations across your response categories? Would the caret 'createDataPartition' function be what you're looking for? – jared_mamrot May 04 '20 at 06:26
  • I thought that having an equal number of observations across levels would keep the train set from being biased towards one level. Let me try createDataPartition too. Thank you for the tip – user13178113 May 04 '20 at 06:44

2 Answers


You could also downsample to the size of the smallest category of the response (in your case "Detractor") using the sample function from base R:

# make some dummy data
mydata <- data.frame(
    categories = as.factor(c(rep("a", 100), rep("b", 200), rep("c", 300))), # want to sample these equally
    values = as.factor(sample(1:600)) # and get the corresponding values
)

# downsample to the size of the smallest category
n_smallest <- min(table(mydata$categories))

# for each level, draw n_smallest row indices at random
mysampledrows <- sapply(levels(mydata$categories),
    function(cat) sample(which(mydata$categories == cat), size = n_smallest)
)
mysampleddata <- mydata[mysampledrows, ]

# if you want the categories to also appear in random order:
mysampleddata <- mydata[sample(mysampledrows), ]
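
To confirm the draw is balanced, tabulate the response in the sampled data; on the dummy data above each level should appear n_smallest (here 100) times. Wrapping the sampling in set.seed() makes the draw reproducible.

table(mysampleddata$categories)
#   a   b   c 
# 100 100 100 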

If a new patient has a roughly equal chance of being "Detractor", "Passive" or "Promoter", downsampling makes sense, but I don't think you should use downsampling to select your train/test sets. My advice is to conduct the downsampling first, then use caret::createDataPartition to ensure a quasi-random split of your data into train/test, as sketched below.
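
A minimal sketch of that workflow, assuming your full data frame is mydata3.1 and an 80/20 split (the proportion is a placeholder; adjust to taste):

library(caret)

set.seed(123)

# downsample so every NPSVal3.1 level has the same number of rows;
# yname keeps the outcome column named NPSVal3.1 instead of the default "Class"
predictors  <- mydata3.1[, setdiff(names(mydata3.1), "NPSVal3.1")]
balanced3.1 <- downSample(x = predictors, y = mydata3.1$NPSVal3.1,
                          yname = "NPSVal3.1")

# stratified, quasi-random 80/20 split on the (now balanced) outcome
idx      <- createDataPartition(balanced3.1$NPSVal3.1, p = 0.8, list = FALSE)
train3.1 <- balanced3.1[idx, ]
test3.1  <- balanced3.1[-idx, ]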

Downsampling has a number of pitfalls/caveats to be aware of. The caret docs have some excellent discussion on this point: https://topepo.github.io/caret/subsampling-for-class-imbalances.html

A straightforward answer to your question (testing the 'randomness' of the sampling method) would be to set a seed prior to downsampling, then see whether changing the seed changes which patients are included/excluded in each data frame, e.g.

set.seed(123)
train3.1_v1 <- caret::downSample(mydata3.1, mydata3.1$NPSVal3.1)

set.seed(300)
train3.1_v2 <- caret::downSample(mydata3.1, mydata3.1$NPSVal3.1)

# rows sampled under seed 123 but not under seed 300
dplyr::anti_join(train3.1_v1, train3.1_v2)
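
If the anti_join() returns rows, the two seeds drew different subsets, i.e. the sampling is random rather than ordered. As a counterpart, rerunning with the same seed should reproduce the sample exactly:

set.seed(123)
train3.1_v3 <- caret::downSample(mydata3.1, mydata3.1$NPSVal3.1)
nrow(dplyr::anti_join(train3.1_v1, train3.1_v3)) # 0: same seed, same sample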
– jared_mamrot