
I have a dataset with multiple observations per participant. Participants are denoted by id. To account for this in the cross-validation process, I add blocking = factor(id) to makeClassifTask() and blocking.cv = TRUE to makeResampleDesc(). However, if I leave id in the dataset, it will be used as a predictor. My question is: how do I use blocking correctly? My take would be to create a new variable, e.g. participant.id, outside of the dataset, remove id from the original dataset, and then use blocking = factor(participant.id), but I am not sure whether this is the correct way to handle blocking.
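
A minimal sketch of the setup described above (the data mydata, the target column outcome, the learner, and the number of folds are placeholders):

    library(mlr)

    # "id" is still a column of mydata here, so it is also treated as a predictor
    task  <- makeClassifTask(data = mydata, target = "outcome",
                             blocking = factor(mydata$id))
    rdesc <- makeResampleDesc("CV", iters = 3, blocking.cv = TRUE)
    res   <- resample(makeLearner("classif.rpart"), task, rdesc)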

00schneider
1 Answer


Rather than supplying a variable from the dataset for blocking, you can provide a custom factor vector that specifies which observations belong together. This is also shown in the tutorial.

This way you do not need to have the variable "participant.id" in the dataset.
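
A minimal sketch of what this could look like (mydata, the target column outcome, and the learner are placeholders; only the separate participant.id factor carries the grouping information):

    library(mlr)

    # keep the blocking structure in a separate factor and drop "id" from the
    # modelling data so it cannot be used as a predictor
    participant.id <- factor(mydata$id)
    mydata$id <- NULL

    task  <- makeClassifTask(data = mydata, target = "outcome",
                             blocking = participant.id)
    rdesc <- makeResampleDesc("CV", iters = 3, blocking.cv = TRUE)
    res   <- resample(makeLearner("classif.rpart"), task, rdesc)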

Also make sure that you really want to use "blocking". Did you have a look at "grouping" already? The differences between the two are also described in the linked tutorial section.

pat-s
  • To be honest, I am not sure if grouping or blocking is correct. I have 30 measurement points per participant and 6 participants in total. As far as I can tell, the difference is that grouping fixes the folds of the CV and thereby ensures folds of the same size. What is your recommendation? – 00schneider May 11 '19 at 11:20
    Right, "grouping" ensures an equal fold size but removes the ability to repeat the CV (i.e. you can only run one repetition). There is no general answer to this but if you have no problems with autocorrelation between the groups and enough groups (6 should be fine) then "blocking" is fine. If you have let's say < 4 groups, the differences in the fold sizes become quite substantial in "blocking" which might cause problems. – pat-s May 11 '19 at 15:27
  • Thanks. With grouping enabled, I get drastically worse performance measures (mmce of 0.4 with grouping vs. 0.005 without). That difference seems too large. Any idea why this could be the case? Is it safe to ignore the participant id and simply train the model without grouping? – 00schneider May 13 '19 at 16:34
  • Your model overfits on the training data, i.e. it is unable to generalize to an unseen participant that behaves very differently from the others. But usually this is what you want: a model that generalizes well and performs well on unseen data. It looks like you need more training data that describes all sorts of participant behavior. The good results in the non-grouping case indicate that the model relies heavily on the parts of the training data which are very similar to the test data (which might indicate (strong) autocorrelation). – pat-s May 13 '19 at 20:29