EDIT: This is not a course with a live instructor and I cannot ask him directly in any way. If it were, I wouldn't be wasting your time here.
I am taking an R class that deals with the basics of machine learning. We are working with the Vanderbilt Titanic dataset available HERE. The goal is to use the R mice package to impute missing age values. I've already split my data into train and test samples, and str(training) outputs:
'data.frame': 917 obs. of 14 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
$ age : num 29 0.92 2 25 48 63 71 18 24 26 ...
$ sibsp : int 0 1 1 1 0 1 0 1 0 0 ...
$ parch : int 0 2 2 2 0 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211.3 151.6 151.6 151.6 26.6 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
$ boat : chr "2" "11" "" "" ...
$ body : int NA NA NA NA NA NA 22 NA NA NA ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
The instructor then goes on to write:
factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))
impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
mice_output <- complete(mice_model)
mice_output
I understand the factor_vars piece: these variables are labelled as factors in the structure output. What I don't understand is how the impute_variables were chosen or what they mean exactly. Were they chosen arbitrarily, or on the basis that the instructor believed things like 'pclass' (the passenger class: first, second, or third) may help predict age (with older people perhaps being more able to afford first class), while things like 'cabin' would have no relevance?
Furthermore, in the line mice_model <- mice(training[,impute_variables], method='rf'), which part of the call declares that we want to impute the age of the passengers?
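To narrow the question down, here is a minimal sketch of how I would check which of the selected columns actually contain missing values, since nothing in the call names age explicitly. The toy data frame below and its values are made up purely for illustration (it is not the Titanic data), and only base R is used:

```r
# Hypothetical stand-in for training[, impute_variables]; values are invented.
toy <- data.frame(
  pclass = factor(c("1", "2", "3", "3")),
  sex    = factor(c("female", "male", "male", "female")),
  age    = c(29, NA, 2, NA),
  fare   = c(211.3, 26.6, NA, 7.9)
)

# Count the NA entries in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that have at least one NA.
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

On the real data the same check would be colSums(is.na(training[, impute_variables])). Note that in the str output above, embarked has an empty-string level "" rather than NA, so a check like this would not count those rows as missing.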