I came across some code while studying machine learning in R that uses the Vanderbilt Titanic dataset, available HERE. It is part of a class with no live instructor or other resources I can turn to with questions. The ultimate goal of the exercise is to predict survival from the other observed data. We have split the data into training and test sets, and running str(training) returns:
> str(training)
'data.frame': 917 obs. of 14 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
$ age : num 29 0.92 2 25 48 63 71 18 24 26 ...
$ sibsp : int 0 1 1 1 0 1 0 1 0 0 ...
$ parch : int 0 2 2 2 0 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211.3 151.6 151.6 151.6 26.6 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
$ boat : chr "2" "11" "" "" ...
$ body : int NA NA NA NA NA NA 22 NA NA NA ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
My question is twofold. The first step in this process was to label and apply a function to the "factor variables" like so:
factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))
I understand the factor_vars assignment here, as those variables are clearly labelled as Factor when calling str(training). My question is: why are we running the lapply function? It appears to simply classify the factor variables as factors. What is really happening in the training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x)) line of code?
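To check my own reading of that line, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical and not part of the course code. As far as I can tell, toy[factor_vars] selects the named columns, lapply() applies as.factor() to each one and returns a list, and the assignment writes the converted columns back in place. My understanding is that calling as.factor() on a column that is already a factor returns it unchanged, which would make the line a no-op for my training data.

```r
# Toy data frame (hypothetical values; not the Titanic data)
toy <- data.frame(
  pclass = c(1, 3, 2, 3),
  sex = c("female", "male", "male", "female"),
  age = c(29, NA, 40, 2),
  stringsAsFactors = FALSE
)

factor_vars <- c("pclass", "sex")

# Column classes before conversion
before <- sapply(toy, class)

# toy[factor_vars] is a data frame holding only those columns;
# lapply() runs as.factor() on each column and returns a list,
# which is assigned back to the same columns.
toy[factor_vars] <- lapply(toy[factor_vars], function(x) as.factor(x))

# Column classes after conversion
after <- sapply(toy, class)

print(before)
print(after)
```

Columns not named in factor_vars (age here) are left alone.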
The next step was to impute the missing values of the variable age.
impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
- Why was that specific subset of variables selected as impute_variables? What was the basis for including things like sex but not boat?
- Why are we subsetting the training data within the mice() function to only act on the impute_variables columns?
The output returned by mice_model is:
 iter imp variable
  1   1  age
  1   2  age
  1   3  age
  1   4  age
  1   5  age
  2   1  age
  2   2  age
  2   3  age
  2   4  age
  2   5  age
  3   1  age
  3   2  age
  3   3  age
  3   4  age
  3   5  age
  4   1  age
  4   2  age
  4   3  age
  4   4  age
  4   5  age
  5   1  age
  5   2  age
  5   3  age
  5   4  age
  5   5  age
Where in any of the above code did we explicitly tell the mice() function to impute age?
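One thing I can check myself is which of the columns passed in actually contain missing values. My working assumption is that mice() only imputes columns with NA values and uses the complete columns as predictors. The sketch below uses a made-up stand-in for training[, impute_variables]; the name toy and its values are hypothetical.

```r
# Toy stand-in for training[, impute_variables] (hypothetical values)
toy <- data.frame(
  pclass = factor(c(1, 3, 2, 3, 1)),
  sex = factor(c("female", "male", "male", "female", "male")),
  age = c(29, NA, 40, NA, 35),
  fare = c(211.3, 7.9, 13.0, 8.1, 26.6)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that have at least one NA
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

If running the same check on training[, impute_variables] shows NA values only in age, that would be consistent with age being the only variable listed in the output. Note that the empty-string level "" in embarked is not NA, so is.na() would not count it as missing.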