While studying machine learning in R, I came across some code that uses the Vanderbilt Titanic dataset. It is part of a class with no live instructor or additional resources, so I have nowhere else to ask. The ultimate goal of the exercise is to predict survival from the other observed data. We have split the data into training and test sets, and running str(training) returns:

> str(training)
'data.frame':   917 obs. of  14 variables:
 $ pclass   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
 $ name     : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
 $ age      : num  29 0.92 2 25 48 63 71 18 24 26 ...
 $ sibsp    : int  0 1 1 1 0 1 0 1 0 0 ...
 $ parch    : int  0 2 2 2 0 0 0 0 0 0 ...
 $ ticket   : chr  "24160" "113781" "113781" "113781" ...
 $ fare     : num  211.3 151.6 151.6 151.6 26.6 ...
 $ cabin    : chr  "B5" "C22 C26" "C22 C26" "C22 C26" ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
 $ boat     : chr  "2" "11" "" "" ...
 $ body     : int  NA NA NA NA NA NA 22 NA NA NA ...
 $ home.dest: chr  "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...

My question is twofold. The first step in this process was to label and apply a function to the "factor variables" like so:

factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))

I understand the factor_vars assignment here, as those variables are clearly labelled as Factor when calling str(training). My question is why are we running the lapply function? It appears it is simply classifying the factor variables as factors. What is really happening in the training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x)) line of code?
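
To see what that line does, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the course data). `lapply()` calls `as.factor()` on each selected column and returns a list, and assigning that list back into `toy[factor_vars]` replaces those columns. On columns that are already factors, as in the `str(training)` output above, the call changes nothing, so it mainly acts as a safeguard for data that was read in as character or integer.

```r
# Hypothetical toy data: columns stored as integer and character, not factor
toy <- data.frame(
  pclass = c(1L, 3L, 2L, 3L),
  sex = c("male", "female", "female", "male"),
  age = c(22, 38, NA, 35),
  stringsAsFactors = FALSE
)

factor_vars <- c("pclass", "sex")

# Column classes before the conversion
before <- sapply(toy, class)

# lapply() applies as.factor() to each selected column and returns a list;
# assigning the list back replaces those columns in the data frame
toy[factor_vars] <- lapply(toy[factor_vars], as.factor)

# Column classes after the conversion
after <- sapply(toy, class)
print(before)
print(after)

# Running the same line on columns that are already factors changes nothing
toy_again <- toy
toy_again[factor_vars] <- lapply(toy_again[factor_vars], as.factor)
print(identical(toy, toy_again))
```

Passing `as.factor` directly is equivalent to the `function(x) as.factor(x)` wrapper used in the course code.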

The next step was to impute the missing values of the age variable.

impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
  1. Why was that specific subset of variables selected as impute_variables? What was the basis for including things like sex but not boat?
  2. Why are we subsetting the training data within the mice() function to only act on the impute_variables columns?
  3. The output printed when mice_model is created is:

     iter imp variable
      1   1  age
      1   2  age
      1   3  age
      1   4  age
      1   5  age
      2   1  age
      2   2  age
      2   3  age
      2   4  age
      2   5  age
      3   1  age
      3   2  age
      3   3  age
      3   4  age
      3   5  age
      4   1  age
      4   2  age
      4   3  age
      4   4  age
      4   5  age
      5   1  age
      5   2  age
      5   3  age
      5   4  age
      5   5  age

Where in any of the above code did we explicitly tell the mice() function to impute age?

Richard Golz
  • As far as I know, `mice()` uses the subset to impute all `NA` in every included variable. – LAP Mar 29 '18 at 21:12
  • That was a theory I was thinking about, but then why would the `mice_model` itself only show imputations occurring on the `age` variable? – Richard Golz Mar 29 '18 at 21:18
  • How can I mark this question as closed? You were correct, LAP: in this dataset, it just so happened that age was the only variable with any missing values. I guess I would still want to know why some variables were chosen for imputation over others, particularly if only one of them needed imputing. To be honest, the course is pretty garbage; I've run into several ambiguous and unclear examples like this. – Richard Golz Mar 29 '18 at 21:26
  • Just accept your own answer, it'll be fine. – LAP Mar 29 '18 at 21:30
  • Thanks. To confirm, I ran a substitution to populate some `NA` values in the `pclass` column and `mice()` imputed both. Thanks for being kind to a newbie! – Richard Golz Mar 29 '18 at 21:40

1 Answer

Short Answer: the instructor of the course routinely gives ambiguous and confusing examples.

Long Answer: As LAP pointed out, mice() imputes missing values in every variable passed to it. In this particular case, age was the only column among the variables passed to mice() that had any missing values, which is why it is the only variable listed in the output. Why the instructor chose to include the other variables is anybody's guess; he did not explain it in the book.
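
One way to check which columns `mice()` would have to fill in is to count the missing values per column before calling it. The sketch below uses a made-up data frame (`toy` and its values are hypothetical); with the course data the equivalent call would be `colSums(is.na(training[, impute_variables]))`.

```r
# Hypothetical toy data with missing values only in age
toy <- data.frame(
  pclass = factor(c(1, 3, 2, 3, 1)),
  sex = factor(c("male", "female", "female", "male", "female")),
  age = c(22, NA, 26, NA, 54),
  fare = c(7.25, 71.28, 7.92, 8.05, 51.86)
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that contain at least one NA
cols_to_impute <- names(na_counts)[na_counts > 0]
print(cols_to_impute)
```

Empty strings, such as the `""` level of `embarked` in the `str()` output, are not `NA`, so a check like this does not count them as missing.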

Richard Golz
  • Does `mice` give a different answer if you only include `age` in `impute_variables`? – hpesoj626 Mar 29 '18 at 23:24
  • Yes, it does. It gives the error `Data should be a matrix or data frame`. At least one other variable has to be included for the imputation to work. `mice()` both uses the impute_variables as predictors when imputing missing values and imputes any missing values those variables contain themselves. – Richard Golz Apr 02 '18 at 18:33
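
A plausible cause of that error, offered as an assumption because the exact call is not shown: selecting a single column with `df[, "age"]` makes base R drop the result to a plain vector, which is no longer a data frame. The sketch below demonstrates only the base R subsetting behaviour on a hypothetical `toy` data frame; it does not call `mice()`.

```r
# Hypothetical toy data
toy <- data.frame(age = c(22, NA, 26), fare = c(7.25, 71.28, 7.92))

# Selecting one column with [ , ] drops the result to a numeric vector
one_col <- toy[, "age"]
print(is.data.frame(one_col))

# drop = FALSE keeps the one-column result as a data frame
kept <- toy[, "age", drop = FALSE]
print(is.data.frame(kept))

# Selecting two or more columns always returns a data frame
two_cols <- toy[, c("age", "fare")]
print(is.data.frame(two_cols))
```

Even with `drop = FALSE`, a single column leaves the imputation model no other variables to predict from, which is consistent with the comment above that at least one other variable is needed.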