EDIT: This is not a course with a live instructor and I cannot ask him directly in any way. If it were, I wouldn't be wasting your time here.

I am taking an R class that covers the basics of machine learning. We are working with the Vanderbilt Titanic dataset available HERE. The goal is to use the R mice package to impute missing age values. I've already split my data into training and test samples, and str(training) outputs:

'data.frame':   917 obs. of  14 variables:
 $ pclass   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
 $ name     : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
 $ age      : num  29 0.92 2 25 48 63 71 18 24 26 ...
 $ sibsp    : int  0 1 1 1 0 1 0 1 0 0 ...
 $ parch    : int  0 2 2 2 0 0 0 0 0 0 ...
 $ ticket   : chr  "24160" "113781" "113781" "113781" ...
 $ fare     : num  211.3 151.6 151.6 151.6 26.6 ...
 $ cabin    : chr  "B5" "C22 C26" "C22 C26" "C22 C26" ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
 $ boat     : chr  "2" "11" "" "" ...
 $ body     : int  NA NA NA NA NA NA 22 NA NA NA ...
 $ home.dest: chr  "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...

The instructor then goes on to write:

factor_vars <- c('pclass', 'sex', 'embarked', 'survived')

training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))

impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
mice_output <- complete(mice_model)
mice_output

I understand the factor_vars piece - these variables are labelled as factors in the structure output. What I don't understand is how the impute_variables were chosen or what they mean exactly. Are they arbitrarily chosen, perhaps on the basis that the instructor believed things like 'pclass' (which is the indicator for steerage, coach, or first class) may help predict age (with older people being able to afford first class perhaps) while things like 'cabin' would have no relevance?

Furthermore, in the line mice_model <- mice(training[,impute_variables], method='rf'), which portion of the function is declaring that we want to be imputing the age of the passengers?
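
For what it's worth, here is a minimal sketch of how one might check this empirically. As far as I can tell, mice() takes no argument naming a target column: it imputes every column that contains missing values and uses the other supplied columns as predictors. The data frame `toy` below is a made-up stand-in (hypothetical values, not the Titanic data), and the sketch uses the default imputation method rather than 'rf' so that no extra package is needed; both are assumptions for illustration only.

```r
library(mice)

# Toy stand-in for the training data (hypothetical values, not the Titanic set)
set.seed(1)
toy <- data.frame(
  pclass = factor(sample(1:3, 50, replace = TRUE)),
  sex    = factor(sample(c("female", "male"), 50, replace = TRUE)),
  age    = round(runif(50, 1, 70)),
  fare   = round(runif(50, 5, 200), 1)
)
toy$age[sample(50, 10)] <- NA   # only age has missing values

# Count missing values per column; only columns with NAs need imputing
colSums(is.na(toy))

# Small m and maxit just to keep the sketch fast
mice_model <- mice(toy, m = 2, maxit = 2, printFlag = FALSE)

# Imputation method per column; columns with no missing data get ""
mice_model$method

# Rows are the variables being imputed, columns are the predictors used
mice_model$predictorMatrix

# complete() returns the data with imputed values filled in
# (the first of the m imputed data sets by default)
completed <- complete(mice_model)
anyNA(completed$age)
```

If this reading is right, running the same checks on training[, impute_variables] should show which of the seven columns actually contain NAs, and age would be imputed simply because it is one of them.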

Richard Golz
  • These sound like questions to ask your instructor.... – emilliman5 Mar 29 '18 at 18:50
  • It is not a live course, that is not an option. I would not have asked the question otherwise – Richard Golz Mar 29 '18 at 18:56
  • Did you at least start by looking at the help page for the `?complete` function? Or maybe check out the resources vignette from the `mice` package for help in understanding how the package works: `vignette("resources", "mice")` – MrFlick Mar 29 '18 at 19:02
  • Yes - the help page says that complete "exports" the results which I'm a bit unclear on. The vignettes are a bit over my head, I started learning R less than a week ago. – Richard Golz Mar 29 '18 at 19:09
  • I'm voting to close this question as off-topic because this is not a programming question. It is a question about why the tutor selected certain parameters to impute. It can only be answered by that instructor – dww Mar 29 '18 at 20:49
  • I disagree. I don't actually care about the instructor's logic, just the functional reason for choosing any variable over another for imputation – Richard Golz Mar 29 '18 at 20:51
