tidy method of testing model parameters

Question

I would like to compare model performance for a bunch of models using the same predictors but different model parameters. This seems like the place to use broom to create a tidy output, but I can't figure it out. Here's some non-working code that helps suggest what I'm thinking about:

seq(1:10) %>%
do(fit = knn(train_Market, test_Market, train_Direction, k=.), score = mean(fit==test_Direction)) %>%
tidy()

For more context, this is part of one of the ISLR labs that we are trying to tidyverse-ify. You can see the entire lab here: https://github.com/AmeliaMN/tidy-islr/blob/master/lab3/lab3.Rmd

[Update: reproducible example] It's hard to make a minimal example here because of the need for data wrangling before model fitting, but this should be reproducible:

library(ISLR)
library(dplyr)
library(class)  # provides knn()

train = Smarket %>%
  filter(Year < 2005)
test = Smarket %>%
  filter(Year >= 2005)

train_Market = train %>%
  select(Lag1, Lag2)
test_Market = test %>%
  select(Lag1, Lag2)

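train_Direction = train %>%
  select(Direction) %>%
  .$Direction
# true labels for the test set, needed for the accuracy checks below
test_Direction = test %>%
  select(Direction) %>%
  .$Direction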

set.seed(1)
knn_pred = knn(train_Market, test_Market, train_Direction, k=1)
mean(knn_pred==test_Direction)

knn_pred = knn(train_Market, test_Market, train_Direction, k=3)
mean(knn_pred==test_Direction)

knn_pred = knn(train_Market, test_Market, train_Direction, k=4)
mean(knn_pred==test_Direction)

etc.

Are you trying to stick with dplyr/`do`? This seems a good fit for list-loops a la `lapply` or purrr functions. — aosmith, Sep 16 '16 at 17:17
Sorry Amelia. It's just that I was going through the link's write up and lost my attention. — alexwhitworth, Sep 16 '16 at 18:03

David Robinson · Accepted Answer · 2016-09-16T17:32:21.470

Since the output of each knn call (and the oracle, i.e. the true test labels) is a vector, this is a good case for tidyr's unnest, in combination with purrr's map and rep_along:

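library(class)   # knn()
library(purrr)   # map(), rep_along()
library(tidyr)   # unnest()
set.seed(1)      # knn() breaks ties at random

# Fit knn once per k, then unnest so there is one row per (k, test observation)
predictions <- data_frame(k = 1:5) %>%
  unnest(prediction = map(k, ~ knn(train_Market, test_Market, train_Direction, k = .))) %>%
  # repeat the true test labels so they line up with each block of predictions
  mutate(oracle = rep_along(prediction, test_Direction))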

The predictions variable is then organized as:

# A tibble: 1,260 x 3
       k prediction oracle
   <int>     <fctr> <fctr>
1      1         Up     Up
2      1       Down     Up
3      1         Up   Down
4      1         Up     Up
5      1         Up     Up
6      1       Down     Up
7      1       Down   Down
8      1       Down     Up
9      1       Down     Up
10     1         Up     Up
# ... with 1,250 more rows

Which can easily be summarized:

predictions %>%
  group_by(k) %>%
  summarize(accuracy = mean(prediction == oracle))

Note that you don't need broom here, since each output is a factor rather than a model object. If it were a model, you could use broom's tidy or augment on each fit and then unnest the result in a similar fashion.
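As a rough sketch of that model case: the example below is not from the lab. It assumes the built-in mtcars data and uses polynomial degree as a stand-in "model parameter", purely to illustrate the map, tidy, unnest pattern. The names `fits` and `degree` are hypothetical.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# Illustrative assumption: polynomial degree plays the role of the tuning
# parameter, and mtcars stands in for the real data.
fits <- tibble(degree = 1:3) %>%
  mutate(
    # one lm fit per degree, stored in a list column
    model  = map(degree, function(d) lm(mpg ~ poly(hp, d), data = mtcars)),
    # one tidy data frame of coefficients per fit
    tidied = map(model, tidy)
  ) %>%
  select(degree, tidied) %>%
  unnest(tidied)  # one row per (degree, coefficient)

fits
```

Swapping tidy for glance would give one row of fit statistics per model instead of one row per coefficient.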

One important aspect of this approach is that it extends to many combinations of parameters: build the grid with tidyr's crossing (or expand.grid) and use invoke_rows to apply the function to each row. For example, you could try variations of knn's l argument (the minimum vote required for a definite decision) alongside k:

crossing(k = 2:5, l = 0:1) %>%
  invoke_rows(knn, ., train = train_Market, test = test_Market, cl = train_Direction) %>%
  unnest(prediction = .out) %>%
  mutate(oracle = rep_along(prediction, test_Direction)) %>%
  group_by(k, l) %>%
  summarize(accuracy = mean(prediction == oracle))

This returns:

Source: local data frame [8 x 3]
Groups: k [?]

      k     l  accuracy
  <int> <int>     <dbl>
1     2     0 0.5396825
2     2     1 0.5277778
3     3     0 0.5317460
4     3     1 0.5317460
5     4     0 0.5277778
6     4     1 0.5357143
7     5     0 0.4841270
8     5     1 0.4841270
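For readers on newer package versions: invoke_rows has since moved from purrr to the purrrlyr package. The following is a minimal sketch of the same grid search using map2_dbl instead, which computes the accuracy directly for each row. It is an alternative phrasing, not the code that produced the table above; the name `results` is hypothetical, and because knn breaks ties at random the accuracies may not match that table exactly.

```r
library(ISLR)
library(class)
library(dplyr)
library(purrr)
library(tidyr)

# Same train/test split as in the question
train <- filter(Smarket, Year < 2005)
test  <- filter(Smarket, Year >= 2005)

train_Market    <- select(train, Lag1, Lag2)
test_Market     <- select(test, Lag1, Lag2)
train_Direction <- train$Direction
test_Direction  <- test$Direction

set.seed(1)

# One row per (k, l) combination, with the test accuracy computed in place
results <- crossing(k = 2:5, l = 0:1) %>%
  mutate(accuracy = map2_dbl(k, l, function(k, l) {
    pred <- knn(train_Market, test_Market, train_Direction, k = k, l = l)
    mean(pred == test_Direction)
  }))

results
```

Computing the summary inside the mapped function avoids the unnest and rep_along steps, at the cost of discarding the individual predictions.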
