
(Updated at the end based on Julia's reply. TL;DR: This turns out to be an issue with the underlying kknn package, not with tidymodels.)

I'm fitting some k-nearest-neighbours regression models with tidymodels, via the nearest_neighbor() function. I want to see what difference normalization of the features makes to the results.

Now set_engine("kknn") uses the kknn::train.kknn() function under the hood, which has a normalization argument scale that defaults to TRUE. I want to compare models with scale = FALSE against models with scale = TRUE (actually, I'd like to do the normalization in a recipe instead, but that is not possible, as I'll explain below).

But it does not seem as if I am able to reliably set scale = FALSE through tidymodels. Below is a reprex showing what I see.

My questions, in short: Am I doing something wrong, or is this a bug? If it is a bug, is it known, and can I read about it somewhere? I'd be very grateful if someone can shed light on this.

Setup for the reprex

Here I'll use mtcars:

library(tidymodels)
data("mtcars")

A train-test split is:

set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)

Here is a common recipe I'll use:

mtcars_recipe <- recipe(mpg ~ disp + wt, data = mtcars)

Here is model 1 (called knn_FALSE) where scale = FALSE:

knn_FALSE <- nearest_neighbor(neighbors = 5) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale = FALSE)

Here is model 2 (called knn_TRUE) where scale = TRUE:

knn_TRUE <- nearest_neighbor(neighbors = 5) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale = TRUE)

I bundle these two models into two workflows:

## Workflow with scale = FALSE
wf_FALSE <- workflow() %>% 
  add_model(knn_FALSE) %>% 
  add_recipe(mtcars_recipe)

## Workflow with scale = TRUE
wf_TRUE <- workflow() %>% 
  add_model(knn_TRUE) %>% 
  add_recipe(mtcars_recipe)

Using fit(), it is possible to have scale = FALSE

It does seem to be possible to have one version with scale = TRUE and one with scale = FALSE when using fit() on a workflow.

For example, for scale = TRUE I get:

wf_TRUE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()

-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps

-- Model ------------------------------------------------------------------------------------------------------------

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~TRUE)

Type of response variable: continuous
minimal mean absolute error: 2.09425
Minimal mean squared error: 7.219114
Best kernel: optimal
Best k: 5

Whereas for scale = FALSE I have:

wf_FALSE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()

-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps

-- Model ------------------------------------------------------------------------------------------------------------

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~FALSE)

Type of response variable: continuous
minimal mean absolute error: 2.1665
Minimal mean squared error: 6.538769
Best kernel: optimal
Best k: 5

The results are clearly different, which comes from the difference in the scale parameter.
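
To double-check that scale really reaches the engine in both cases, we can pull the fitted model out of each trained workflow and inspect the underlying kknn call. This is a quick verification sketch; pull_workflow_fit() comes from the workflows package:

## Check that scale was passed through to kknn::train.kknn()
fit_TRUE  <- wf_TRUE  %>% fit(mtcars) %>% pull_workflow_fit()
fit_FALSE <- wf_FALSE %>% fit(mtcars) %>% pull_workflow_fit()

fit_TRUE$fit$call$scale   ## ~TRUE
fit_FALSE$fit$call$scale  ## ~FALSE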

But the plot thickens.

No difference with last_fit()

When using last_fit(), however, the results for scale = TRUE and scale = FALSE are identical.

For scale = TRUE:

wf_TRUE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.16 
2 rsq     standard       0.663

Whereas for scale = FALSE:

wf_FALSE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.16 
2 rsq     standard       0.663

These are clearly --- and unexpectedly --- the same.
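
To confirm that the row-level test-set predictions coincide, and not just the summarised metrics, we can also collect them; collect_predictions() is from the tune package. Given the identical metrics above, the .pred columns come out identical:

wf_TRUE %>% last_fit(mtcars_split) %>% collect_predictions()
wf_FALSE %>% last_fit(mtcars_split) %>% collect_predictions()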

There is also no difference when tuning using tune_grid()

If I do tuning with tune_grid() and a validation_split(), there is also no difference between the results for scale = TRUE and scale = FALSE.

Here is the code for that:

## Tune grid
knn_grid <- tibble(neighbors = c(5, 15))

## Tune Model 1: kNN regression with no scaling in train.kknn
knn_FALSE_tune <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale = FALSE)

## Tune Model 2: kNN regression with scaling in train.kknn
knn_TRUE_tune <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale = TRUE)

## Workflow with scale = FALSE
wf_FALSE_tune <- workflow() %>% 
  add_model(knn_FALSE_tune) %>% 
  add_recipe(mtcars_recipe)

## Workflow with scale = TRUE
wf_TRUE_tune <- workflow() %>% 
  add_model(knn_TRUE_tune) %>% 
  add_recipe(mtcars_recipe)

## Validation split
mtcars_val <- validation_split(mtcars)

## Tune results: Without scaling
wf_FALSE_tune %>% 
  tune_grid(resamples = mtcars_val, 
            grid = knn_grid) %>% 
  collect_metrics()

## Tune results: With scaling
wf_TRUE_tune %>% 
  tune_grid(resamples = mtcars_val, 
            grid = knn_grid) %>% 
  collect_metrics()

The results when scale = FALSE:

> wf_FALSE_tune %>% 
+   tune_grid(resamples = mtcars_val, 
+             grid = knn_grid) %>% 
+   collect_metrics()
# A tibble: 4 x 7
  neighbors .metric .estimator  mean     n std_err .config
      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
1         5 rmse    standard   1.64      1      NA Model1 
2         5 rsq     standard   0.920     1      NA Model1 
3        15 rmse    standard   2.55      1      NA Model2 
4        15 rsq     standard   0.956     1      NA Model2 

The results when scale = TRUE:

> wf_TRUE_tune %>% 
+   tune_grid(resamples = mtcars_val, 
+             grid = knn_grid) %>% 
+   collect_metrics()
# A tibble: 4 x 7
  neighbors .metric .estimator  mean     n std_err .config
      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
1         5 rmse    standard   1.64      1      NA Model1 
2         5 rsq     standard   0.920     1      NA Model1 
3        15 rmse    standard   2.55      1      NA Model2 
4        15 rsq     standard   0.956     1      NA Model2 

Question

Am I misunderstanding (or missing my own bug), or are the last_fit() and tune_grid() functions not respecting my choice for scale?

I'm new to tidymodels, so I might have missed something. Answers much appreciated.

I was hoping to use step_normalize() in a recipe to do the normalization myself, but since I cannot reliably set scale = FALSE in the underlying engine, I have not been able to experiment with that.
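
For reference, here is a sketch of the recipe-based comparison I had in mind, assuming scale = FALSE behaved as intended; step_normalize() centres and scales the predictors:

## Hypothetical: normalize in the recipe, with engine-level scaling turned off
mtcars_recipe_norm <- recipe(mpg ~ disp + wt, data = mtcars) %>% 
  step_normalize(all_predictors())

wf_norm <- workflow() %>% 
  add_model(knn_FALSE) %>% 
  add_recipe(mtcars_recipe_norm)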

Update after Julia's reply

As Julia shows, train.kknn() gives the same predictions for scale = FALSE and scale = TRUE, so this isn't a tidymodels issue. Rather, the kknn:::predict.train.kknn() function does not respect all parameters passed to train.kknn() when predicting.

Consider the following output which uses kknn() instead of train.kknn():

kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split), 
           test = testing(mtcars_split), k = 5, scale = FALSE) %>% 
  predict(newdata = testing(mtcars_split))
## [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split), 
           test = testing(mtcars_split), k = 5, scale = TRUE) %>% 
  predict(newdata = testing(mtcars_split))
## [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620

These are different, as they should be. The problem is that kknn:::predict.train.kknn() calls kknn(), but without passing along scale (and some other optional arguments):

function (object, newdata, ...) 
{
    if (missing(newdata)) 
        return(predict(object, ...))
    res <- kknn(formula(terms(object)), object$data, newdata, 
        k = object$best.parameters$k, kernel = object$best.parameters$kernel, 
        distance = object$distance)
    return(predict(res, ...))
}
<bytecode: 0x55e2304fba10>
<environment: namespace:kknn>
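
Until this is fixed upstream, one possible workaround is to bypass predict.train.kknn() and call kknn::kknn() directly, as in the snippet above. Here is a minimal helper sketch; predict_kknn() is my own hypothetical wrapper, not part of any package:

## Hypothetical wrapper: fit and predict with kknn() so that scale is respected
predict_kknn <- function(formula, train, test, k, scale) {
  fit <- kknn::kknn(formula = formula, train = train, test = test, 
                    k = k, scale = scale)
  predict(fit, newdata = test)
}

predict_kknn(mpg ~ disp + wt, training(mtcars_split), testing(mtcars_split), 
             k = 5, scale = FALSE)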
I've reported this to the kknn maintainer on GitHub: https://github.com/KlausVigo/kknn/issues/22 – pcs Nov 19 '20 at 14:25

1 Answer


I think you don't have a bug or problem but are just misunderstanding what last_fit() and friends are predicting on to estimate performance.

library(tidymodels)
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)

knn_FALSE <- nearest_neighbor(neighbors = 5) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale = FALSE)

knn_FALSE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#> 
#> Main Arguments:
#>   neighbors = 5
#> 
#> Engine-Specific Arguments:
#>   scale = FALSE
#> 
#> Computational engine: kknn 
#> 
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
#>     ks = min_rows(5, data, 5), scale = FALSE)

knn_TRUE <- nearest_neighbor(neighbors = 5) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale = TRUE)

knn_TRUE %>% translate()
#> K-Nearest Neighbor Model Specification (regression)
#> 
#> Main Arguments:
#>   neighbors = 5
#> 
#> Engine-Specific Arguments:
#>   scale = TRUE
#> 
#> Computational engine: kknn 
#> 
#> Model fit template:
#> kknn::train.kknn(formula = missing_arg(), data = missing_arg(), 
#>     ks = min_rows(5, data, 5), scale = TRUE)

Notice that both parsnip models are correctly passing the scale parameter to the underlying engine.

We can now add these two parsnip models to a workflow(), with a formula preprocessor (a recipe would be fine too).

wf_FALSE <- workflow() %>% 
  add_model(knn_FALSE) %>% 
  add_formula(mpg ~ disp + wt)

## Workflow with scale = TRUE
wf_TRUE <- workflow() %>% 
  add_model(knn_TRUE) %>% 
  add_formula(mpg ~ disp + wt)

The function last_fit() fits on the training data and predicts on the testing data. We can do that manually with our workflows. Importantly, notice that for these examples in the testing set, the predictions are the same, so the metrics you would get are the same.

wf_TRUE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#>   .pred
#>   <dbl>
#> 1  21.0
#> 2  21.8
#> 3  16.7
#> 4  16.1
#> 5  21.3
#> 6  16.4
#> 7  26.3
#> 8  16.1
#> 9  15.6
wf_FALSE %>% fit(training(mtcars_split)) %>% predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#>   .pred
#>   <dbl>
#> 1  21.0
#> 2  21.8
#> 3  16.7
#> 4  16.1
#> 5  21.3
#> 6  16.4
#> 7  26.3
#> 8  16.1
#> 9  15.6

The same thing is true for fitting the models directly:

knn_TRUE %>% 
  fit(mpg ~ disp + wt, data = training(mtcars_split)) %>% 
  predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#>   .pred
#>   <dbl>
#> 1  21.0
#> 2  21.8
#> 3  16.7
#> 4  16.1
#> 5  21.3
#> 6  16.4
#> 7  26.3
#> 8  16.1
#> 9  15.6
knn_FALSE %>% 
  fit(mpg ~ disp + wt, data = training(mtcars_split)) %>% 
  predict(testing(mtcars_split))
#> # A tibble: 9 x 1
#>   .pred
#>   <dbl>
#> 1  21.0
#> 2  21.8
#> 3  16.7
#> 4  16.1
#> 5  21.3
#> 6  16.4
#> 7  26.3
#> 8  16.1
#> 9  15.6

And in fact the same thing is true if we fit the underlying kknn model directly:

kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split), 
                 ks = 5, scale = FALSE) %>% 
  predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
kknn::train.kknn(formula = mpg ~ disp + wt, data = training(mtcars_split), 
                 ks = 5, scale = TRUE) %>% 
  predict(testing(mtcars_split))
#> [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620

Created on 2020-11-12 by the reprex package (v0.3.0.9001)

The scale parameter is correctly being passed to the underlying engine; it just doesn't change the prediction for these test cases.
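
One way to see that scale is indeed reaching train.kknn() is that the training-time leave-one-out error estimates differ between the two settings, even though the test predictions above do not. As a small check (the MEAN.SQU component of a train.kknn object holds its mean squared errors; based on the fits in the question, these should come out around 6.54 versus 7.22):

kknn::train.kknn(mpg ~ disp + wt, data = mtcars, ks = 5, scale = FALSE)$MEAN.SQU
kknn::train.kknn(mpg ~ disp + wt, data = mtcars, ks = 5, scale = TRUE)$MEAN.SQU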

Julia Silge
Thanks a lot for looking into this, Julia, and for your great work on these packages. I should have checked the kknn package first. It looks like there is an issue with predict.train.kknn() where optional arguments like scale aren't respected. So the predictions are the same, but they should not be. It would have worked fine using kknn() instead of train.kknn(), though. I'll update the question above to report this. – pcs Nov 19 '20 at 13:55