(Updated at the end based on Julia's reply. TL;DR: this seems to be an issue with the underlying kknn package rather than with tidymodels.)
I'm doing some k-nearest neighbours regression models with tidymodels, through the nearest_neighbor() function. I want to see what the difference is between the results with and without normalization of the features.
Now set_engine("kknn") uses the kknn::train.kknn() function under the hood, which has a normalization argument scale = TRUE. I want to compare models with scale = FALSE to models with scale = TRUE (actually, I want to do the normalization in a recipe, but that is not possible, as I'll explain below).
But it does not seem as if I am able to reliably set scale = FALSE through tidymodels. Below is a reprex showing what I see.
My questions, in short: Am I doing something wrong, or is this a bug? If it is a bug, is it known and can I read about it somewhere? I'd be very grateful if someone can shed light on this.
Setup for the reprex
Here I'll use mtcars:
library(tidymodels)
data("mtcars")
A train-test split is:
set.seed(1)
mtcars_split <- initial_split(mtcars, prop = 0.7)
Here is a common recipe I'll use:
mtcars_recipe <- recipe(mpg ~ disp + wt, data = mtcars)
Here is model 1 (called knn_FALSE), where scale = FALSE:
knn_FALSE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
Here is model 2 (called knn_TRUE), where scale = TRUE:
knn_TRUE <- nearest_neighbor(neighbors = 5) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
I bundle these two models into two workflows:
## Workflow with scale = FALSE
wf_FALSE <- workflow() %>%
add_model(knn_FALSE) %>%
add_recipe(mtcars_recipe)
## Workflow with scale = TRUE
wf_TRUE <- workflow() %>%
add_model(knn_TRUE) %>%
add_recipe(mtcars_recipe)
Using fit(), it is possible to have scale = FALSE
It does seem possible to have one version with scale = TRUE and one with scale = FALSE when using fit() on a workflow.
For example, for scale = TRUE I get:
wf_TRUE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~TRUE)
Type of response variable: continuous
minimal mean absolute error: 2.09425
Minimal mean squared error: 7.219114
Best kernel: optimal
Best k: 5
Whereas for scale = FALSE I have:
wf_FALSE %>% fit(mtcars)
== Workflow [trained] ===============================================================================================
Preprocessor: Recipe
Model: nearest_neighbor()
-- Preprocessor -----------------------------------------------------------------------------------------------------
0 Recipe Steps
-- Model ------------------------------------------------------------------------------------------------------------
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = ~5, scale = ~FALSE)
Type of response variable: continuous
minimal mean absolute error: 2.1665
Minimal mean squared error: 6.538769
Best kernel: optimal
Best k: 5
The results are clearly different, which comes from the difference in the scale parameter.
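As an additional check (not in the original output above), one can also compare the two fits at prediction time. Note that, given the update at the end of this post, these predictions may well come out identical even though the training diagnostics above differ:
## Extra check: predict on the held-out test set with both fitted workflows.
## predict() on a fitted workflow returns a tibble with a .pred column.
fit_TRUE <- wf_TRUE %>% fit(training(mtcars_split))
fit_FALSE <- wf_FALSE %>% fit(training(mtcars_split))
predict(fit_TRUE, new_data = testing(mtcars_split))
predict(fit_FALSE, new_data = testing(mtcars_split))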
But the plot thickens.
No difference with last_fit()
When using last_fit(), however, the results for scale = TRUE and scale = FALSE are identical.
For scale = TRUE:
wf_TRUE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
Whereas for scale = FALSE:
wf_FALSE %>% last_fit(mtcars_split) %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 3.16
2 rsq standard 0.663
These are clearly --- and unexpectedly --- the same.
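One extra check (not in the original reprex) is to compare the row-level predictions rather than just the metrics; collect_predictions() works on last_fit() results as well:
## Extra check: compare the row-level predictions, not just the summary metrics.
preds_TRUE <- wf_TRUE %>% last_fit(mtcars_split) %>% collect_predictions()
preds_FALSE <- wf_FALSE %>% last_fit(mtcars_split) %>% collect_predictions()
all.equal(preds_TRUE$.pred, preds_FALSE$.pred)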
There is also no difference when tuning using tune_grid()
If I do tuning with tune_grid() and a validation_split(), there is also no difference between the results for scale = TRUE and scale = FALSE.
Here is the code for that:
## Tune grid
knn_grid <- tibble(neighbors = c(5, 15))
## Model 1: kNN regression with no scaling in train.kknn
knn_FALSE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = FALSE)
## Model 2: kNN regression with scaling in train.kknn
knn_TRUE_tune <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn", scale = TRUE)
## Workflow with scale = FALSE
wf_FALSE_tune <- workflow() %>%
add_model(knn_FALSE_tune) %>%
add_recipe(mtcars_recipe)
## Workflow with scale = TRUE
wf_TRUE_tune <- workflow() %>%
add_model(knn_TRUE_tune) %>%
add_recipe(mtcars_recipe)
## Validation split
mtcars_val <- validation_split(mtcars)
## Tune results: Without scaling
wf_FALSE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
## Tune results: With scaling
wf_TRUE_tune %>%
tune_grid(resamples = mtcars_val,
grid = knn_grid) %>%
collect_metrics()
The results when scale = FALSE:
> wf_FALSE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
The results when scale = TRUE:
> wf_TRUE_tune %>%
+ tune_grid(resamples = mtcars_val,
+ grid = knn_grid) %>%
+ collect_metrics()
# A tibble: 4 x 7
neighbors .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 5 rmse standard 1.64 1 NA Model1
2 5 rsq standard 0.920 1 NA Model1
3 15 rmse standard 2.55 1 NA Model2
4 15 rsq standard 0.956 1 NA Model2
Question
Am I misunderstanding something (or missing my own bug), or are the last_fit() and tune_grid() functions not respecting my choice of scale?
I'm new to tidymodels, so I might have missed something. Answers much appreciated.
I was hoping to use step_normalize() in a recipe to do the normalization myself, but since I cannot reliably set scale = FALSE in the underlying engine, I have not been able to experiment with that.
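For reference, the comparison I had in mind would look roughly like the sketch below (engine scaling switched off, normalization done in the recipe instead); it depends on scale = FALSE actually being respected, which is the problem described above:
## Sketch of the intended comparison: normalization via the recipe,
## with the engine's own scaling turned off (knn_FALSE from above).
mtcars_recipe_norm <- recipe(mpg ~ disp + wt, data = mtcars) %>%
step_normalize(all_predictors())
wf_norm <- workflow() %>%
add_model(knn_FALSE) %>%
add_recipe(mtcars_recipe_norm)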
Update after Julia's reply
As Julia shows, train.kknn() gives the same predictions for scale = FALSE and scale = TRUE, so this isn't a tidymodels issue. Rather, the kknn:::predict.train.kknn() function does not respect all of the parameters that were passed to train.kknn() when predicting.
Consider the following output, which uses kknn() instead of train.kknn():
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = FALSE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.276 21.276 16.860 16.276 21.276 16.404 29.680 15.700 16.020
kknn::kknn(formula = mpg ~ disp + wt, train = training(mtcars_split),
test = testing(mtcars_split), k = 5, scale = TRUE) %>%
predict(newdata = testing(mtcars_split))
## [1] 21.032 21.784 16.668 16.052 21.264 16.404 26.340 16.076 15.620
These are different, as they should be. The problem is that kknn:::predict.train.kknn() calls kknn(), but without passing along scale (and some other optional arguments):
function (object, newdata, ...)
{
if (missing(newdata))
return(predict(object, ...))
res <- kknn(formula(terms(object)), object$data, newdata,
k = object$best.parameters$k, kernel = object$best.parameters$kernel,
distance = object$distance)
return(predict(res, ...))
}
<bytecode: 0x55e2304fba10>
<environment: namespace:kknn>