I recommend implementing these kinds of "static transforms" in a data manipulation step before you start using recipes or other tidymodels packages. For example, if you wanted to take the log() of an outcome such as price, or divide a column by a scalar, you could do this before starting with tidymodels:
library(tidymodels)
#> ── Attaching packages ─────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8 ✓ rsample 0.0.7
#> ✓ dplyr 1.0.0 ✓ tibble 3.0.3
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.0
#> ✓ infer 0.5.3 ✓ tune 0.1.1
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.2
#> ✓ parsnip 0.1.2 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
data(ames)

# Static transforms: fixed functions and constants, nothing estimated from the data
ames_transformed <- ames %>%
  mutate(Sale_Price = log(Sale_Price),
         Lot_Area = Lot_Area / 1e3)
Created on 2020-07-17 by the reprex package (v0.3.0)
This ames_transformed object is then what you start from when splitting into training and testing sets. When predicting on new observations, you would apply the same transformations to them first. Because these transformations are not learned from the data (they use fixed functions and constants rather than statistics estimated from the training set), there is no risk of data leakage.
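One way to keep the transformations identical for the modeling data and for new observations is to wrap them in a small function. The sketch below is only an illustration under some assumptions: `apply_static_transforms()` is a hypothetical helper name (not part of tidymodels), the seed and the 80/20 split proportion are arbitrary, and `new_homes` is a stand-in for real new data built from the first few rows of ames. It uses `across(any_of(...))` so the outcome is only logged when it is present, since new observations usually won't have it.

```r
library(tidymodels)

data(ames)

# Hypothetical helper (not a tidymodels function): keeps the static
# transforms in one place so every data set is treated the same way.
apply_static_transforms <- function(data) {
  data %>%
    mutate(
      # Log the outcome only if the column exists (new data may lack it)
      across(any_of("Sale_Price"), log),
      Lot_Area = Lot_Area / 1e3
    )
}

# Transform first, then split into training and testing sets
ames_transformed <- apply_static_transforms(ames)

set.seed(123)  # arbitrary seed, for reproducibility only
ames_split <- initial_split(ames_transformed, prop = 0.8)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Stand-in for new observations: a few rows without the outcome column
new_homes <- ames %>%
  slice(1:5) %>%
  select(-Sale_Price)

new_homes_transformed <- apply_static_transforms(new_homes)
```

Since the outcome was logged before modeling, predictions from a model fit on `ames_train` would be on the log scale, so you would use `exp()` to get back to the original price units.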