Many feature engineering steps are transforms that do not need to be 'trained' on a dataset, for example, creating a new column x2 as x2 = 2*x1. These 'static transforms' are different from 'trainable' transforms such as demeaning and rescaling.

Instead of relying on recipes package functions such as step_mutate(), I would like to define a function, e.g. do_static_transforms(), that takes in a tibble and outputs a transformed tibble. I would like to add this as the first step of a recipe. Alternatively, I would like to add it as the first step in a workflow (workflows being another tidymodels package).

Is this a sensible and possible thing to do?

1 Answer

I recommend that you consider implementing these kinds of "static transforms" in a data manipulation step before you start using recipes or other tidymodels packages. For example, if you wanted to take the log() of an outcome such as price or divide a column by a scalar, you could do this before starting with tidymodels:

library(tidymodels)
#> ── Attaching packages ─────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0      ✓ recipes   0.1.13
#> ✓ dials     0.0.8      ✓ rsample   0.0.7 
#> ✓ dplyr     1.0.0      ✓ tibble    3.0.3 
#> ✓ ggplot2   3.3.2      ✓ tidyr     1.1.0 
#> ✓ infer     0.5.3      ✓ tune      0.1.1 
#> ✓ modeldata 0.0.2      ✓ workflows 0.1.2 
#> ✓ parsnip   0.1.2      ✓ yardstick 0.0.7 
#> ✓ purrr     0.3.4
#> ── Conflicts ────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
data(ames)

ames_transformed <-  ames %>%
  mutate(Sale_Price = log(Sale_Price),
         Lot_Area  = Lot_Area / 1e3)

Created on 2020-07-17 by the reprex package (v0.3.0)

You would then start from this ames_transformed object when splitting into training and testing sets. When predicting on new observations, you would apply the same transformations. Because these transformations are not learned from the data, there is no risk of data leakage.
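One way to keep the transformations identical for training data and new observations is to wrap them in a plain function, along the lines of the do_static_transforms() you describe. This is only a minimal sketch: the small houses tibble is made-up toy data standing in for ames, and using across(any_of(...)) so that the function still works when the outcome column is absent from new data is my own assumption about how you might want it to behave.

```r
library(dplyr)

# Hypothetical helper (name borrowed from the question). It is a pure
# function: nothing is estimated from the data it receives.
do_static_transforms <- function(data) {
  data %>%
    mutate(
      # any_of() skips the outcome if it is missing, e.g. in new data
      across(any_of("Sale_Price"), log),
      Lot_Area = Lot_Area / 1e3
    )
}

# Toy data standing in for ames (values are invented for illustration)
houses <- tibble(
  Sale_Price = c(100000, 200000, 300000, 400000),
  Lot_Area   = c(8000, 9500, 12000, 7000)
)

# Apply once, before any splitting or recipe
houses_transformed <- do_static_transforms(houses)

# New observations without the outcome go through the same function
new_houses <- tibble(Lot_Area = 10000)
new_transformed <- do_static_transforms(new_houses)
```

The transformed object is what you would then pass to rsample::initial_split(), and any recipe or workflow afterwards would only contain the steps that really do need to be trained.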

Julia Silge