Apologies if this has been asked before, but I was unable to find the corresponding info.
I am using recipe() from tidymodels and trying to create a model (eventually).
As I prepped my data, this is effectively what it looked like (apologies, I can't include the actual data).
Import:
myData = read.csv('file')
which gives me something along the lines of
ID NumericField1 StringField1... NumericFieldN StringFieldN
and all the data types are correct for numeric/non-numeric data
I then do some data manipulation and get to the point where I split my data:
train_test_split <- initial_split(data=myManipulatedData, prop=MySplitPct)
train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()
At this point I double-checked the data types in test_data before putting it in my recipe, and verified that all my numeric columns were still numeric and the other columns still had the correct types.
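For anyone who wants to reproduce the type check without the real data, here is a minimal sketch in base R. The data frame `toy` and its column names are hypothetical stand-ins for test_data, not the actual data.

```r
# Hypothetical stand-in for test_data: one numeric and one string column
toy <- data.frame(
  y_coordinate = c(-42, 0, 13.5),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are not numeric
non_numeric <- names(toy)[!sapply(toy, is.numeric)]
print(non_numeric)
```

Running the same two calls on the real test_data should list every column with its class, which is the check described above.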
so then I create my recipe:
my_recipe <- recipe(outcome ~ numeric_1 + numeric_2 + ... + string_1 + ..., data=test_data) %>%
update_role(numeric_n, numeric_m, new_role="ID")
However, after doing so, this is what the summary shows:
> summary(my_recipe)
# A tibble: 15 × 4
variable type role source
<chr> <list> <chr> <chr>
1 numeric_1 <chr [2]> predictor original
2 numeric_2 <chr [2]> predictor original
3 string_1 <chr [3]> predictor original
4 numeric_3 <chr [2]> predictor original
5 numeric_4 <chr [2]> predictor original
6 string_2 <chr [3]> predictor original
7 string_3 <chr [3]> predictor original
8 numeric_5 <chr [2]> predictor original
9 string_4 <chr [3]> predictor original
10 string_5 <chr [3]> predictor original
11 string_6 <chr [3]> predictor original
12 numeric_6 <chr [2]> predictor original
13 numeric_7 <chr [2]> ID original
14 numeric_8 <chr [2]> ID original
15 outcome_1 <chr [3]> outcome original
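A note on reading this output: `<chr [2]>` and `<chr [3]>` are how a tibble prints the cells of a list column, so they describe the container (a character vector of length 2 or 3) rather than the variable itself. The sketch below shows one way to look inside those cells. It assumes a recent version of recipes in which `type` is a list column, and it uses a hypothetical data frame `toy` in place of the real data.

```r
library(recipes)

# Hypothetical stand-in for test_data
toy <- data.frame(
  y_coordinate = c(-42, 0, 13.5),
  label = c("a", "b", "c"),
  outcome = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

rec <- recipe(outcome ~ y_coordinate + label, data = toy)
info <- summary(rec)

# Pull out the type tags stored for a single variable
y_type <- info$type[[which(info$variable == "y_coordinate")]]
label_type <- info$type[[which(info$variable == "label")]]
print(y_type)
print(label_type)

# Collapse every cell to one string so the whole column is readable at once
readable <- data.frame(
  variable = info$variable,
  type = sapply(info$type, paste, collapse = ", "),
  role = info$role
)
print(readable)
```

If the assumption about the list column holds, the numeric column should carry a "numeric" tag and the string columns a "nominal" tag, which would match the lengths 2 and 3 seen in the summary above.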
Just to tie it together, since I know it will be difficult for anyone to debug without the actual data, here is one column which I can share:
numeric_1 = y_coordinate
In summary(test_data) this is how it shows:
y_coordinate
Min. :-42.0000
1st Qu.:-15.0000
Median : 0.0000
Mean : -0.5718
3rd Qu.: 13.0000
Max. : 42.0000
So it is clear to me that test_data treats the field as numeric, but I don't understand why the recipe summary shows chr in the type column for it. Why does this happen?
TIA