Apologies if this has been asked before, but I was unable to find the corresponding info.

I am using the recipes package from tidymodels and trying to create a model (eventually).

As I prepped my data, this is effectively what it looked like (apologies, I can't include the actual data):

Import:

myData = read.csv('file')

which gives me something along the lines of

ID NumericField1 StringField1... NumericFieldN StringFieldN

and all the data types are correct for numeric/non-numeric data.

I then go on to do some data manipulation and get to the point where I split my data:

train_test_split <- initial_split(data=myManipulatedData, prop=MySplitPct)

train_data <- train_test_split %>% training()
test_data <- train_test_split %>% testing()

At this point I double-checked the data types in test_data before putting it in my recipe, and verified that all my numeric data was still numeric, etc.

So then I create my recipe:

my_recipe <- recipe(outcome ~ numeric_1 + numeric_2 + ... + string_1 + ..., data=test_data) %>%
  update_role(numeric_n, numeric_m, new_role="ID")

However, after doing so, this is what it prints in the summary:

> summary(my_recipe)
# A tibble: 15 × 4
   variable              type      role      source  
   <chr>                 <list>    <chr>     <chr>   
 1 numeric_1             <chr [2]> predictor original
 2 numeric_2             <chr [2]> predictor original
 3 string_1              <chr [3]> predictor original
 4 numeric_3             <chr [2]> predictor original
 5 numeric_4             <chr [2]> predictor original
 6 string_2              <chr [3]> predictor original
 7 string_3              <chr [3]> predictor original
 8 numeric_5             <chr [2]> predictor original
 9 string_4              <chr [3]> predictor original
10 string_5              <chr [3]> predictor original
11 string_6              <chr [3]> predictor original
12 numeric_6             <chr [2]> predictor original
13 numeric_7             <chr [2]> ID        original
14 numeric_8             <chr [2]> ID        original
15 outcome_1             <chr [3]> outcome   original

To tie it together, since I know it will be difficult for anyone to debug without the actual data, here is one column I can share:

numeric_1 = y_coordinate

In summary(test_data) this is how it shows:

y_coordinate
Min.   :-42.0000
1st Qu.:-15.0000   
Median :  0.0000   
Mean   : -0.5718
3rd Qu.: 13.0000
Max.   : 42.0000

So it is clear to me that test_data knows the field is numeric, but I don't understand why the recipe summary shows chr in the type column.

TIA

  • Are you sure that it’s not numeric? The type column having a value of `<chr [3]>` means that there are three data types. It does not mean that the column is character. – topepo Jun 04 '23 at 13:22
  • Ahh I see, it's typing it as `'double' 'numeric'` or `'string' 'unordered' 'nominal'` and then the summary shows that the type is a list of 2 or 3 type tags, I guess. It makes much more sense now. Thanks! – Elliott Barinberg Jun 04 '23 at 14:39
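
For anyone landing here later, the point in the comments can be checked with a minimal sketch like the one below. It assumes a version of recipes where the type column of summary() is a list column, as in the output above. The toy data frame and its column names (other than y_coordinate) are hypothetical stand-ins, not the asker's data.

```r
library(recipes)

# Hypothetical toy data standing in for the real data set
toy <- data.frame(
  y_coordinate = c(-42, 0, 13.5),   # numeric (double) column
  label        = c("a", "b", "c"),  # character column
  outcome      = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

rec  <- recipe(outcome ~ ., data = toy)
info <- summary(rec)

# type is a list column: each element is a character vector of type tags,
# so <chr [2]> means "two tags", not "this variable is character"
info$type[[which(info$variable == "y_coordinate")]]

# Collapse the tags into one readable string per variable
info$type_tags <- vapply(info$type, paste, character(1), collapse = ", ")
info[, c("variable", "type_tags", "role")]
```

The printed tibble only shows the container (`<chr [2]>`, `<chr [3]>`); indexing into the list column or collapsing it as above shows the actual tags.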
