Calculated Column Based on Rows in Tidymodels Recipe

Question

I have a dataframe as follows:

ID	Col1	RespID	Col3	Col4
1	blue	729Ad	3.2	A
2	orange	295gS	6.5	A
3	red	729Ad	8.4	B
4	yellow	592Jd	2.9	A
5	green	937sa	3.5	B

I would like to calculate a new column, Col5, such that its value is 1 if the row has Col4 value of A and there exists another column somewhere in the dataset a row with the same RespId but a Col4 value of B. Otherwise it’s value is 0. Then I will drop all rows with Col4 value of B, to keep just those with A. I would like to do this using the R tidymodels recipe package. I’d also like to do this all with data frames.

Here is what the desired output table would look like prior to dropping rows with Col4 value of B:

ID	Col1	RespID	Col3	Col4	Col5
1	blue	729Ad	3.2	A	1
2	orange	295gS	6.5	A	0
3	red	729Ad	8.4	B	0
4	yellow	592Jd	2.9	A	0
5	green	937sa	3.5	B	0

If this is not a statistically learned transformation from training data (like a mean or PCA) but instead something that is "constant" if you will, I think you may be better off doing such a transformation outside of recipes, before you start building your recipe. — Julia Silge, Nov 02 '21 at 21:03

score 1 · Answer 1 · answered Oct 23 '21 at 14:19

Does this work:

library(dplyr)
df %>% group_by(RespID) %>% mutate(col5 = case_when(Col4 == 'A' & last(Col4 == 'B') ~ 1, Col4 == 'B' & first(Col4 == 'B')  ~ 0, TRUE ~ 0))
# A tibble: 5 x 6
# Groups:   RespID [4]
     ID Col1   RespID  Col3 Col4   col5
  <int> <chr>  <chr>  <dbl> <chr> <dbl>
1     1 blue   729Ad    3.2 A         1
2     2 orange 295gS    6.5 A         0
3     3 red    729Ad    8.4 B         0
4     4 yellow 592Jd    2.9 A         0
5     5 green  937sa    3.5 B         0

score 1 · Answer 2 · answered Oct 23 '21 at 14:34

1

You may try -

library(dplyr)

df %>%
  group_by(RespID) %>%
  mutate(Col5 = as.integer(all(c('A', 'B') %in% Col4) & Col4 == 'A')) %>%
  ungroup

#     ID Col1   RespID  Col3 Col4   Col5
#  <int> <chr>  <chr>  <dbl> <chr> <int>
#1     1 blue   729Ad    3.2 A         1
#2     2 orange 295gS    6.5 A         0
#3     3 red    729Ad    8.4 B         0
#4     4 yellow 592Jd    2.9 A         0
#5     5 green  937sa    3.5 B         0

all(c('A', 'B') %in% Col4) checks that both A and B value is present in RespID whereas Col4 == 'A' would give 1 only where 'A' is present.

data

df <- structure(list(ID = 1:5, Col1 = c("blue", "orange", "red", "yellow", 
"green"), RespID = c("729Ad", "295gS", "729Ad", "592Jd", "937sa"
), Col3 = c(3.2, 6.5, 8.4, 2.9, 3.5), Col4 = c("A", "A", "B", 
"A", "B")), row.names = c(NA, -5L), class = "data.frame")

answered Oct 23 '21 at 14:34

Ronak Shah

377,200
20
156
213

Thanks for your help! I was wondering if there's a way to avoid using ```group_by```, but still specify the RespID within the ```mutate_ID``` clause somehow. I'm trying to integrate this into the recipes package which allows ```mutate()``` but not ```group_by``` – piper180 Oct 26 '21 at 00:39
I see. Sorry, I do not have experience working with `recipes` package. – Ronak Shah Oct 26 '21 at 01:44
Thanks Ronak. One other follow-up question, how would I specify this for a certain date range? I've updated my example to include a column with year, month, date. I would like to do this in groups based on say a 30 day timeframe -- so that if 'B' appears within 30 days of when 'A' appears in the dataset, only then is there a 1 present (if 'B' appears within 60 days, then there is no 1). – piper180 Nov 04 '21 at 00:13
Please ask follow up question as a new question instead of editing the original question. It invalidates the current answers and confuses future readers @ava – Ronak Shah Nov 04 '21 at 05:34
Reverted this post to the original and reposted the follow-up questions here: https://stackoverflow.com/questions/69885849/calculated-column-based-on-rows-with-date-range – piper180 Nov 08 '21 at 15:21

Calculated Column Based on Rows in Tidymodels Recipe

2 Answers2

Linked