1

The background

Question edited heavily for clarity

I have data like this:

    df<-structure(list(fname = c("Linda", "Bob"), employee_number = c("00000123456", 
"654321"), Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 
0), CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1), ResearchNurse = c(0, 
0)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

enter image description here

In a previous question I asked on here, I mentioned that I needed to pivot this data from wide to long in order to export it elsewhere. Answers worked great!

Problem is, I discovered that some of the people in my dataset didn't fill out their surveys correctly and have all zero's in certain problematic columns. I.e. when they get pivoted and filtered to "1" values, they get dropped.

Luckily (depending on how you think about it) I can fix their mistakes. If they left those columns blank, I can populate what they should have based on their other columns. I.e. what they filled out under "CRA","Regulatory", "Finance" or "ResearchNurse" will determine whether they get 1's or 0's in "Calendar","Protocol" or "Subject"

To figure out what goes in those columns, we created this matrix of job responsibilities:

    jobs<-structure(list(`Roles (existing)` = c("Calendar Build", "Protocol Management", 
"Subject Management"), `CRA/ Manager/ Senior` = c(1, 1, 0), Regulatory = c(0, 
1, 1), Finance = c(0, 0, 0), `Research Nurse` = c(1, 0, 1)), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

enter image description here

So if you're following so far, no matter what "Bob" put in his columns for "Calendar", "Protocol" or "subject" (he currently has zeros), it will be overwritten based on what he put in other columns. So if Bob put a "1" in his 'Regulatory' column, based on that matrix I screenshotted, he should get a 1 in both the protocol and subject columns.

The specific question

So how do I tell R, "look at bob's "CRA,Regulatory, Finance, and researchNurse" columns, and then crossreference the "jobs" dataframe, and overwrite his "calendar, protocol, and subjects" columns?

My expected output in this particular case would be:

enter image description here

One last little detail: I could see instances where (depending on the order), numbers would overwrite each other. I.e. if Bob should get a 1 in protocol because he's got a 1 in regulatory... but he's got a 1 in finance which would mean he should get a 0 in protocol..... When in doubt, if a column is overwritten with a 1, it should never be turned back into a zero. I hope that makes sense.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Joe Crozier
  • 944
  • 8
  • 20
  • 1
    I'm a bit overwhelmed by your post. Te problem would become more clear to me (I hope) if you could post a small sample-data set which contains the actual problem, and a desired output based on this sample data. – Wimpel Mar 24 '22 at 15:16
  • Thank you for even reading it. I heavily edited it to condense down a bunch of superflous data and change column names to be less confusing – Joe Crozier Mar 24 '22 at 18:53

3 Answers3

3

I'd suggest converting your logic to ifelse statement(s):

df$Calendar <- ifelse(df$CRA == 1 | df$ResearchNurse == 1, 1, df$Calendar)
df$Protocol <- ifelse(df$CRA == 1 | df$Regulatory == 1, 1, df$Protocol)
df$Subject <- ifelse(df$Regulatory == 1 | df$ResearchNurse == 1, 1, df$Subject)

df
#>   fname employee_number Calendar Protocol Subject CRA Regulatory Finance
#> 1 Linda     00000123456        0        1       1   0          1       0
#> 2   Bob          654321        0        1       1   0          1       1
#>   ResearchNurse
#> 1             0
#> 2             0

data:

df <- structure(list(
  fname = c("Linda", "Bob"),
  employee_number = c("00000123456", "654321"),
  Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
  CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
  ResearchNurse = c(0, 0)), row.names = c(NA, -2L), class = c("data.frame"))

Created on 2022-03-28 by the reprex package (v2.0.1)
Skaqqs
  • 4,010
  • 1
  • 7
  • 21
1

Both tables need a common look up value.

So for example in your df table there is a employee_number column. Do you have the same field in the jobs table? If so this is easy to do with left_join() and then a case_when()

You will need simplify your current jobs table to have some summary value of the logic you put in your post eg(if Bob has a 1 in regulatory then he should get a 1 in protocol and subject columns). This can be done with some table manipulation functions. I can't tell you exactly which ones because I don't fully understand the logic.

Assuming that is clear to you and you know how to summarize that jobs table (and you have the unique employee_number) for each row then the below should work.

left_join(x=df,y=jobs,by="employee_number") %>% 
  muate(new_col1=case_when(logic_1 ~ value1,
                             logic_2 ~ value2,
                             logic_3 ~ value3,
                             TRUE ~ default_value))

You can repeat the newcol logic for additional columns as required.

alejandro_hagan
  • 843
  • 2
  • 13
  • I must not have conveyed the question well. There is no employee number in the "jobs" table. The "df" table will be dozens of rows long with a lot of people in it. The "jobs" table is only a row or two and column or two bigger than what I shared. The general idea is that using the small "jobs" reference table I can fix the calendar, protocol, and subjects columns for the dozens of people in the df table. – Joe Crozier Mar 28 '22 at 12:10
  • hey, yep I think I understood your question. The question to you is how do you know which sections of the job tables belongs to which employee_number? – alejandro_hagan Mar 28 '22 at 19:45
1
library(tidyverse)

First, by pivoting both df and jobs, the task should become much easier

(df_long <- df %>% 
  pivot_longer(
    cols = -c(fname, employee_number), names_to = "term"
  ) %>% 
  filter(value == 1) %>% 
  select(-value))

#> # A tibble: 3 x 3
#>   fname employee_number term      
#>   <chr> <chr>           <chr>     
#> 1 Linda 00000123456     Regulatory
#> 2 Bob   654321          Regulatory
#> 3 Bob   654321          Finance

Now, if I understand your question correctly, Bob should have added “Protocol” and “Subject”in his survey because he works in “Finance”. Luckily, we can add that information for him automatically. We pivot jobs and clean up the names/terms to match those in df. This can be done like this:

(jobs_long <- jobs %>% 
  rename(
    CRA = `CRA/ Manager/ Senior`, ResearchNurse = `Research Nurse`
  ) %>% 
  mutate(
    roles = `Roles (existing)` %>% str_extract("^\\w+"),
    .keep = "unused"
  ) %>% 
  pivot_longer(-roles, names_to = "term") %>% 
  filter(value == 1) %>% 
  select(-value))

#> # A tibble: 6 x 2
#>   roles    term         
#>   <chr>    <chr>        
#> 1 Calendar CRA          
#> 2 Calendar ResearchNurse
#> 3 Protocol CRA          
#> 4 Protocol Regulatory   
#> 5 Subject  Regulatory   
#> 6 Subject  ResearchNurse

Once in this shape, we can join the two tables, do some tidying, and then we end up with the correct information. We could continue from here and wrangle the data back into the wide shape, but it’s probably more useful like this so that’s where I would stop.

df_long %>% 
  left_join(jobs_long, by = c("term" = "term")) %>% 
  pivot_longer(cols = c(term, roles), values_drop_na = TRUE) %>% 
  distinct(fname, employee_number, term = value)

#> # A tibble: 7 x 3
#>   fname employee_number term      
#>   <chr> <chr>           <chr>     
#> 1 Linda 00000123456     Regulatory
#> 2 Linda 00000123456     Protocol  
#> 3 Linda 00000123456     Subject   
#> 4 Bob   654321          Regulatory
#> 5 Bob   654321          Protocol  
#> 6 Bob   654321          Subject   
#> 7 Bob   654321          Finance

Created on 2022-03-31 by the reprex package (v1.0.0)

Peter H.
  • 1,995
  • 8
  • 26