0

I have the following data set:

df <- data.frame(identifier = c("a","b","b","c"), 
                    disease = c("heart, lung","lung, heart,,","lung, heart, heart, liver", "kidney, brain "))

which gives:

  identifier                   disease
1          a               heart, lung
2          b             lung, heart,,
3          b lung, heart, heart, liver
4          c            kidney, brain

I want to be able to then go through the diseases, and for every condition create a new column. If the disease is present for a specific identifier, I want to then put a "yes" in that column. So the ideal output would be:

  identifier heart lung liver kidney brain
1          a   Yes  Yes    No     No    No
2          b   Yes  Yes    No     No    No
3          b   Yes  Yes   Yes     No    No
4          c    No   No    No    Yes   Yes

Would greatly appreciate any help with this as it has stumped me for a couple of hours now :)

Ahmed
  • 11
  • 3
  • Here’s a very similar question from earlier today: [Trouble converting a list column into factor in R](https://stackoverflow.com/questions/74353196/trouble-converting-a-list-column-into-factor-in-r) – zephryl Nov 08 '22 at 04:14

2 Answers2

1

After minor clean up of the input data, we can use separate_rows and pivot_wider to count the number of occurrences. They are from the tidyr library, part of the larger tidyverse package system.

df <- data.frame(identifier = c("a","b","b","c"), 
                 disease = c("heart, lung","lung, heart","lung, heart, heart, liver", "kidney, brain"))

library(tidyverse)
df %>% 
  separate_rows(disease) %>% 
  mutate(myvalues="yes") %>% 
  pivot_wider(names_from="disease", values_from = myvalues, values_fn=first, values_fill = "no")

#  identifier heart lung  liver kidney brain
#1 a          yes   yes   no    no     no   
#2 b          yes   yes   yes   no     no   
#3 c          no    no    no    yes    yes  

Alternatively, if you want to count the results:

df %>% 
  separate_rows(disease) %>% 
  mutate(myvalues=1) %>% 
  pivot_wider(names_from="disease", values_from = myvalues, values_fn=sum, values_fill=0)

#  identifier heart  lung liver kidney brain
#1 a              1     1     0      0     0
#2 b              3     2     1      0     0
#3 c              0     0     0      1     1

An alternative is from the janitor library tabyl() function:

library(janitor)
df %>% 
  separate_rows(disease) %>% 
  tabyl(identifier, disease)
M.Viking
  • 5,067
  • 4
  • 17
  • 33
0

You're going to need to do some cleaning of the df, getting rid of the commas

then one way to solve this is use %in% with ifelse and lapply

cleaned <- strsplit(gsub(",\\s?", " ", df$disease), "\\s\\s?")

unique_cleaned <- unique(unlist(cleaned))

cbind(df[1], do.call(rbind.data.frame, 
                     lapply(cleaned, \(x) ifelse(unique_cleaned %in% x, "yes", "no"))) |> 
  setNames(unique_cleaned)) 

  identifier heart lung liver kidney brain
1          a   yes  yes    no     no    no
2          b   yes  yes    no     no    no
3          b   yes  yes   yes     no    no
4          c    no   no    no    yes   yes
Just James
  • 1,222
  • 2
  • 7