Matching words with more than two number of words in different data frame in r

Question

I have two data frames DF1 and DF2 like this.

ID = c(1, 2, 3, 4) 
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.frame(ID, Issues, Location, Customer)

Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')  
DF2 = data.frame(Root_Cause, List_of_Issues)

I want to compare both the data frames with "Issues" of DF1 and "List_of_Issues" of DF2 and if more than two words in "Issues" column is there in "List_of_Issues" column in DF2, then I want to populate subsequent "Root_Cause" from DF2. My resulting data frame should look like DF3.

ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
Root_Cause = c('R2', 'R4', NA, 'R1')
DF3 = data.frame(ID, Issues, Location, Customer, Root_Cause)

Please provide minimal and reproducible example(s) along with the desired output. Use `dput()` for data and specify all non-base packages with `library()` calls. Do not embed pictures for data or code, use indented code blocks instead. — s_baldur, Oct 16 '19 at 09:19
Yes i have done it. Sorry for that. I am new to stack overflow. — Partha sarathi, Oct 17 '19 at 04:05
@Shiriam Well done. I tried to improve the formatting a bit. Now let's see if someone can help you out. — s_baldur, Oct 17 '19 at 08:20

JDG · Accepted Answer · 2019-10-21T08:28:36.090

0

Using data.table:

EDIT: I have edited your sample data to account for multi-root cause eventualities. In this data, ID==1 corresponds to both R2 and R3.

Data

ID = c(1, 2, 3, 4) 
Issues = c('Issue1, Issue4, Issue6, Issue7', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.table(ID, Issues, Location, Customer)

Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')  
DF2 = data.table(Root_Cause, List_of_Issues)

Code

DF1[, Issues := strsplit(Issues, split = ', ')]
DF2[, List_of_Issues := strsplit(List_of_Issues, split = ', ')]

DF1[, RootCause := lapply(Issues, function(x){

  matchvec = sapply(DF2[, List_of_Issues], function(y) length(unlist(intersect(y, x))))
  ids = which(matchvec > 1)
  str = DF2[, paste(Root_Cause[ids], collapse = ', ')]

  ifelse(str == '', NA, str)

})]

Result

> DF1
   ID                      Issues Location Customer RootCause
1:  1 Issue1,Issue4,Issue6,Issue7        x        a    R2, R3
2:  2        Issue2,Issue5,Issue6        y        b        R4
3:  3               Issue3,Issue4        z        c        NA
4:  4               Issue1,Issue5        w        d        R1

edited Oct 21 '19 at 08:28

answered Oct 17 '19 at 09:23

JDG

1,342
8
18

`DF2 = DF2[, List_of_Issues := strsplit(List_of_Issues, split = ', ')]` Hope we are converting List_of_Issues column to vector here. – Partha sarathi Oct 18 '19 at 06:36
We're splitting each combined string of issues into separate strings. – JDG Oct 18 '19 at 06:43
Thank you so much. Solution is working fine and it was very helpful – Partha sarathi Oct 21 '19 at 04:49
`DF1 = data.table( ID = c(1, 2, 3, 4), Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6, Issue7', 'Issue3, Issue4', 'Issue1, Issue5'), Location = c('x', 'y', 'z', 'w'), Customer = c('a', 'b', 'c', 'd') ) DF2 = data.table( Root_Cause = c('R1', 'R2', 'R3', 'R4'), List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6') )` Suppose i have more than one root cause for a single row of Issues as mentioned above. Is it possible to list the root cause by comma separated string of Root_cause – Partha sarathi Oct 21 '19 at 05:14
I have now edited the answer to account for multiple root causes. – JDG Oct 21 '19 at 08:29

Matching words with more than two number of words in different data frame in r

1 Answers1