2

I have two data frames D1 and D2 as mentioned below I want to compare D1 and D2 like "Issues" of D1 with "list_of_issues" of D2

ID = c(1, 2, 3, 4)
Issues = c('CI', 'E,CI', 'H,A', 'CI,E')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
D1 = data.frame(ID, Issues, Location, Customer)

Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('A', 'A,H', 'E,CI', 'CI')  
D2 = data.frame(Root_Cause, List_of_Issues)

if issues are matching with the root cause then obtain the Root_cause in D3 by generating a column Root_Cause like this in D3

ID = c(1, 2, 3, 4)
Issues = c('CI', 'E,CI', 'H,A', 'CI,E')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
Root_Cause = c('R4', 'R3', 'R2', 'R3')
D3 = data.frame(ID, Issues, Location, Customer, Root_Cause)

1 Answers1

0

Here's a quick-fix using data.table and setkey (from data.table).

library(data.table) 
D1 <- as.data.table(D1)
setkey(D1, 'Issues') 

D2 <- as.data.table(D2) 
setkey(D2, 'List_of_Issues)

> D1
   ID Issues Location Customer
1:  1     CI        x        a
2:  4   CI,E        w        d
3:  2   E,CI        y        b
4:  3    H,A        z        c
> D2
   Root_Cause List_of_Issues
1:         R1              A
2:         R2            A,H
3:         R4           CI,E
4:         R3           E,CI

# Using the key to get the desired output
> D2[D1, ]
   Root_Cause List_of_Issues ID Location Customer
1:       <NA>             CI  1        x        a
2:         R4           CI,E  4        w        d
3:         R3           E,CI  2        y        b
4:       <NA>            H,A  3        z        c

Setting a key allows the data.table to be quickly sorted on that column. Since data.table works by reference, this makes the whole process really fast. The last line of code matches the entries of the key of D1 with the key of D2 and returns the Root_Cause for matching key entries in both tables.

It can be sorted by ID to get the desired output:

> D2[D1, ][order(ID)]
   Root_Cause List_of_Issues ID Location Customer
1:       <NA>             CI  1        x        a
2:         R3           E,CI  2        y        b
3:       <NA>            H,A  3        z        c
4:         R4           CI,E  4        w        d

In your example, the Root_Cause column of D3 is c(NA, 'R3', 'R2', 'R3') which is incorrect - by default A,H is not the same as H,A which is why R2 would not be matched. Secondly, CI,E matches R4 in D2 from your sample data.

You can always drop columns by setting them to NULL e.g. Customer := NULL and reorder columns by using setcolorder().

Gautam
  • 2,597
  • 1
  • 28
  • 51
  • Thanks for the inputs. I have edited my question a little bit. My desired output should be like if "Issues" column contains either 'CI,E' or 'E,CI' my root cause should be R3. – Partha sarathi Nov 03 '19 at 06:41