0

I am trying to do a match in R regardless of the order of the columns.

Basically the problem I am trying to solve is that if all of the values in the columns of df2, from column 2-to the end, are found in df1 (after Partner), then match df1.

Here's the catch: disregard the last non-NA value in each row when doing this match but include it in the final output. So don't take the last non-NA value into account when matching but include it.

After the match, determine if that last non-na value exists in any of the columns with it's respective row.

df1

Partner    Col1  Col2  Col3  Col4  
      A      A1    A2    NA    NA  
      B      A2    B9    NA    NA  
      C      B7    V9    C1    N9    
      D      Q1    Q3    Q4    NA    

df2

lift  rule1  rule2  rule3  
  11     A2     A1     A9  
  10     A1     A3     NA  
  11     B9     A2     D7  
  10     Q4     Q1     NA
  11     A2     B9     B1

How do I match df1 with df2 so that the following happens:

1) Disregards the order of the columns found in both dataframes.

2) Then determine if the last non-na value exists in the row currently.

Final output:

df3

Partner    Col1  Col2  Col3  Col4  lift   rule1   rule2   rule3   EXIST?
      A      A1    A2    NA    NA    11      A2      A1      A9     YES
      A      A1    A2    NA    NA    10      A1      A3      NA    NOPE
      B      A2    B9    NA    NA    11      B9      A2      D7     YES
      B      A2    B9    NA    NA    11      A2      B9      B1     YES        
      D      Q1    Q3    Q4    NA    10      Q4      Q1      NA     YES
nak5120
  • 4,089
  • 4
  • 35
  • 94
  • Can you explain where did A4 in df3$Col3 came from? It's not in df1 nor df2 – Pierre Lapointe Mar 23 '17 at 19:38
  • Sorry about that, just edited it. – nak5120 Mar 23 '17 at 19:39
  • Thanks for taking a look into this. I know it seems pretty complicated... – nak5120 Mar 23 '17 at 19:41
  • Can you explain how A is matched twice in df3? – Pierre Lapointe Mar 23 '17 at 19:50
  • So if you disregard the last value in the rules, you have two situations: 1) A2 and A1 from rule1 and rule2 both occur for partner A. 2) A1 from rule1 occurs for partner A as well. I basically need to match up df2 with df1 **disregarding the final value in each row from df2**. Let me know if this makes sense. – nak5120 Mar 23 '17 at 19:52
  • Not sure I understand. Let me ask another question. In df1, A Col2 is A2. Why doesn't it match rule 2, 3rd row in df2 (A2 also)? – Pierre Lapointe Mar 23 '17 at 20:00
  • Completely take that back, yup that should be in their also. Completely my bad...making the change – nak5120 Mar 23 '17 at 20:04
  • Sorry, I do not understand. I would suggest making a minimal example. You still have a lot of unrelevant data in your example. Keep it simple. Also, in your question, I would describe in detail why some data are matched and some are not. – Pierre Lapointe Mar 23 '17 at 20:10
  • I just made the change, let me know if it makes more sense – nak5120 Mar 23 '17 at 20:11
  • I revised it again to make the data simpler as well. – nak5120 Mar 23 '17 at 20:55
  • Ok, that's better. Additional question: In theory, (not so in your example would be helpful to know), could D be matched 3 times because it has three non-na Cols? – Pierre Lapointe Mar 23 '17 at 21:08
  • In theory, for this case, yes only 3 is possible. But if I were to be another variable in rule3 for that row, then there would be more possibilities. – nak5120 Mar 23 '17 at 21:18
  • One last question, do you have only 4 Cols and 3 rules? I'm asking because a large number of columns to match would rule out a combination solution – Pierre Lapointe Mar 23 '17 at 21:32
  • There's 20 rules in the actual dataset – nak5120 Mar 23 '17 at 21:34
  • @PLapointe let me know if you need me to supply any other information for this. Thanks for taking the time to look into this, I really appreciate it – nak5120 Mar 23 '17 at 23:39

1 Answers1

1

I get one more B match than you, but this solution is very close to what you want. You first have to add an id column as we use it to reconstruct the data. Then to perform the match, you first need to melt it with gather from tidyr and use inner_join from dplyr. We then cbind using the ids and the original data.frames.

    library(tidyr);library(dplyr)

df1 <- read.table(text="Partner    Col1  Col2  Col3  Col4
A      A1    A2    NA    NA
B      A2    B9    NA    NA
C      B7    V9    C1    N9
D      Q1    Q3    Q4    NA",header=TRUE, stringsAsFactors=FALSE)


df2 <- read.table(text="lift  rule1  rule2  rule3
  11     A2     A1     A9
  10     A1     A3     NA
  11     B9     A2     D7
  10     Q4     Q1     NA
  11     A2     B9     B1",header=TRUE, stringsAsFactors=FALSE)

df1 <- cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)

#melt with gather
d11  <- df1 %>% gather(Col, Value,starts_with("C"))           #Long
d11 <- d11 %>% na.omit() %>%group_by(df1_id) %>% slice(-n()) #remove last non NA

d22  <- df2 %>%  gather(Rule, Value,starts_with("r"))         #Long

res <- inner_join(d11,d22)

cbind(df1[res$df1_id,],df2[res$df2_id,])

    df1_id Partner Col1 Col2 Col3 Col4 df2_id lift rule1 rule2 rule3
1        1       A   A1   A2 <NA> <NA>      2   10    A1    A3  <NA>
1.1      1       A   A1   A2 <NA> <NA>      1   11    A2    A1    A9
2        2       B   A2   B9 <NA> <NA>      1   11    A2    A1    A9
2.1      2       B   A2   B9 <NA> <NA>      5   11    A2    B9    B1
2.2      2       B   A2   B9 <NA> <NA>      3   11    B9    A2    D7
4        4       D   Q1   Q3   Q4 <NA>      4   10    Q4    Q1  <NA>
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56
  • YOU ARE THE MAN! Have one last question for this, how do you determine if the last non-na value exist in that row? To populate with YES or NOPE. – nak5120 Mar 24 '17 at 15:12
  • I'll ask this in a separate question. Can't thank you enough – nak5120 Mar 24 '17 at 15:17
  • It's actually not doing what I wanted looking closer into this but will post a different question – nak5120 Mar 24 '17 at 17:06