
I have two large data frames (500k rows each) from two separate sources without a shared key. Instead of merging on a key, I want to merge the two data frames by matching other columns, such as age and amount. The two data frames do not match perfectly, so some rows will not match; I will simply remove those later.

The data could look something like this.

[Image: example of the two data frames, where the row with Key 1 = XXX1 matches the row with Key 2 = YYY3 on age and amount.]

So, in the example above I want to create a table matching Key 1 and Key 2. In the picture above we can see that XXX1 and YYY3 are a match. From there I would like to create a data frame like:

[Key 1] [Key 2]
XXX1    YYY3
XXX2    N/A
XXX3    N/A

I know how to do this in Excel, but due to the large amount of data it simply crashes. I want to focus on R, but for what it is worth, this is how I built it in Excel (the idea is to first do a VLOOKUP, and then use INDEX as a VLOOKUP to get the second match if the first one does not match both criteria):

=IF(P2=0;IFNA(VLOOKUP(L2;B:C;2;FALSE);VLOOKUP(L2;G:H;2;FALSE));IF(O2=Q2;INDEX($A$2:$A$378300;SMALL(IF($L2=$B$2:$B378300;ROW($B$2:$B$378300)-ROW($B$2)+1);2));0))

And this is the approach made in R:

for (i in 1:nrow(df_1)) {
  df_1$Key1[i] <- NA  # default if no match is found
  for (j in 1:nrow(df_2)) {
    # match on age and on any of the three amount columns
    if (df_1$pc_age[i] == df_2$pp_age[j] &&
        df_1$amount[i] %in% c(df_2$amount1[j], df_2$amount2[j], df_2$amount3[j])) {
      df_1$Key1[i] <- df_2$Key2[j]
      break  # keep the first match
    }
  }
}

The problem is that this takes way, way too long. Is there a more efficient way to map this data as well as possible?

Thanks!

Raz89

2 Answers


Create dummy key columns in both data frames, such as (shown here for df1):

 for(i in 1:nrow(df1)){
   df1$key1[i] <- paste0("X_", i)
 }

Similarly for df2 with keys Y_1 ... Y_n, and then join both data frames using merge() on the columns age and amount. Concatenate key1 and key2 in a new column of the merged data frame; you will get your desired data frame directly.
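
A minimal sketch of that merge step, assuming the column names from the question (pc_age and amount in df1, pp_age and amount1 in df2 - adjust to your real names). Note that a plain merge like this only matches against one of df2's amount columns:

# dummy keys as described above, built without a loop
df1$key1 <- paste0("X_", seq_len(nrow(df1)))
df2$key2 <- paste0("Y_", seq_len(nrow(df2)))

# all.x = TRUE keeps unmatched df1 rows, so their key2 becomes NA
merged <- merge(df1, df2,
                by.x = c("pc_age", "amount"),
                by.y = c("pp_age", "amount1"),
                all.x = TRUE)

key_table <- merged[, c("key1", "key2")]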

adjustedR2
  • Thank you, I did something similar but was able to skip the for loop altogether by using the solution suggested by @FabioMarroni. – Raz89 Mar 01 '19 at 03:49

Could the following code work for you?

# create random data
set.seed(123)
df1 <- data.frame(
  key_1=as.factor(paste("xxx",1:100,sep="_")),
  age = sample(1:100,100,replace=TRUE),
  amount = sample(1:200,100))

df2 <- data.frame(
  key_1=paste("yyy",1:500,sep="_"),
  age = sample(1:100,500,replace=TRUE),
  amount_1 = sample(1:200,500,replace=TRUE),
  amount_2 = sample(1:200,500,replace=TRUE),
  amount_3 = sample(1:200,500,replace=TRUE))
# ensure at least three matching rows
df2[10,2:3]    <- df1[1,2:3]
df2[20,c(2,4)] <- df1[2,2:3]
df2[30,c(2,5)] <- df1[3,2:3]
# define comparison with df2
comp2df2 <- function(x){
  ageComp <- df2$age == as.numeric(x[2])
  if(!any(ageComp)){
    return(NaN)
  }
  amountComp <- apply(df2,1,function(a) as.numeric(x[3]) %in% as.numeric(a[3:5]))
  if(!any(amountComp)){
    return(NaN)
  }
  matchIdx <- ageComp & amountComp
  if(sum(matchIdx) > 1){
    warning("multible match detected first match is taken\n")
  }
  return(which(matchIdx)[1])
}
# run match
matchIdx <- apply(df1,1,comp2df2)
# merge
df_new <- cbind(df1[!is.na(matchIdx),],df2[matchIdx[!is.na(matchIdx)],])
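
If you also want the Key 1 / Key 2 table from the question, with NA for the unmatched df1 rows, something like this should work with the objects above (an untested sketch):

# one row per df1 key; a NaN index in matchIdx yields NA in key_2
key_pairs <- data.frame(
  key_1 = df1$key_1,
  key_2 = df2$key_2[matchIdx]
)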

I didn't have time to test it on really big data, but I guess this should be faster than your two for loops... To further speed things up you could delete the

if(sum(matchIdx) > 1){
    warning("multible match detected first match is taken\n")
  }

lines, if you are not worried about one row matching several others.

HolgerBarlt
  • I believe that this would work, thank you! Do you see parts where this suggestion is superior to the answer given by @FabioMarroni (which I ended up implementing)? – Raz89 Mar 01 '19 at 03:50
  • Depends. If you only want to merge by df1$amount and df2$amount2, the solution of @FabioMarroni seems quite optimal to me, but you get a different result if you want to merge df1$amount by fitting it to at least one of the columns df2$amount_[1-3], like you did in your question... – HolgerBarlt Mar 05 '19 at 09:22