0

I need merge two datasets

df1

df1=structure(list(id = structure(c(1L, 4L, 5L, 6L, 2L, 3L), .Label = c("195/75 R16C-Tire CORDIANT Business CA", 
"215/75 R17,5-Tires KAMA NR-201 driving axle", "235/70 R16-Tire KAMA-221", 
"275/70 R22,5-Tire TYREX ALL STEEL VC-1 (Я-646)", "315/80 R22,5-Tire TYREX ALL STEEL DR-1 driving axle", 
"315/80 R22,5-Tire TYREX ALL STEEL FR-401 steering axle"), class = "factor")), .Names = "id", class = "data.frame", row.names = c(NA, 
-6L))

df2

 df2= structure(list(id = structure(c(2L, 4L, 5L, 6L, 3L, 1L), .Label = c("Auto-cutting 245 / 70R16 K-214", 
    "Auto-rubber 195/75 R16C Cordiant Business CA 107 / 105R all-season", 
    "Auto-rubber 215 / 75R17,5 K-166", "Auto-rubber 275 / 70R22,5 (11 / 70R22,5) I-646 (Tyrex all steel VC-1)", 
    "Auto-rubber 315 / 80R22,5 DR-1Tyrex All Steel (Я-636)", "Auto-rubber 315 / 80R22,5 FR-401 Tyrex All Steel (Я-626)"
    ), class = "factor")), .Names = "id", class = "data.frame", row.names = c(NA, 
    -6L))

I use fuzzylogic

library("RecordLinkage")
    #get weights
rpairs_jar <- compare.linkage(df1, df2,
                              strcmp = c("id"),
                              strcmpfun = jarowinkler)

rpairs_epiwt <- epiWeights(rpairs_jar)
#get wright to data frame
b=rpairs_epiwt$pairs
View(b)

On output i see

enter image description here

we have weights between all id. For example, the weight id1 is calculated with respect to all 6 denominations. But we see that the greatest weight between the first item of df1(id1) and the first item of df2 (id1) (0,61).

What about second item(id2 of df1) the greatest weight between third item(id3 of df2)(0.58).

How to leave only those comparisons, between the id of which are the greatest weight?

I.E on output, we have table not with 36 entries, but six

id1 id2     id
1   1   0,6106743
2   3   0,5994314
3   3   0,5874915
4   4   0,6288133
5   4   0,5552018
6   6   0,5642857
psysky
  • 3,037
  • 5
  • 28
  • 64

1 Answers1

1

I would do something like this, although I'm not sure that I've understood your question entirely. Ah, I've just found it: your data is different than in the picture. Here are the first 12 rows that I get:

   id1 id2        id is_match
1    1   1 0.6106743       NA
2    1   2 0.5014481       NA
3    1   3 0.4869703       NA
4    1   4 0.4752647       NA
5    1   5 0.4671400       NA
6    1   6 0.4358859       NA
7    2   1 0.4428541       NA
8    2   2 0.5752810       NA
9    2   3 0.6090623       NA
10   2   4 0.5946931       NA
11   2   5 0.5320353       NA
12   2   6 0.5055347       NA

If you're after the largest id for each id1, here's what I'd do:

library(tidyverse)

b %>% 
  group_by(id1) %>% 
  summarise(maxId1 = max(id)) %>% 
  left_join(b, by = c("maxId1" = "id", "id1")) %>% 
  select(id1, id2, "id" = "maxId1")

And this is what it yields (which isn't what you've said above, given the different data, but I think is what you seek):

# A tibble: 6 x 3
    id1   id2        id
  <int> <int>     <dbl>
1     1     1 0.6106743
2     2     3 0.6090623
3     3     3 0.5837645
4     4     4 0.6249495
5     5     5 0.5889200
6     6     6 0.5642857

I hope this helps you.

p0bs
  • 1,004
  • 2
  • 15
  • 22