-1

My data is something like below

# dummy data 
ID = c(1,2,3,4,5,6,7,8,9,10,11,12)
addrs = c("3 xx road sg" , "4 yy road sg" , "5 apt 04-3 sg" , "Bung 2 , kl road sg","4 yy road sg" , "3 xx road sg" ,"Bung 2 , kl road sg" ,"5 apt 04-3 sg","3 xx road sg","Bung 2 , sg kl road","3xx Road sg","4 yy sg")
data.1=data.table(ID,addrs)

data looks like

   ID               addrs
1:  1        3 xx road sg
2:  2        4 yy road sg
3:  3       5 apt 04-3 sg
4:  4 Bung 2 , kl road sg
5:  5        4 yy road sg
6:  6        3 xx road sg
7:  7 Bung 2 , kl road sg
8:  8       5 apt 04-3 sg
9:  9        3 xx road sg

i want to get matching combinations ( based on addrs ) output required is ( only example for "3 xx road sg") - if Addr matches for A and B , table should have A-B - Match and B-A-Match

ID.1 ID.2 Match.1        Match.2      Accuracy  
 1    6   3 xx road sg   3 xx road sg   100%
 1    9   3 xx road sg   3 xx road sg   100% 
 6    9   3 xx road sg   3 xx road sg   100%
 9    6   3 xx road sg   3 xx road sg   100% 
 9    1   3 xx road sg   3 xx road sg   100%
 6    1   3 xx road sg   3 xx road sg   100% 

showing output where the text may differ by spaces , order of characters , or characters

ID.1   ID.2    Match.1                Match.2              Accuracy 
1       11    3 xx road sg            3xx Road sg            100 %
2       12    4 yy road sg            4 yy sg                 70 %    
4       10    Bung 2 , kl road sg    Bung 2 , sg kl road      100 %

Any further inputs on how to deal with the text matching when the data may be similar but written differently ?

Learner_seeker
  • 544
  • 1
  • 4
  • 21

1 Answers1

1
r <- merge(data.1, data.1, by="addrs", all=T, suffixes = c(".1",".2"))
r[r$ID.1 != r$ID.2,]

                  addrs ID.1 ID.2
2         3 xx road sg    1    6
3         3 xx road sg    1    9
4         3 xx road sg    6    1
6         3 xx road sg    6    9
7         3 xx road sg    9    1
8         3 xx road sg    9    6
11        4 yy road sg    2    5
12        4 yy road sg    5    2
15       5 apt 04-3 sg    3    8
16       5 apt 04-3 sg    8    3
19 Bung 2 , kl road sg    7    4
20 Bung 2 , kl road sg    4    7
Jean
  • 1,480
  • 15
  • 27
  • Thanks , any further suggestions when the data isn't so clean and text may be written differently. Any match which basis on text character match with a threshold say 90 % of character matches – Learner_seeker Aug 01 '17 at 09:42
  • even if the characters are in different order? You can look at computing levenshtein distance – Jean Aug 01 '17 at 09:43
  • yes particularly then. basically the column is address which has a block , road name , society name , city . Now all this text is concatenated . i can always break it into the 4 columns and then use the same merge with 4 columns . Does that make sense ? – Learner_seeker Aug 01 '17 at 09:44
  • Please add expected output in your question for your scenario where text is similar but not same. I can give it a go then – Jean Aug 01 '17 at 09:47
  • i have added some rows to the data with examples of text differing by space , order or missing characters . Showed output in table 2 with a generic accuracy just for understanding where we can put what you suggest levenshtein distance . – Learner_seeker Aug 01 '17 at 10:12