Finding Matching ID's based on similar string match

Question

I have a large pandas dataframe ( 10 million records) shown below (snapshot) :

CID   Address
100   22 park street springvale nsw2655 
101   U111 28 james road, Vic 2755 
102   22 park st. springvale, nsw-2655
103   29 Bino Avenue , Mac - 3990
104   Unit 111 28 James rd, Vic 2755 
105   Unit 111 28 James rd, Victoria 2755

I want to self-join with the same dataframe to get a list of matching CID (Customer IDs) having the same/similar addresses in a pandas dataframe.

I have tried using fuzzywuzzy but it's taking long time just to find the matches

Expected Output :

CID   Address
100   [102]  
101   [104,105]
102   [100]
103   
104   [101,105]
105   [101,104]

what is the best way to solve this ?

With such dirty data, it's doubtful you'll find an approach much better than fuzzywuzzy, unless there's some way to standardize your addresses to all have the same format. — BeRT2me, May 09 '22 at 04:49
Never used it, but this might be a useful tool: [au-address-parser](https://au-addr-parser.readthedocs.io/en/latest/) — BeRT2me, May 09 '22 at 04:53
This is unanswerable until you define "same/similar". PS After you define it, join on it. So, what is the problem? PS "best" means nothing until you define it. After you define it, where & how are you stuck applying/using/doing it? [mre] [ask] [Help] — philipxy, May 09 '22 at 05:40
I just needed some guidance on how to approach this and Bert already provided with the answer I needed, so all good. I will just use a parser to standardize my address and then join on it ; thanks @BeRT2me — A DUBEY, May 09 '22 at 06:00
You don't really even need to join on it, just `df.groupby()['CID'].agg(list)` — BeRT2me, May 09 '22 at 06:03

Finding Matching ID's based on similar string match

0 Answers0