0

I have a large pandas dataframe ( 10 million records) shown below (snapshot) :

CID   Address
100   22 park street springvale nsw2655 
101   U111 28 james road, Vic 2755 
102   22 park st. springvale, nsw-2655
103   29 Bino Avenue , Mac - 3990
104   Unit 111 28 James rd, Vic 2755 
105   Unit 111 28 James rd, Victoria 2755

I want to self-join with the same dataframe to get a list of matching CID (Customer IDs) having the same/similar addresses in a pandas dataframe.

I have tried using fuzzywuzzy but it's taking long time just to find the matches

Expected Output :

CID   Address
100   [102]  
101   [104,105]
102   [100]
103   
104   [101,105]
105   [101,104]

what is the best way to solve this ?

A DUBEY
  • 806
  • 6
  • 20
  • 2
    With such dirty data, it's doubtful you'll find an approach much better than fuzzywuzzy, unless there's some way to standardize your addresses to all have the same format. – BeRT2me May 09 '22 at 04:49
  • 1
    Never used it, but this might be a useful tool: [au-address-parser](https://au-addr-parser.readthedocs.io/en/latest/) – BeRT2me May 09 '22 at 04:53
  • This is unanswerable until you define "same/similar". PS After you define it, join on it. So, what is the problem? PS "best" means nothing until you define it. After you define it, where & how are you stuck applying/using/doing it? [mre] [ask] [Help] – philipxy May 09 '22 at 05:40
  • I just needed some guidance on how to approach this and Bert already provided with the answer I needed, so all good. I will just use a parser to standardize my address and then join on it ; thanks @BeRT2me – A DUBEY May 09 '22 at 06:00
  • 1
    You don't really even need to join on it, just `df.groupby()['CID'].agg(list)` – BeRT2me May 09 '22 at 06:03

0 Answers0