I have a large pandas DataFrame (10 million records); a snapshot is shown below:
CID Address
100 22 park street springvale nsw2655
101 U111 28 james road, Vic 2755
102 22 park st. springvale, nsw-2655
103 29 Bino Avenue , Mac - 3990
104 Unit 111 28 James rd, Vic 2755
105 Unit 111 28 James rd, Victoria 2755
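For reproducibility, the snapshot above can be rebuilt as a small DataFrame. This is only a minimal sketch: the column names `CID` and `Address` are taken from the header row, and the variable name `df` is an assumption.

```python
import pandas as pd

# Six sample rows copied from the snapshot; the real frame has 10 million records.
df = pd.DataFrame(
    {
        "CID": [100, 101, 102, 103, 104, 105],
        "Address": [
            "22 park street springvale nsw2655",
            "U111 28 james road, Vic 2755",
            "22 park st. springvale, nsw-2655",
            "29 Bino Avenue , Mac - 3990",
            "Unit 111 28 James rd, Vic 2755",
            "Unit 111 28 James rd, Victoria 2755",
        ],
    }
)

print(df)
```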
I want to self-join the dataframe to get, for each CID (customer ID), a list of the other CIDs that have the same or a similar address, as a pandas DataFrame.
I have tried using fuzzywuzzy, but it takes a long time just to find the matches.
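To illustrate why an all-pairs fuzzy comparison is slow, here is a rough sketch of the brute-force approach. It is not the actual code that was run: the standard library's `difflib.SequenceMatcher` stands in for a fuzzywuzzy ratio, and the function names and the threshold of 80 are hypothetical choices for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations

# (CID, address) pairs copied from the snapshot.
records = [
    (100, "22 park street springvale nsw2655"),
    (101, "U111 28 james road, Vic 2755"),
    (102, "22 park st. springvale, nsw-2655"),
    (103, "29 Bino Avenue , Mac - 3990"),
    (104, "Unit 111 28 James rd, Vic 2755"),
    (105, "Unit 111 28 James rd, Victoria 2755"),
]


def similarity(a, b):
    """Return a 0-100 score; difflib is a stand-in for a fuzzywuzzy ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def brute_force_matches(rows, threshold):
    """Compare every pair of rows once and record matches in both directions."""
    matches = {cid: [] for cid, _ in rows}
    comparisons = 0
    for (cid_a, addr_a), (cid_b, addr_b) in combinations(rows, 2):
        comparisons += 1
        if similarity(addr_a, addr_b) >= threshold:
            matches[cid_a].append(cid_b)
            matches[cid_b].append(cid_a)
    return matches, comparisons


def pair_count(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


# The threshold of 80 is an arbitrary cutoff chosen for this sketch.
matches, comparisons = brute_force_matches(records, threshold=80)
print(comparisons)             # 15 pairs for 6 rows
print(pair_count(10_000_000))  # 49999995000000 pairs for 10 million rows
```

The number of comparisons grows quadratically with the number of rows, so at 10 million records the pair count is roughly 5 x 10^13, regardless of how fast each individual string comparison is.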
Expected output:
CID Address
100 [102]
101 [104,105]
102 [100]
103
104 [101,105]
105 [101,104]
What is the best way to solve this?