I am trying to match the left and tight addresses (from separate tables) on a reference index (coClean) which I created with the following formula in #Python #JupyterNotebook
import pandas as pd
df1=pd.read_csv("/content/Addmatchdf1.csv")
df2=pd.read_csv("/content/Addmatchdf2.csv")
import re
def cleanAddress(series):
return series.str.lower().str.replace(r"[a-z\s\,]","")
df1["coClean"]=cleanAddress(df1["Address"])
df1["coClean"]=cleanAddress(df1["Address"])
df = pd.merge(df1, df2,
on =['coClean'],
how ='inner')
This produces a coClean as a reference index.
Address_x | coClean | Address_y |
---|---|---|
7 Pindara Bvd LANGWARRIN VIC 3910 | 73910 | 7 Pindara Blv, Langwarrin, VIC 3910 |
2a Manor St BACCHUS MARSH VIC 3340 | 23340 | 2a Manor Street, BACCHUS MARSH, VIC 3340 |
38 Sommersby Rd POINT COOK VIC 3030 | 383030 | 38 Sommersby Road, Point Cook, VIC 3030 |
17 Moira Avenue, Carnegie, Vic 3163 | 173163 | 17 Moira Avenue, Carnegie, Vic 3163 |
17 Moira Avenue, Carnegie, Vic 3163 | 173163 | 17 Newman Avenue, Carnegie, VIC 3163 |
17 Moira Avenue, Carnegie, Vic 3163 | 173163 | 17 Maroona Rd, Carnegie VIC 3163 |
The problem I am facing is, apparently, some addresses under the same postcode have the same house number. But than the joining becomes difficult as the reference index is same.
How can i revise this function so that the reference index contains only
a. the house numbers
b. first four letters
c. postcode
So a new reference for '23340'(2a manor street bacchus marsh vic 3340) becomes '2aman3340'? So a list is returned like:
coClean |
---|
7pind3910 |
2aman3340 |
38somm3030 |
17moir3163 |
17newm3163 |
17maroo3163 |
I attempted to revise the function to include all letters and numbers
def cleanAddress(series):
return series.str.lower().str.replace(r"[^a-z\d]","")
But including all letters will not solve the problem as different tables contain street as st. and road as rd. Hence, a better strategy is to rely on house numbers and postcodes with some initial letters.
Thank you for your kind suggestions.
Update: I replaced
def cleanAddress(series):
return series.str.lower().str.replace(r"[a-z\s\,]","")
df1["coClean"]=cleanAddress(df1["Address"])
with
def cleanAddress(series):
coclen=""
number_of_letters=0
if series:
for i in range(len(series)):
if series[i].isnumeric():
coclen+=series[i]
elif series[i].isalpha():
number_of_letters+=1
coclen+=series[i]
if number_of_letters==4:
break
for i in range(i,len(series)):
if series[i].isnumeric():
coclen+=series[i]
return coclen
This returns an error while I execute
cleanAddress(df1["Address"])
The full error is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-b653a19f5638> in <module>()
----> 1 df1["coClean"]=cleanAddress(df1["Address"])
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
1328 def __nonzero__(self):
1329 raise ValueError(
-> 1330 f"The truth value of a {type(self).__name__} is ambiguous. "
1331 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
1332 )
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().