1

I am trying to match the left and tight addresses (from separate tables) on a reference index (coClean) which I created with the following formula in #Python #JupyterNotebook

import pandas as pd
df1=pd.read_csv("/content/Addmatchdf1.csv")
df2=pd.read_csv("/content/Addmatchdf2.csv")

import re
def cleanAddress(series):
return series.str.lower().str.replace(r"[a-z\s\,]","")
df1["coClean"]=cleanAddress(df1["Address"])

df1["coClean"]=cleanAddress(df1["Address"]) 
df = pd.merge(df1, df2, 
                      on =['coClean'], 
                      how ='inner') 

This produces a coClean as a reference index.

Address_x coClean Address_y
7 Pindara Bvd LANGWARRIN VIC 3910 73910 7 Pindara Blv, Langwarrin, VIC 3910
2a Manor St BACCHUS MARSH VIC 3340 23340 2a Manor Street, BACCHUS MARSH, VIC 3340
38 Sommersby Rd POINT COOK VIC 3030 383030 38 Sommersby Road, Point Cook, VIC 3030
17 Moira Avenue, Carnegie, Vic 3163 173163 17 Moira Avenue, Carnegie, Vic 3163
17 Moira Avenue, Carnegie, Vic 3163 173163 17 Newman Avenue, Carnegie, VIC 3163
17 Moira Avenue, Carnegie, Vic 3163 173163 17 Maroona Rd, Carnegie VIC 3163

The problem I am facing is, apparently, some addresses under the same postcode have the same house number. But than the joining becomes difficult as the reference index is same.

How can i revise this function so that the reference index contains only

a. the house numbers
b. first four letters
c. postcode

So a new reference for '23340'(2a manor street bacchus marsh vic 3340) becomes '2aman3340'? So a list is returned like:

coClean
7pind3910
2aman3340
38somm3030
17moir3163
17newm3163
17maroo3163

I attempted to revise the function to include all letters and numbers

def cleanAddress(series):
return series.str.lower().str.replace(r"[^a-z\d]","")

But including all letters will not solve the problem as different tables contain street as st. and road as rd. Hence, a better strategy is to rely on house numbers and postcodes with some initial letters.

Thank you for your kind suggestions.

Update: I replaced

def cleanAddress(series):
return series.str.lower().str.replace(r"[a-z\s\,]","")
df1["coClean"]=cleanAddress(df1["Address"])

with

def cleanAddress(series):
    coclen=""
    number_of_letters=0
    if series:
        for i in range(len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
            elif series[i].isalpha():
                number_of_letters+=1
                coclen+=series[i]
                if number_of_letters==4:
                    break
        for i in range(i,len(series)):
            if series[i].isnumeric():
                coclen+=series[i]
    return coclen

This returns an error while I execute

cleanAddress(df1["Address"])

The full error is as follows:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-b653a19f5638> in <module>()
----> 1 df1["coClean"]=cleanAddress(df1["Address"])

1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
   1328     def __nonzero__(self):
   1329         raise ValueError(
-> 1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1332         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Dr. Islam
  • 47
  • 7
  • 1
    Welcome to Stack Overflow. Please read [ask]; you are expected to make some attempt to solve the problem yourself. For example, can you write code the figures out what the first four letters are in the string? (Hint: can you write code that figures out what all the letters are? Can you write code that figures out the first four elements of that?) – Karl Knechtel Jun 11 '21 at 22:51
  • Thank you @KarlKnechtel. To demonstrate I have attempted "to write code that figures out what all the letters are? " I have revised the post including the new code. – Dr. Islam Jun 11 '21 at 22:56

1 Answers1

1
import pandas as pd
df1 = pd.DataFrame({"Address_x":["7 Pindara Bvd LANGWARRIN VIC 3910","2a Manor St BACCHUS MARSH VIC 3340","38 Sommersby Rd POINT COOK VIC 3030","17 Moira Avenue, Carnegie, Vic 3163"],"Address_y":["7 Pindara Blv, Langwarrin, VIC 3910","2a Manor Street, BACCHUS MARSH, VIC 3340","38 Sommersby Road, Point Cook, VIC 3030","17 Moira Avenue, Carnegie, Vic 3163"]})
def cleanAddress(series):
    cocleans=[]
    for address in series:
        number_of_letters=0
        coclean=""
        for i in range(len(address)):
            if address[i].isnumeric():
                coclean+=address[i]
            elif address[i].isalpha():
                number_of_letters+=1
                coclean+=address[i]
                if number_of_letters==4:
                    break
        for i in range(i,len(address)):
            if address[i].isnumeric():
                coclean+=address[i]
        cocleans.append(coclean.lower())
    return cocleans
df1["coClean"]=cleanAddress(df1["Address_x"])
user2952903
  • 365
  • 2
  • 10
  • 1
    Thank you so much for this update. This is very very kind. I updated the function as you suggested and tried running the function as cleanAddress(df1["Address"]) I get the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). – Dr. Islam Jun 12 '21 at 02:48
  • 1
    This function returns string that contains a.house numbers, b. first four letters and c. postcode, as I understood you want. Maybe I could explain this error if you posted more code or full errormessage as edit to you post. – user2952903 Jun 12 '21 at 03:06
  • 1
    thank you so much for your kindness. I am grateful. I have updated the post above as per your advice my friend. I truly appreciate your help as a new user. – Dr. Islam Jun 12 '21 at 03:36
  • 1
    If you want to add new column to Pandas dataframe, you have to assign it a list, not just 1 value. Add all coClean addresse, you got from my function to a list and then assign this list to the new column. Use Method I.1 from this page: https://re-thought.com/how-to-add-new-columns-in-a-dataframe-in-pandas/. I will make a edit to my post that shows how to do it. – user2952903 Jun 12 '21 at 04:03
  • 1
    Thank you my friend. Very useful. However even if I just run the cleanAddress function, i still get that error. I have edited my post not including the list in the dataframe. cleanAddress is expected to return a list. I have edited the list that is expected to return for each address. Thank you for very useful suggestions. The dataframe link is really useful. Thank you. – Dr. Islam Jun 12 '21 at 06:01
  • 1
    @Dr.Islam, did you try my code after last edit? – user2952903 Jun 12 '21 at 06:34
  • 1
    I tried your code. The result is as follows: array(['2aMan3340', '6AdleyPlHAMPTONPARKVIC39766', '1544BanksdaleRdHANSONVILLEVIC36755', '31WestminsterDrWERRIBEEVIC30300', '39KirrakStNORTHWONTHAGGIVIC39955', '121ScottsRdCAPECLEARVIC33511', '21CanberraRdTOORAKVIC31422', '7PindaraBvdLANGWARRINVIC39100', '38SommersbyRdPOINTCOOKVIC30300', '17MoiraAvenueCarnegieVic31633', '17NewmanAvenueCarnegieVIC31633', '17TranmereAveneueCarnegieVIC31633', '17MaroonaRdCarnegieVIC31633'], dtype=object). I require the format in '17maroo3163' my friend. – Dr. Islam Jun 12 '21 at 06:57
  • 1
    @Dr.Islam, sorry. I edited it now. I am quite sure it should finally work as expected now. – user2952903 Jun 12 '21 at 11:17
  • 1
    Thank you So much. This did it perfectly. – Dr. Islam Jun 12 '21 at 13:24