Parse DataFrame of postal addresses - remove country and unit number

Question

I have a dataframe with a column of postal addresses (generated with geopy.geocoders GoogleV3 - I used it to parse my dataframe). The output of geolocator.geocode, however, has the country name - which I don't want. It also contains Unit number - which I don't want.

How can I do it?

I have tried:

test_add['clean address'] = test_add.apply(lambda x: x['clean address'][:-5], axis = 1)

and

def remove_units(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

test_add['parsed addresses'] = test_add['clean address'].apply(remove_units)

It works for:

data = ["941 Thorpe St, Rock Springs, WY 82901, USA",
    "2809 Harris Dr, Antioch, CA 94509, USA",
    "7 Eucalyptus, Newport Coast, CA 92657, USA",
    "725 Mountain View St, Altadena, CA 91001, USA",
    "1966 Clinton Ave #234, Calexico, CA 92231, USA",
    "431 6th St, West Sacramento, CA 95605, USA",
    "5574 Old Goodrich Rd, Clarence, NY 14031, USA",
    "Valencia Way #1234, Valley Center, CA 92082, USA"]
test_df = pd.DataFrame(data, columns=['parsed addresses'])

but get an error: "AttributeError: 'float' object has no attribute 'split'" when I use a larger dataframe with 150k such addresses.

Ultimately, I require only street number, street name, city, state and zipcode.

score 1 · Answer 1 · answered Aug 15 '23 at 17:41

Assuming that the unit number always starts with a #, and that the country name is always the last comma separated entry in the address, then this should work:

def parse_address(address: str) -> str:
    # Remove the final comma separated entry assumed to be the country
    address_without_country = " ".join([x for x in address.split(",")[:-1]])
    
    return " ".join([x for x in address_without_country.split() 
                     if not x.startswith("#")])

def main():
    ...
    parsed_addresses = []
    for address in raw_addresses:
        # Either cast to string or try catch the case where not a string
        parsed_addresses.append(str(address))

PaulS · Accepted Answer · 2023-08-15T17:52:18.447

1

Another possible solution:

test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)

EXPLANATION

\D means non-digit character.
\D+ means one or more non-digit character
$ means end of string
| means logical OR
\s means space character
\d+ means one or more digit character

For a more comprehensive treatment of regex, please see Regular Expression HOWTO.

Output:

0       941 Thorpe St, Rock Springs, WY 82901
1           2809 Harris Dr, Antioch, CA 94509
2       7 Eucalyptus, Newport Coast, CA 92657
3    725 Mountain View St, Altadena, CA 91001
4        1966 Clinton Ave, Calexico, CA 92231
5       431 6th St, West Sacramento, CA 95605
6    5574 Old Goodrich Rd, Clarence, NY 14031
7       Valencia Way, Valley Center, CA 92082
Name: parsed addresses, dtype: object

edited Aug 15 '23 at 17:52

answered Aug 15 '23 at 17:41

PaulS

21,159
2
9
26

Works perfectly. Could you please tell me in short what this function does? The part `r',\D+$|\s#\d+'`? – Saania Aug 15 '23 at 17:46
1

Yes, @Saania: I am going to add an explanation to my answer. – PaulS Aug 15 '23 at 17:48
1

That would be useful for others too like me, thanks! – Saania Aug 15 '23 at 17:50
Done, @Saania. If something else needs explanation, please let me know it. – PaulS Aug 15 '23 at 17:53
1

Yes, @Saania: Just add the following to the end of my line of code: `.to_frame()`. Or `test_df['parsed addresses'] = test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)`. – PaulS Aug 15 '23 at 18:07
1

Yes, I just figured it out, thank you very much. – Saania Aug 15 '23 at 18:08

Parse DataFrame of postal addresses - remove country and unit number

2 Answers2