2

I have a dataframe with a column of postal addresses (generated with geopy.geocoders GoogleV3 - I used it to parse my dataframe). The output of geolocator.geocode, however, has the country name - which I don't want. It also contains Unit number - which I don't want.

How can I do it?

I have tried:

test_add['clean address'] = test_add.apply(lambda x: x['clean address'][:-5], axis = 1)

and

def remove_units(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

test_add['parsed addresses'] = test_add['clean address'].apply(remove_units)

It works for:

data = ["941 Thorpe St, Rock Springs, WY 82901, USA",
    "2809 Harris Dr, Antioch, CA 94509, USA",
    "7 Eucalyptus, Newport Coast, CA 92657, USA",
    "725 Mountain View St, Altadena, CA 91001, USA",
    "1966 Clinton Ave #234, Calexico, CA 92231, USA",
    "431 6th St, West Sacramento, CA 95605, USA",
    "5574 Old Goodrich Rd, Clarence, NY 14031, USA",
    "Valencia Way #1234, Valley Center, CA 92082, USA"]
test_df = pd.DataFrame(data, columns=['parsed addresses'])

but get an error: "AttributeError: 'float' object has no attribute 'split'" when I use a larger dataframe with 150k such addresses.

Ultimately, I require only street number, street name, city, state and zipcode.

Saania
  • 623
  • 5
  • 6

2 Answers2

1

Assuming that the unit number always starts with a #, and that the country name is always the last comma separated entry in the address, then this should work:

def parse_address(address: str) -> str:
    # Remove the final comma separated entry assumed to be the country
    address_without_country = " ".join([x for x in address.split(",")[:-1]])
    
    return " ".join([x for x in address_without_country.split() 
                     if not x.startswith("#")])

def main():
    ...
    parsed_addresses = []
    for address in raw_addresses:
        # Either cast to string or try catch the case where not a string
        parsed_addresses.append(str(address))

Mouse
  • 395
  • 1
  • 7
1

Another possible solution:

test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)

EXPLANATION

  • \D means non-digit character.
  • \D+ means one or more non-digit character
  • $ means end of string
  • | means logical OR
  • \s means space character
  • \d+ means one or more digit character

For a more comprehensive treatment of regex, please see Regular Expression HOWTO.

Output:

0       941 Thorpe St, Rock Springs, WY 82901
1           2809 Harris Dr, Antioch, CA 94509
2       7 Eucalyptus, Newport Coast, CA 92657
3    725 Mountain View St, Altadena, CA 91001
4        1966 Clinton Ave, Calexico, CA 92231
5       431 6th St, West Sacramento, CA 95605
6    5574 Old Goodrich Rd, Clarence, NY 14031
7       Valencia Way, Valley Center, CA 92082
Name: parsed addresses, dtype: object
PaulS
  • 21,159
  • 2
  • 9
  • 26