Clean up city names in python

Question

in a dataframe, there is a 'City' column containing different city names as well as various formats/misspellings. Here are the unique values in the column:

array(['somerville', 'hyde park', 'lexington', 'brookline', 'wellesley',
       'dover ', 'newton', 'westford', 'boston', 'needham', 'arlington',
       'wayland', 'waltham', 'cambridge', 'chestnuthill', 'salisbury ',
       'waban', 'weston', 'neeham', 'auburndale', 'belmont', 'allston',
       'auberdale', 'bedford', 'dover', 'lawrence', 'wilmington',
       'woburn', 'braintree', 'acton', 'winchester', 'middleton',
       'west newton', 'watertown', 'newton center', 'northfield',
       'roslindale', 'westwood', 'groton', 'natick', 'concord',
       'chestnut hill', 'westborough', 'sudbury', 'sherborn', 'quincy',
       'burlington', 'andover', 'littleton', 'stoughton'], dtype=object)

I want to clean up only four cities names using mapping, and leave other city names unchanged.

I used this code below:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

cities_map = {y:x[0] for x in cities_tup for y in x[1]}

df_MA.City = df_MA.City.map(cities_map)
df_MA.City.unique()

But the output is : array([nan, 'Dover', 'Newton', 'Needham'], dtype=object)

So basically, it changed all other city names into nan which is not I want. Are there any other methods or packages to clean up city names?

score 2 · Accepted Answer · answered Mar 22 '23 at 03:55

2

Use replace:

df_MA['City'] = df_MA['City'].replace(cities_map)

Or combine map and fillna:

df_MA['City'] = df_MA['City'].map(cities_map).fillna(df_MA['City'])

NB. Don't assign to df_MA.City, always use the square brackets notation: df_MA['City'].

answered Mar 22 '23 at 03:55

mozway

194,879
13
39
75

Thanks, this works well. Could you further explain the difference between using df_MA.City and df_MA['City']? I thought they were equivalent, both indicating selecting a column. – Orchid9 Mar 23 '23 at 15:27
In short, never use the `df.xxx` notation, it's full of pitfalls. You can't use special characters, you can't assign... – mozway Mar 23 '23 at 15:33
This is very helpful, I will keep that in mind, just use [ ] moving forward. Thank you! – Orchid9 Mar 23 '23 at 16:12

score 0 · Answer 2 · answered Mar 22 '23 at 02:19

Your code is only mapping the four specified cities so the others became NaN

I modified your code to use a loop and statement to check if a city needs to be mapped:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

for city_name, variants in cities_names:
    for variant in variants:
        df_MA.loc[df_MA.City.str.lower() == variant, 'City'] = city_name

df_MA.City.unique()

Clean up city names in python

2 Answers2