2

in a dataframe, there is a 'City' column containing different city names as well as various formats/misspellings. Here are the unique values in the column:

array(['somerville', 'hyde park', 'lexington', 'brookline', 'wellesley',
       'dover ', 'newton', 'westford', 'boston', 'needham', 'arlington',
       'wayland', 'waltham', 'cambridge', 'chestnuthill', 'salisbury ',
       'waban', 'weston', 'neeham', 'auburndale', 'belmont', 'allston',
       'auberdale', 'bedford', 'dover', 'lawrence', 'wilmington',
       'woburn', 'braintree', 'acton', 'winchester', 'middleton',
       'west newton', 'watertown', 'newton center', 'northfield',
       'roslindale', 'westwood', 'groton', 'natick', 'concord',
       'chestnut hill', 'westborough', 'sudbury', 'sherborn', 'quincy',
       'burlington', 'andover', 'littleton', 'stoughton'], dtype=object)

I want to clean up only four cities names using mapping, and leave other city names unchanged.

I used this code below:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

cities_map = {y:x[0] for x in cities_tup for y in x[1]}

df_MA.City = df_MA.City.map(cities_map)
df_MA.City.unique()

But the output is : array([nan, 'Dover', 'Newton', 'Needham'], dtype=object)

So basically, it changed all other city names into nan which is not I want. Are there any other methods or packages to clean up city names?

Orchid9
  • 37
  • 3

2 Answers2

2

Use replace:

df_MA['City'] = df_MA['City'].replace(cities_map)

Or combine map and fillna:

df_MA['City'] = df_MA['City'].map(cities_map).fillna(df_MA['City'])

NB. Don't assign to df_MA.City, always use the square brackets notation: df_MA['City'].

mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thanks, this works well. Could you further explain the difference between using df_MA.City and df_MA['City']? I thought they were equivalent, both indicating selecting a column. – Orchid9 Mar 23 '23 at 15:27
  • In short, never use the `df.xxx` notation, it's full of pitfalls. You can't use special characters, you can't assign... – mozway Mar 23 '23 at 15:33
  • This is very helpful, I will keep that in mind, just use [ ] moving forward. Thank you! – Orchid9 Mar 23 '23 at 16:12
0

Your code is only mapping the four specified cities so the others became NaN

I modified your code to use a loop and statement to check if a city needs to be mapped:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

for city_name, variants in cities_names:
    for variant in variants:
        df_MA.loc[df_MA.City.str.lower() == variant, 'City'] = city_name

df_MA.City.unique()
Saxtheowl
  • 4,136
  • 5
  • 23
  • 32