The only way to "standardize" values is to know what to match and replace, which involves "looping over" your data to find what values exist at all. Once you have this list, then you can design the appropriate matching strings to replace.
So here, we can "loop" through the data frame (i.e., using a function that does the looping and counting for us) and tally the values in our column. Once we have that tally sheet, then we can copy and paste which values appear to match our country name. You can do this part manually because it is easier than automating this at this point. An important part of this step is to lowercase the values (with .str.lower()
) so that you can better compare and count your strings.
>>> df = pd.DataFrame({'Country': ['united-states', 'usa', 'USA', 'UNITED STATES',
'united states', 'us', 'us', 'us']})
>>> df
Country
0 united-states
1 usa
2 USA
3 UNITED STATES
4 united states
5 us
6 us
7 us
>>> df['Country'].value_counts()
us 3
united states 1
usa 1
united-states 1
UNITED STATES 1
USA 1
Name: Country, dtype: int64
>>> # Take a look at values above to generate replace list
>>> df['Country'].str.lower().replace(['united-states', 'usa', 'united states'], 'us')
0 us
1 us
2 us
3 us
4 us
5 us
6 us
7 us
Name: Country, dtype: object
Read more here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
The solution above works for just the Country
column, but this can be down for any other column of your data. This is tedious, but this kind of data wrangling has to be done manually, otherwise, you run into the possibility of missing data you should have normalized. And if you fail to catch some of those values initially, you'll need to invest even more time to fix those missing ones.
This story changes if your data happens to change regularly. In that case, the above will be a good first approach. But then you should be able to use techniques like fuzzy matching with Python packages like fuzzywuzzy
. This will allow you to figure out which strings are the more similar and then parse it out that way. However, this requires a lot more detailed work.