1

Right now I am working on open datasets, one task is to 'standardize' values. We only care about 'age', 'gender', 'race', 'ethnicity', and 'country' attributes.

For 'Country', we could have 'united-states', 'united states','usa', 'us' as valid expressions matched 'US'. There are many other countries with different expression values in different datasets.

The goal is to 'convert' all these values to, for example, 'US'. However, I don't know whether there exist other expressions of 'US' in other datasets. The storage of all datasets is 2.3 TB. I cannot iterate all datasets to get all different expressions for 'US', then iterate again to convert all values.

Is there any tool to detect and convert such things?

Right now things are all under Python.

Best

Max

Max Wang
  • 21
  • 5
  • Hey Max, welcome to SO, is better if you show us how does the data on `Gender` looks like... Please add the input and the expected output. All in all perhaps you can solve your problem with a combination of lower and replace, etc. – Dani Mesejo Dec 27 '20 at 18:03
  • @DaniMesejo I edited my expression, if it is not clear enough, I might make a graph to explain the aim. All I want to solve is replacement, but the question is that I do not have all unique expressions for each value, for example, I have no idea how 'US' will be expressed in different datasets, so I might miss some values with simple lower and replace. It is hard for me to detect whether this value as a valid expression of 'US'. – Max Wang Dec 27 '20 at 18:25
  • If you can only iterate once, then you need a function to tell if a name is equivalent to 'US'. If it is then you replace it on the fly. – Ismael EL ATIFI Dec 27 '20 at 20:12
  • @IsmaelELATIFI So seems like I have to iterate at least once. Thank you for the help. – Max Wang Dec 27 '20 at 21:02

1 Answers1

1

The only way to "standardize" values is to know what to match and replace, which involves "looping over" your data to find what values exist at all. Once you have this list, then you can design the appropriate matching strings to replace.

So here, we can "loop" through the data frame (i.e., using a function that does the looping and counting for us) and tally the values in our column. Once we have that tally sheet, then we can copy and paste which values appear to match our country name. You can do this part manually because it is easier than automating this at this point. An important part of this step is to lowercase the values (with .str.lower()) so that you can better compare and count your strings.

>>> df = pd.DataFrame({'Country': ['united-states', 'usa', 'USA', 'UNITED STATES',
                                   'united states', 'us', 'us', 'us']})
>>> df
         Country
0  united-states
1            usa
2            USA
3  UNITED STATES
4  united states
5             us
6             us
7             us
>>> df['Country'].value_counts()
us               3
united states    1
usa              1
united-states    1
UNITED STATES    1
USA              1
Name: Country, dtype: int64
>>> # Take a look at values above to generate replace list
>>> df['Country'].str.lower().replace(['united-states', 'usa', 'united states'], 'us')
0    us
1    us
2    us
3    us
4    us
5    us
6    us
7    us
Name: Country, dtype: object

Read more here https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

The solution above works for just the Country column, but this can be down for any other column of your data. This is tedious, but this kind of data wrangling has to be done manually, otherwise, you run into the possibility of missing data you should have normalized. And if you fail to catch some of those values initially, you'll need to invest even more time to fix those missing ones.

This story changes if your data happens to change regularly. In that case, the above will be a good first approach. But then you should be able to use techniques like fuzzy matching with Python packages like fuzzywuzzy. This will allow you to figure out which strings are the more similar and then parse it out that way. However, this requires a lot more detailed work.

Eric Leung
  • 2,354
  • 12
  • 25