Using Python 3, I work with a data frame which requires text preprocessing.
The data frame consists of historical sales for many different medical products with many different strengths. For simplification, the code below only shows a part of the strength column.
df = pd.DataFrame({'Strength': ['20 mg / 120 mg', ' 40/320 mg', '20mg/120mg', '150+750mg', '20/120MG', '62.5mg/375mg', '100 mg', 'Product1 20 mg, Product2 120 mg', '40mg/320mg', 'Product 20mg/120mg', 'Product1 20mg Product2 120mg', '100mg/1ml', '20 mg./ 120 mg..', '62.5 mg / 375 mg', '40/320mg 9s', '40/320', '50/125', '100mg..' '20/120']})
Strength
0 20 mg / 120 mg
1 40/320 mg
2 20mg/120mg
3 150+750mg
4 20/120MG
5 62.5mg/375mg
6 100 mg
7 Product1 20 mg, Product2 120 mg
8 40mg/320mg
9 Product 20mg/120mg
10 Product1 20mg Product2 120mg
11 100mg/1ml
12 20 mg./ 120 mg..
13 62.5 mg / 375 mg
14 40/320mg 9s
15 40/320
16 50/125
17 100mg..20/120
As you can see, there are different spellings for products which actually belong to the same Strength. For example, '20 mg / 120 mg' and 'Artemether 20 mg, Lumefantrine 120 mg' actually have the same strength.
Setting the text to lowercase, removing whitespaces and replacing + by / shown by the following code brings some standardization, but there are still lines with clearly the same strength.
df['Strength'] = df['Strength'].str.lower()
df['Strength'] = df['Strength'].str.replace(' ', '')
df['Strength'] = df['Strength'].str.replace('+', '/')
Adding commands like the following allows to further reduce the number of different notations, but this is way too manual.
df['Strength'].loc[df['Strength'].str.contains('Product1', case=False)
& df['Strength'].str.contains('Product2', case=False)] = '20mg/120mg'
Do you have any approaches for removing the number of unique notations in an efficient way?