Preprocessing text data with different notation in Python

Question

Using Python 3, I work with a data frame which requires text preprocessing.

The data frame consists of historical sales for many different medical products with many different strengths. For simplification, the code below only shows a part of the strength column.

df = pd.DataFrame({'Strength': ['20 mg / 120 mg', ' 40/320 mg', '20mg/120mg', '150+750mg', '20/120MG', '62.5mg/375mg', '100 mg', 'Product1 20 mg, Product2 120 mg', '40mg/320mg', 'Product 20mg/120mg', 'Product1 20mg Product2 120mg', '100mg/1ml', '20 mg./ 120 mg..', '62.5 mg / 375 mg', '40/320mg 9s', '40/320', '50/125', '100mg..' '20/120']})

                                 Strength
0                          20 mg / 120 mg
1                               40/320 mg
2                              20mg/120mg
3                               150+750mg
4                                20/120MG
5                            62.5mg/375mg
6                                  100 mg
7         Product1 20 mg, Product2 120 mg
8                              40mg/320mg
9                      Product 20mg/120mg
10           Product1 20mg Product2 120mg
11                              100mg/1ml
12                       20 mg./ 120 mg..
13                       62.5 mg / 375 mg
14                            40/320mg 9s
15                                 40/320
16                                 50/125
17                          100mg..20/120

As you can see, there are different spellings for products which actually belong to the same Strength. For example, '20 mg / 120 mg' and 'Artemether 20 mg, Lumefantrine 120 mg' actually have the same strength.

Setting the text to lowercase, removing whitespaces and replacing + by / shown by the following code brings some standardization, but there are still lines with clearly the same strength.

df['Strength'] = df['Strength'].str.lower()
df['Strength'] = df['Strength'].str.replace(' ', '')
df['Strength'] = df['Strength'].str.replace('+', '/')

Adding commands like the following allows to further reduce the number of different notations, but this is way too manual.

df['Strength'].loc[df['Strength'].str.contains('Product1', case=False)
                   & df['Strength'].str.contains('Product2', case=False)] = '20mg/120mg'

Do you have any approaches for removing the number of unique notations in an efficient way?

score 0 · Answer 1 · answered Jan 02 '20 at 10:00

0

Add a new column with fixed labels for each strength and train it based on a suitable ml classifier and predict the appropriate strength for the new item.

For each new notation, manually assign a new label and retrain again...

answered Jan 02 '20 at 10:00

Laxmikant

2,046
3
30
44

So I would add a column with for example "20mg/120mg" for all the training examples with that strength? Do you know a suitable ML classifier for this? – frederikwillersinn Jan 03 '20 at 07:40
@Fred - yes for first question. Choosing an ML Classifier is totally depends on the data and its variation. You try with multiple classifiers and select the one which gives more accurate results for you. (Replying from mobile so could not provide links which could help u) – Laxmikant Jan 03 '20 at 09:17

Preprocessing text data with different notation in Python

1 Answers1