python dataframe income column cleanup

Question

This maybe a simple solution, but I am finding it hard to make this function work for my dataset.

I have a salary column with variety of data in it. Example dataframe below:

ID   Income                              desired Output         
1    26000                               26000
2    45K                                 45000
3    -                                   NaN
4    0                                   NaN
5    N/A                                 NaN
6    2000                                2000   
7    30000 - 45000                       37500 (30000+45000/2)   
8    21000 per Annum                     21000                
9    50000 per annum                     50000
10   21000 to 30000                      25500 (21000+30000/2)
11                                       NaN
12   21000 To 50000                      35500 (21000+50000/2)
13   43000/year                          43000
14                                       NaN
15   80000/Year                          80000
16   12.40 p/h                           12896 (12.40 x 20 x 52)
17   12.40 per hour                      12896 (12.40 x 20 x 52)
18   45000.0 (this is a float value)     45000

@user34974 - has been very helpful in providing the workable solution (below). However, the solution provides me with an error because the dataframe column also consists of float values. Can anyone help in catering for float values in the function that can be taken care of in dataframe column? In the end the output in updated column should be float values.

Normrep = ['N/A','per Annum','per annum','/year','/Year','p/h','per hour',35000.0]

def clean_income(value):
    for i in Normrep:
        value = value.replace(i,"")



    if len(value) == 0 or value.isspace() or value == '-': #- cannot be clubbed to array as used else where in data
        return np.nan
    elif value == '0':
        return np.nan

    # now there should not be any extra letters with K hence can be done below step
    if value.endswith('K'):
        value = value.replace('K','000')
    
    # for to and -
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
        if len(vals) != 2:
            vals = value.split(' - ')

    if len(vals) == 2:
        return (float(vals[0]) + float(vals[1]))/2

    try:
        a = float(value)
        return a
    except:
        return np.nan    # Either not proper data or need to still handle some fromat of inputs.


testData = ['26000','45K','-','0','N/A','2000','30000 - 45000','21000 per Annum','','21000 to 30000','21000 To 50000','43000/year', 35000.0]


df = pd.DataFrame(testData)
print(df)

df[0] = df[0].apply(lambda x: clean_income(x))

print(df)

can you please post your desired output? I think in some cases it will result in NaN. For example, are you expecting `30000 - 45000` to result in average of the two? Similarly for values with `21000 to 30000` are you expecting average of the two? — Joe Ferndz, Nov 23 '20 at 03:13
Good. Can you please post the desired results in your question section so we can review what you are looking for. The solution may be much simpler than the big code you have. — Joe Ferndz, Nov 23 '20 at 03:59
Essentially, I am aiming to convert: 1) 40K - 40000 2) 0 - NaN 3) Blank - NaN 4) 21000 to 22000 - 22000 + 21000/2 5) 80000/Year - 80000 6) N/A - NaN 7) 10000 per annum - 10000 8) N/A - NaN 9) 90000/year - 90000 10) 10000 per Annum - 10000 11) 30000 - 45000 - 30000+45000/2 12) - - NaN 13) 45000.0 (float) - 45000.0 14) 12.40 p/h - 12.40 * 20 *52 - 12896 15) 12.40 per hour - 12.40 * 20 *52 - 12896 In the end, the whole column should be converted to float values. — DarkKnight, Nov 23 '20 at 04:07
Can you post this in the main Question section please? The text in the comments section does not have proper indentation. I need to see the data in the normal dataframe format. — Joe Ferndz, Nov 23 '20 at 04:09
are all the original values in 'Income' column strings or does it have integers and floats? For example, is the data in row 1 stored as `'26000'` or `26000` — Joe Ferndz, Nov 23 '20 at 05:08
I have posted my answer. Please review and let me know if you have questions. — Joe Ferndz, Nov 23 '20 at 07:30

score 1 · Answer 1 · answered Nov 23 '20 at 06:11

I would like to reiterate if this are only possible combinations of the data, then i have done and provided the below code.

Even if there is any small change you will need to edit to cater to new change. Let me explain what i have done, for all the strings that you want to replace with "" i have created a array Normrep. So, if you some more strings to be removed you can add in the elements. Also, for 'K','p/h','per hour' they need to specifically handled and the conversion needs to be done. So, if the string in your data may change then you need to handle that over here.

import pandas as pd
import numpy as np

Normrep = ['N/A', 'per Annum', 'per annum', '/year', '/Year']


def clean_income(value):
    if isinstance(value,float):
        return value
    else:
        isHourConversionNeeded = False;
        
        for i in Normrep:
            value = value.replace(i, "")

        if len(value) == 0 or value.isspace() or value == '-':  # - cannot be clubbed to array as used else where in data
            return np.nan
        elif value == '0':
            return np.nan

        # now there should not be any extra letters with K hence can be done below step
        if value.endswith('K'):
            value = value.replace('K', '000')
        elif value.endswith('p/h') or value.endswith('per hour'):
            isHourConversionNeeded = True
            value = value.replace('p/h',"")
            value = value.replace('per hour',"")

        # for to and -
        vals = value.split(' to ')
        if len(vals) != 2:
            vals = value.split(' To ')
            if len(vals) != 2:
                vals = value.split(' - ')

        if len(vals) == 2:
            return (float(vals[0]) + float(vals[1])) / 2

        try:
            a = float(value)
            if isHourConversionNeeded:
                a = a * 20 * 52
            return a
        except:
            return np.nan  # Either not proper data or need to still handle some fromat of inputs.


testData = ['26000', '45K', '-', '0', 'N/A', '2000', '30000 - 45000', '21000 per Annum', '', '21000 to 30000',
            '21000 To 50000', '43000/year', 35000.0,'12.40 p/h','12.40 per hour']
df = pd.DataFrame(testData)
print(df)

df[0] = df[0].apply(lambda x: clean_income(x))

print(df)

if you convert `value` to lowercase, you can eliminate a few options. Also, you may need to convert floats to int as OP wants only integers. — Joe Ferndz, Nov 23 '20 at 07:30
I initially used this solution. Worked fine as well with a few more extra steps as I had to apply it for a dataframe which included more columns than just the income column. Both this and the above solutions were very helpful! Thanks. — DarkKnight, Nov 23 '20 at 08:21
no need for looping here, this would blow up pretty quick on a medium size df. try to use the `.str` methods that pandas provides. — Umar.H, Nov 23 '20 at 09:52

Joe Ferndz · Accepted Answer · 2020-11-23T07:26:31.993

Here's how I will do it without all the looping.

c = ['ID','Income']
d = [
[1, 26000],  
[2, '45K'],
[3, '-'],
[4, 0],  
[5, 'N/A'],     
[6, 2000],         
[7, '30000 - 45000'],
[8, '21000 per Annum'],
[9, '50000 per annum'],
[10, '21000 to 30000'],
[11, ''],
[12, '21000 To 50000'],
[13, '43000/year'],
[14, ''],
[15, '80000/Year'],
[16, '12.40 p/h'],
[17, '12.40 per hour'],
[18, 45000.00]]

import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)

df['Income1'] = df['Income'].astype(str).str.lower()

df['Income1'].replace({'n/a' : '0', '':'0', '-':'0', 0:'0'}, regex=False, inplace=True)

df['Income1'].replace({'k$': '000','to': '+', '-': '+', ' per annum': '', 'p/h' : 'per hour', '/year': ''}, regex=True, inplace=True)

df['Income1'].replace(' per hour', ' * 12 * 52', regex=True, inplace=True)

df.loc[df.astype(str).Income1.str.contains('\+'),'Income1'] = '(' + df['Income1'].astype(str) + ') / 2'

df['Income1'] = df['Income1'].apply(lambda x: eval(x) if (pd.notnull(x)) else x)

df['Income1'] = (df['Income1'].fillna(0)
                 .astype(int)
                 .astype(object)
                 .where(df['Income1'].notnull()))

print (df)

The output of this will be:

    ID           Income Income1
0    1            26000   26000
1    2              45K   45000
2    3                -     NaN
3    4                0     NaN
4    5              N/A     NaN
5    6             2000    2000
6    7    30000 - 45000   37500
7    8  21000 per Annum   21000
8    9  50000 per annum   50000
9   10   21000 to 30000   25500
10  11                      NaN
11  12   21000 To 50000   35500
12  13       43000/year   43000
13  14                      NaN
14  15       80000/Year   80000
15  16        12.40 p/h    7737
16  17   12.40 per hour    7737
17  18            45000   45000

python dataframe income column cleanup

2 Answers2