This may have a simple solution, but I am finding it hard to make this function work for my dataset.
I have a salary column with a variety of data in it. Example dataframe below:
ID   Income                            Desired output
1    26000                             26000
2    45K                               45000
3    -                                 NaN
4    0                                 NaN
5    N/A                               NaN
6    2000                              2000
7    30000 - 45000                     37500 ((30000 + 45000) / 2)
8    21000 per Annum                   21000
9    50000 per annum                   50000
10   21000 to 30000                    25500 ((21000 + 30000) / 2)
11                                     NaN
12   21000 To 50000                    35500 ((21000 + 50000) / 2)
13   43000/year                        43000
14                                     NaN
15   80000/Year                        80000
16   12.40 p/h                         12896 (12.40 x 20 x 52)
17   12.40 per hour                    12896 (12.40 x 20 x 52)
18   45000.0 (this is a float value)   45000
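The arithmetic behind the desired output column can be checked with a short sketch. The helper names `midpoint` and `annual_from_hourly` are made up for illustration; the 20 hours per week and 52 weeks per year come from rows 16 and 17 above.

```python
def midpoint(low, high):
    """Average of a salary range, e.g. '30000 - 45000'."""
    return (low + high) / 2

def annual_from_hourly(rate, hours_per_week=20, weeks_per_year=52):
    """Yearly income for an hourly rate, using the 20 x 52 rule from the table."""
    return rate * hours_per_week * weeks_per_year

print(midpoint(30000, 45000))            # 37500.0
print(round(annual_from_hourly(12.40)))  # 12896
```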
@user34974 has been very helpful in providing a workable solution (below). However, it raises an error because the dataframe column also contains float values. Can anyone help me extend the function so that it also handles float values in the column? In the end, the updated column should contain float values.
import numpy as np
import pandas as pd

# Substrings to strip before parsing. Every entry must be a string,
# because str.replace() only accepts string arguments.
Normrep = ['N/A', 'per Annum', 'per annum', '/year', '/Year', 'p/h', 'per hour']

def clean_income(value):
    for i in Normrep:
        value = value.replace(i, "")
    # '-' cannot be added to Normrep because it is also used as a range separator
    if len(value) == 0 or value.isspace() or value == '-':
        return np.nan
    elif value == '0':
        return np.nan
    # no extra letters should remain after a K at this point, so expand it here
    if value.endswith('K'):
        value = value.replace('K', '000')
    # ranges written with ' to ', ' To ' or ' - '
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
    if len(vals) != 2:
        vals = value.split(' - ')
    if len(vals) == 2:
        return (float(vals[0]) + float(vals[1])) / 2
    try:
        return float(value)
    except ValueError:
        # either not proper data, or a format of input that still needs handling
        return np.nan
testData = ['26000', '45K', '-', '0', 'N/A', '2000', '30000 - 45000',
            '21000 per Annum', '', '21000 to 30000', '21000 To 50000',
            '43000/year', 35000.0]
df = pd.DataFrame(testData)
print(df)
# currently raises an error because of the float values
df[0] = df[0].apply(clean_income)
print(df)
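One possible way to cater for float values is to check the type of each cell before any string method is called. The sketch below is not the original function: it swaps the chain of `replace`/`split` calls for a regular expression, and the names `parse_income`, `AMOUNT`, `HOURLY`, `HOURS_PER_WEEK` and `WEEKS_PER_YEAR` are made up for illustration. It assumes that any two numbers in a cell form a range to be averaged, and that hourly rates are scaled by 20 x 52 as in the table above. `math.nan` is used so the block runs without NumPy; it is the same value as `np.nan`.

```python
import math
import numbers
import re

HOURS_PER_WEEK = 20   # assumption taken from the "12.40 x 20 x 52" rows
WEEKS_PER_YEAR = 52

# A number with an optional decimal part, optionally followed by a "k" suffix.
AMOUNT = re.compile(r'(\d+(?:\.\d+)?)\s*(k\b)?')
HOURLY = re.compile(r'p/h|per hour')

def parse_income(value):
    """Return a yearly income as a float, or NaN when nothing usable is found."""
    # Numeric cells (int, float, NumPy scalars) skip the string logic entirely.
    if isinstance(value, numbers.Real) and not isinstance(value, bool):
        number = float(value)
        return math.nan if math.isnan(number) or number == 0 else number
    if not isinstance(value, str):
        return math.nan

    text = value.lower()
    # Collect every number in the cell, expanding "45k" to 45000.
    amounts = [float(num) * (1000 if kilo else 1)
               for num, kilo in AMOUNT.findall(text)]
    if len(amounts) not in (1, 2):
        return math.nan
    # One number is used as-is; two numbers are treated as a range and averaged.
    result = sum(amounts) / len(amounts)
    if HOURLY.search(text):
        result *= HOURS_PER_WEEK * WEEKS_PER_YEAR
    return math.nan if result == 0 else result

samples = ['45K', '30000 - 45000', '21000 To 50000', '43000/year', 35000.0, '-', 0]
print([parse_income(s) for s in samples])
```

Applied to the dataframe above, this would be something like `df[0] = df[0].apply(parse_income)`. Because every branch returns a float, the resulting column should end up with a float dtype.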