I'm doing data cleaning and found that the year column contains several different formats, e.g. 2011, 2012-2013, 2010-14. How can I normalize these so that each cell shows only the latest year, i.e. 2011, 2013, 2014?

I tried the code below. It works for '2012-2013' (the value is updated to 2013), but for '2010-14' the output is '0-14' instead of '2014'. How can I fix it? Thanks.

def clean_year(year):
    if len(year) == 4:
        return year
    elif '-' in year:
        start, end = year.split('-')
        if len(end) == 2:
            return ('20'+end)
        else:
            return end.strip()

dataset1['Year'] = dataset1['Year'].apply(clean_year)
Pawel Kam
libraG

1 Answer

Your solution works for me. Here is an alternative solution:

import pandas as pd

dataset1 = pd.DataFrame({'Year': ['2011', '2012-2013', '2010-14']})

# split each value on '-' into a two-column DataFrame of floats
df = dataset1['Year'].str.split('-', expand=True).astype(float)
print(df)
        0       1
0  2011.0     NaN
1  2012.0  2013.0
2  2010.0    14.0

# values below 30 are treated as two-digit years: add 2000, then take the row-wise maximum
dataset1['Year'] = df.mask(df.lt(30), df.add(2000)).max(axis=1).astype(int)
print(dataset1)
   Year
0  2011
1  2013
2  2014
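
If you would rather keep a plain function with apply, here is a minimal sketch of a more defensive clean_year. It is not the code from the question: it splits on '-' before looking at any lengths, strips surrounding whitespace, and returns an int. It rests on two assumptions: a two-digit end year belongs to the same century as the four-digit start year, and every value is a non-empty string of digits and dashes (missing values would need separate handling). The extra rows ' 2010-14 ' and '1998-99' are made up for illustration.

```python
import pandas as pd


def clean_year(value):
    """Return the latest year in a value such as '2011', '2012-2013' or '2010-14'."""
    # Split on '-' first and strip whitespace from each piece,
    # so the result does not depend on the length of the raw string.
    parts = [p.strip() for p in str(value).split('-') if p.strip()]
    start, end = parts[0], parts[-1]
    if len(end) == 2 and len(start) == 4:
        # Assumption: a two-digit end year shares the century of the start year.
        end = start[:2] + end
    return int(end)


dataset1 = pd.DataFrame({'Year': ['2011', '2012-2013', ' 2010-14 ', '1998-99']})
dataset1['Year'] = dataset1['Year'].apply(clean_year)
print(dataset1['Year'].tolist())  # [2011, 2013, 2014, 1999]
```

Taking the century from the start year avoids hard-coding '20', so a range like '1998-99' becomes 1999 rather than 2099.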
jezrael