3

Good morning! I am trying to convert a column that has multiple dates in various formats into a datetime column.

import pandas as pd

data = {
'c1':['2020/10/01','10/01/2020','10/1/2020','31/08/2020','12-21-2020','5-3-2020','05-03-2020','ERRER']    
}

df = pd.DataFrame (data, columns = ['c1'])

The code above creates my dataframe to test on. If I run the following code, I get an error because 'ERRER' is not a valid date:

df['c2'] = df.apply(lambda x: pd.to_datetime(x['c1']), axis=1)

Is there a way to skip a row in the apply function if it can't be converted to a datetime? Or convert error rows to a default date (i.e. '1900-01-01')?

NLMDEJ
  • 385
  • 1
  • 3
  • 14

3 Answers3

2

If use errors='coerce' in to_datetime get NaT (missing values for datetimes) if not datetime-like values - yoou can pass column for improve performance, not apply for looping:

df['c2'] = pd.to_datetime(df['c1'], errors='coerce')
print (df)
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03
7       ERRER        NaT

Then remove rows with NaTs by column c2:

df1 = df.dropna(subset=['c2'])
print (df1)
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03

Or you can replace them to some datetime (not string '1900-01-01'):

df['c2'] = pd.to_datetime(df['c1'], errors='coerce').fillna(pd.Timestamp('1900-01-01'))
print (df)
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03
7       ERRER 1900-01-01

print (df.dtypes)
c1            object
c2    datetime64[ns]
dtype: object
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks! This does exactly what I was looking for and takes it one step further. Its always the simple solutions that seem to be the best. – NLMDEJ Oct 27 '20 at 13:15
1

Pass errors with coerce, for the one cannot convert will return NaT

df['c2'] = pd.to_datetime(df['c1'], errors='coerce')
df
Out[76]: 
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03
7       ERRER        NaT
BENY
  • 317,841
  • 20
  • 164
  • 234
1

You may want to convert ERRER beforehand using replace. This means that other rows with missing values NaT will remain NaT.

df['c1'] = df['c1'].replace('ERRER', '01/01/1900')

After that it should work:

df['c2'] = df.apply(lambda x: pd.to_datetime(x['c1']), axis=1)
Ruthger Righart
  • 4,799
  • 2
  • 28
  • 33