Pandas Dataframe: df.apply ignore error rows

Question

Good morning! I am trying to convert a column that has multiple dates in various formats into a datetime column.

import pandas as pd

data = {
'c1':['2020/10/01','10/01/2020','10/1/2020','31/08/2020','12-21-2020','5-3-2020','05-03-2020','ERRER']    
}

df = pd.DataFrame (data, columns = ['c1'])

The code above creates my dataframe to test on. If I run the following code, I get an error because 'ERRER' is not a valid date:

df['c2'] = df.apply(lambda x: pd.to_datetime(x['c1']), axis=1)

Is there a way to skip a row in the apply function if it can't be converted to a datetime? Or convert error rows to a default date (i.e. '1900-01-01')?

jezrael · Answer 1 · 2020-10-27T13:26:43.297

If use errors='coerce' in to_datetime get NaT (missing values for datetimes) if not datetime-like values - yoou can pass column for improve performance, not apply for looping:

df['c2'] = pd.to_datetime(df['c1'], errors='coerce')
print (df)
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03
7       ERRER        NaT

Then remove rows with NaTs by column c2:

df1 = df.dropna(subset=['c2'])
print (df1)
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03

Or you can replace them to some datetime (not string '1900-01-01'):

df['c2'] = pd.to_datetime(df['c1'], errors='coerce').fillna(pd.Timestamp('1900-01-01'))
print (df)
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03
7       ERRER 1900-01-01

print (df.dtypes)
c1            object
c2    datetime64[ns]
dtype: object

Thanks! This does exactly what I was looking for and takes it one step further. Its always the simple solutions that seem to be the best. — NLMDEJ, Oct 27 '20 at 13:15

score 1 · Answer 2 · answered Oct 27 '20 at 13:12

1

Pass errors with coerce, for the one cannot convert will return NaT

df['c2'] = pd.to_datetime(df['c1'], errors='coerce')
df
Out[76]: 
           c1         c2
0  2020/10/01 2020-10-01
1  10/01/2020 2020-10-01
2   10/1/2020 2020-10-01
3  31/08/2020 2020-08-31
4  12-21-2020 2020-12-21
5    5-3-2020 2020-05-03
6  05-03-2020 2020-05-03
7       ERRER        NaT

answered Oct 27 '20 at 13:12

BENY

317,841
20
164
234

Thanks! This does exactly what I was looking for. Its always the simple solutions that seem to be the best. – NLMDEJ Oct 27 '20 at 13:13
@NLMDEJ yw happy coding ~ – BENY Oct 27 '20 at 13:16

score 1 · Answer 3 · answered Oct 27 '20 at 13:24

You may want to convert ERRER beforehand using replace. This means that other rows with missing values NaT will remain NaT.

df['c1'] = df['c1'].replace('ERRER', '01/01/1900')

After that it should work:

df['c2'] = df.apply(lambda x: pd.to_datetime(x['c1']), axis=1)

Pandas Dataframe: df.apply ignore error rows

3 Answers3