3

I have a large data set with two datetime columns and as I’m trying to clean it, I created a function that apply to each cell of these two columns:

#dropoff_datetime
def date_time_dropoff(cell):
    if pd.isnull(cell) == True:
        return('Missing Date Time dropoff')
    elif type(cell) == datetime.datetime:
        return ('Clean Date Time dropoff')
    elif type(cell) != datetime.datetime:
        return ('Type error Date Time dropoff')

When I run it first time on my dataframe, it returns that the cells have the wrong type (indeed, I see that it is of type object when I use df.dtypes). So next I convert the df using pandas method:

df2['Statue_dropoff_datetime'] = df['dropoff_datetime'].apply(date_time_dropoff)

and run the function again, but it stills return Type error because it is not type datetim.datetime. However, once I use df.dtypes after the conversion I obtain the correct datetimetype:

dropoff_datetime datetime64[ns]

Without success, I tried answers suggested for similar questions dealing with convesion, mainly: Converting object column in pandas dataframe to datetime Convert Pandas Column to DateTime Converting object to datetime format in python How to change the datetime format in pandas

Btw, when I import at the beginning, I use: import datetime

Here's my df:

dropoff_datetime
2020-06-10 11:16:00
2020-06-21 11:53:06
2020-06-09 13:25:21
2020-06-01 19:06:05

#dropoff_datetime
def date_time_dropoff(cell):
    if pd.isnull(cell) == True:
        return('Date Time dropoff missing')
    elif type(cell) == datetime:
        return ('Date Time dropoff clean')
    elif type(cell) != datetime:
        return ('Date Time dropoff Type error')

df1[‘Statue_dropoff_datetime'] = df['dropoff_datetime'].apply(date_time_dropoff)

and when I print:

print(pd.value_counts(df1.values.ravel()))
Type error Date Time dropoff    47703

while using df.dtypes gives:

df.dtypes
dropoff_datetime        datetime64[ns]
Hibou
  • 196
  • 2
  • 3
  • 10
  • 1
    Could you please make sure to have a complete working example so it's easier for others to reproduce the problem? Thanks! https://stackoverflow.com/help/minimal-reproducible-example – fischmalte Jan 29 '21 at 17:17
  • Is it better now ? @fischmalte – Hibou Jan 29 '21 at 17:35
  • related: [Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?](https://stackoverflow.com/q/64195782/10197418) – FObersteiner Jan 29 '21 at 18:41

1 Answers1

4

Basically, to check if an object is of a certain type, use isinstance. So your function could look like

def date_time_dropoff(cell):
    if pd.isnull(cell):
        return('Date Time dropoff missing')
    elif isinstance(cell, datetime):
        return ('Date Time dropoff clean')
    else:
        return ('Date Time dropoff Type error')

But I think you can do this more easily only with the call to isinstance, which you use to check for type datetime.datetime. A check for pd.Timestamp would be the "more narrow" criterion, cleaning out datetime.datetime as well.

Ex:

Given a DataFrame with a mixed dtype column,

from datetime import datetime
import pandas as pd
import numpy as np

df = pd.DataFrame({'mixedcol': ['2021-01-01', datetime(2021,1,1), 
                                pd.Timestamp('2021-01-01'), 'something else', 42, np.nan]})

you can mask all datetime values like

m = df['mixedcol'].apply(lambda v: isinstance(v, datetime))

...and use the mask to set all other values to NaN. Then convert to datetime and you should be where you want to be:

df['cleaned'] = pd.to_datetime(np.where(m, df['mixedcol'], np.NaN))

# df['cleaned']
# 0          NaT
# 1   2021-01-01
# 2   2021-01-01
# 3          NaT
# 4          NaT
# 5          NaT
# Name: cleaned, dtype: datetime64[ns]
FObersteiner
  • 22,500
  • 8
  • 42
  • 72