3

I have a date column in a pandas.DataFrame in various date time formats and stored as list object, like the following:

            date
1    [May 23rd, 2011]
2    [January 1st, 2010]
    ...
99   [Apr. 15, 2008]
100  [07-11-2013]
    ...
256  [9/01/1995]
257  [04/15/2000]
258  [11/22/68]
    ...
360  [12/1997]
361  [08/2002]
     ...
463  [2014]
464  [2016]

For the sake of convenience, I want to convert them all to MM/DD/YYYY format. It doesn't seem possible to use regex replace() function to do this, since one cannot execute this operation over list objects. Also, to use strptime() for each cell will be too time-consuming.

What will be the easier way to convert them all to the desired MM/DD/YYYY format? I found it very hard to do this on list objects within a dataframe.

Note: for cell values of the form [YYYY] (e.g., [2014] and [2016]), I will assume they are the first day of that year (i.e., January 1, 1968) and for cell values such as [08/2002] (or [8/2002]), I will assume they the first day of the month of that year (i.e., August 1, 2002).

Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
Chris T.
  • 1,699
  • 7
  • 23
  • 45

3 Answers3

9

Given your sample data, with the addition of a NaT, this works:

Code:

df.date.apply(lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])

Test Code:

import pandas as pd

df = pd.DataFrame([
    [['']],
    [['May 23rd, 2011']],
    [['January 1st, 2010']],
    [['Apr. 15, 2008']],
    [['07-11-2013']],
    [['9/01/1995']],
    [['04/15/2000']],
    [['11/22/68']],
    [['12/1997']],
    [['08/2002']],
    [['2014']],
    [['2016']],
], columns=['date'])

df['clean_date'] = df.date.apply(
    lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])

print(df)

Results:

                   date  clean_date
0                    []         NaT
1      [May 23rd, 2011]  05/23/2011
2   [January 1st, 2010]  01/01/2010
3       [Apr. 15, 2008]  04/15/2008
4          [07-11-2013]  07/11/2013
5           [9/01/1995]  09/01/1995
6          [04/15/2000]  04/15/2000
7            [11/22/68]  11/22/1968
8             [12/1997]  12/01/1997
9             [08/2002]  08/01/2002
10               [2014]  01/01/2014
11               [2016]  01/01/2016
Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
  • Thanks for helping on this. I suppose this solves my problem, although I could not apply this really straightforward code to my own data (the data is a pandas series contains hundred of list objects like the one shown in the example data). I thought I could have converted that to dataframe and apply you code, but it says 'Series' object has no attribute 'to_DataFrame' – Chris T. Aug 07 '17 at 14:48
  • So this code works on a panda.Series of lists, each list only having one item. The expression `df.date.apply` does an apply to the Series which is in the column `date`. If you have only a Series and not a DataFrame you can should be able to do `my_series.apply(....)`. – Stephen Rauch Aug 07 '17 at 14:54
  • Yes, this code does work on pandas dataframe with list objects. I am trying to think of a way to convert my series object to dataframe, so that I can apply your code. When I used my_series.apply(), I got this following error message "index 0 is out of bounds for axis 0 with size 0" – Chris T. Aug 07 '17 at 14:58
  • You do not need to convert. In fact this code converts the column of a DataFrame to a Series. That is what `df.date` does. If takes one column named `date` and makes it a series. So as I tried to indicate in my previous comment, take this code, and change `df.date` to be the name of your Series. Should just work. – Stephen Rauch Aug 07 '17 at 15:04
  • Yes, I did, but I received the same error message "index 0 is out of bounds for axis 0 with size 0." – Chris T. Aug 07 '17 at 15:08
  • Well it appears that there is something else non-uniform about the data in your Series. – Stephen Rauch Aug 07 '17 at 15:13
  • I couldn't solve it. The dataset just doesn't make much sense to me. But the original data series does look every bit like the example data shown in this thread. – Chris T. Aug 07 '17 at 15:15
  • @KhalilAlHooti, I just tested with that string and it works fine for me. I got `08/24/2014` – Stephen Rauch Oct 05 '18 at 03:21
  • @StephenRauch Thank you it worked finally with me. not sure what was the problem!! – Khalil Al Hooti Oct 05 '18 at 03:27
  • @Stephen Rauch. This one works great. how we can apply same function to date columns of same dataframe. for example; process_date & order_date. Do we need to specify or call this lambda function twice? – vikrant rana Jul 28 '19 at 17:34
2

It would be better if you use this it'll give you the date format in MM-DD-YYYY the you can apply strftime:

df['Date_ColumnName'] = pd.to_datetime(df['Date_ColumnName'], dayfirst = False, yearfirst = False)
Ashu007
  • 745
  • 1
  • 9
  • 13
0

Provided code will work for following scenarios.

  • Change date format from M/D/YY to MM/DD/YY (5/2/2009 to 05/02/2009)
  • change form ANY FORMAT to MM/DD/YY

import pandas as pd

'''
       * checking provided input file date format correct or not
       * if format is correct change date format from M/D/YY to MM/DD/YY
       * else date format is not correct in input file
         Date format  change form ANY FORMAT to MM/DD/YY
  '''
input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/predictions.csv'
dest_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/Enrich.csv'
#input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/enrichment.csv'
read_data = pd.read_csv(input_file_name)
print(pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all())

if pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all():
    print("Provided correct input date format in input file....!")
    read_data['Date'] = pd.to_datetime(read_data['Date'],format='%m/%d/%Y')
    read_data['Date'] = read_data['Date'].dt.strftime('%m/%d/%Y')
    read_data.to_csv(dest_file_name,index=False)
    print(read_data['Date'])
else:
    print("NOT... Provided correct input date format in input file....!")
    data_format = pd.read_csv(input_file_name,parse_dates=['Date'], dayfirst=True)
    #print(df['Date'])
    data_format['Date'] = pd.to_datetime(data_format['Date'],format='%m/%d/%Y')
    data_format['Date'] = data_format['Date'].dt.strftime('%m/%d/%Y')
    data_format.to_csv(dest_file_name,index=False)
    print(data_format['Date'])