I used regex .findall(' ') function to extract date time elements of different formats from strings within a pandas data frame to 'list' objects and made them into a new column named "new." However, there are date time list objects that only of 4-digit YYYY format, lacking month (MM) and day (DD) (e.g., df['new'].iloc[99]), and objects where day (DD) are missing, such as df['new'].iloc[221], like the following:
new
0 [6/12/2009]
1 [12-10-2013]
2 [7/8/71]
3 [9-27-75]
4 [23rd May, 96]
5 [7/06/79]
...
99 [1968]
...
221 [8/2009]
...
470 [May 22nd, 2015]
Note: each cell is a list object.
As I want to sort them by chronological order after all these date extraction and formatting are done, so for the sake of convenience, for cell values like [1968], I will assume this to be the first day of that year (i.e., January 1, 1968) and for cell values like [8/2009] (or [08/2009]), I will assume it is the first day of the month of that year (i.e., August 1, 2009).
So I am asking is there any way to write a simple function to convert [YYYY] and [M/YYYY] (or [MM/YYYY]) formats all into [MM/DD/YYYY] format, such as
[1968] to [01/01/1968]
[8/2009](or [08/2009]) to [08/01/2009]
What will be the easiest way to do this conversion over (perhaps) hundreds of list objects [] in the column df['new'] that are missing month and day information?
[Edit] I used the following code (with the change_format() function that Bandi A kindly provided)
import pandas as pd
import re
import datetime as dt
#load txt file (no header)
doc = []
with open('dates.txt') as file:
for line in file:
doc.append(line)
df = pd.Series(doc)
#use regex findall() to extract datetime from df
df['new'] = df.str.findall(r'\b(\d{1,2}\/\d{1,2}\/\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*-\d{2}-\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}. )\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2} )\d{4}|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{4})|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z.]* (?:\d{4})|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z,]* (?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2})[a-z,]* (?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{4})|\d{1,2}\/\d{4}|\d{4}|\d{1,2}-\d{1,2}-\d{2,4})\b')
from datetime import date
def change_format(a):
c = re.split('[-/ ]',a[0])
b= len(c)
if b==1:
return date(int(c[0]),1,1).strftime('%d/%m/%Y')
elif b==2:
return date(int(c[1]),int(c[0]),1).strftime('%d/%m/%Y')
else:
return a[0]
df['modified_new'] = map(change_format,df['new'])
df['modified_new']
The extracted date time output (i.e., df['new']) looks like this (note the empty cell [] because the original text string contains irregular format)
0 [03/25/93]
1 [6/18/85]
2 [7/8/71]
3 [9/27/75]
4 [2/6/96]
5 [7/06/79]
6 [5/18/78]
7 [10/24/89]
8 [3/7/86]
9 [4/10/71]
10 [5/11/85]
...
490 [2007]
491 [2009]
492 [1986]
493 []
494 [2002]
495 [1979]
496 [2006]
497 [2008]
498 [2005]
499 [1980]