-2

I used regex .findall(' ') function to extract date time elements of different formats from strings within a pandas data frame to 'list' objects and made them into a new column named "new." However, there are date time list objects that only of 4-digit YYYY format, lacking month (MM) and day (DD) (e.g., df['new'].iloc[99]), and objects where day (DD) are missing, such as df['new'].iloc[221], like the following:

               new
0        [6/12/2009]
1        [12-10-2013]
2        [7/8/71]
3        [9-27-75]
4        [23rd May, 96]
5        [7/06/79]
         ...
99       [1968]
         ...
221      [8/2009]
         ...
470      [May 22nd, 2015]

Note: each cell is a list object.

As I want to sort them by chronological order after all these date extraction and formatting are done, so for the sake of convenience, for cell values like [1968], I will assume this to be the first day of that year (i.e., January 1, 1968) and for cell values like [8/2009] (or [08/2009]), I will assume it is the first day of the month of that year (i.e., August 1, 2009).

So I am asking is there any way to write a simple function to convert [YYYY] and [M/YYYY] (or [MM/YYYY]) formats all into [MM/DD/YYYY] format, such as

[1968] to [01/01/1968]

[8/2009](or [08/2009]) to [08/01/2009]

What will be the easiest way to do this conversion over (perhaps) hundreds of list objects [] in the column df['new'] that are missing month and day information?

[Edit] I used the following code (with the change_format() function that Bandi A kindly provided)

import pandas as pd
import re
import datetime as dt

#load txt file (no header)
doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)

#use regex findall() to extract datetime from df

df['new'] = df.str.findall(r'\b(\d{1,2}\/\d{1,2}\/\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*-\d{2}-\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}. )\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2} )\d{4}|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{4})|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z.]* (?:\d{4})|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z,]* (?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2})[a-z,]* (?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{4})|\d{1,2}\/\d{4}|\d{4}|\d{1,2}-\d{1,2}-\d{2,4})\b')


from datetime import date
def change_format(a): 
    c = re.split('[-/ ]',a[0])
    b= len(c)
    if b==1:
        return date(int(c[0]),1,1).strftime('%d/%m/%Y')
    elif b==2:
        return date(int(c[1]),int(c[0]),1).strftime('%d/%m/%Y')
    else:
        return a[0]

df['modified_new'] = map(change_format,df['new'])
df['modified_new']

The extracted date time output (i.e., df['new']) looks like this (note the empty cell [] because the original text string contains irregular format)

0                 [03/25/93]
1                  [6/18/85]
2                   [7/8/71]
3                  [9/27/75]
4                   [2/6/96]
5                  [7/06/79]
6                  [5/18/78]
7                 [10/24/89]
8                   [3/7/86]
9                  [4/10/71]
10                 [5/11/85]
   ...
490                   [2007]
491                   [2009]
492                   [1986]
493                       []
494                   [2002]
495                   [1979]
496                   [2006]
497                   [2008]
498                   [2005]
499                   [1980]
khelwood
  • 55,782
  • 14
  • 81
  • 108
Chris T.
  • 1,699
  • 7
  • 23
  • 45

1 Answers1

1
import pandas as pd
import re
from datetime import datetime 

def helper(a ,f):
    return datetime.strptime(a,f).strftime('%m-%d-%Y')

def change_format(a):
    #print a
    if 'Janaury' in a:
        a = a[:3]+a[7:]
    if 'Decemeber' in a:
        a = a[:3]+a[9:]
    c = re.split('/|-| ',a)
    b = len(c)
    if re.match(r'\d\d [A-Z]',a) != None:
        if len(c[1]) == 3:
            return helper(a,'%d %b %Y')
        else:
            return helper(a, '%d %B %Y')
    elif re.match(r'[A-Z]',a) != None:
        if len(c) == 2:
            if len(c[0]) == 3: 
                return helper(a+' 1','%b %Y %d')
            else:
                return helper(a+' 1','%B %Y %d')
        if len(c[0]) == 3:
            if ',' in a:
                return helper(a,'%b %d, %Y')
            else:
                return  helper(a,'%b %d %Y')
        else:
            if ',' in a:
                return helper(a,'%B %d, %Y')
            else:
                return helper(a,'%B %d %Y')
    else:
        if b==3:
            if len(c[2]) == 2:
                if '-' in a:
                    return helper(a,'%m-%d-%y')
                else:
                    return helper(a ,'%m/%d/%y')
            elif len(c[2]) == 4:
                return date(int(c[2]),int(c[0]),int(c[1])).strftime('%m-%d-%Y')
        elif b==2:
            return date(int(c[1]),int(c[0]),1).strftime('%m-%d-%Y')
        else:
            return date(int(c[0]),1,1).strftime('%m-%d-%Y')

with open('dates.txt') as f:
    d = f.read()
    f.close()
k = re.findall(r'\b(\d{1,2}\/\d{1,2}\/\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*-\d{2}-\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}. )\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2} )\d{4}|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{4})|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z.]* (?:\d{4})|(?:\d{2} )(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z,]* (?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2})[a-z,]* (?:\d{4})|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{4})|\d{1,2}\/\d{4}|\d{4}|\d{1,2}-\d{1,2}-\d{2,4})\b',d)   
k.remove('7787')
dates =map(change_format,k)
dates.remove(None)
df = pd.DataFrame(dates,columns= ['date'])
df['date'] =pd.to_datetime(df.date)
df = df.sort_values('date').reset_index(drop=True)
Anirudh Bandi
  • 1,171
  • 10
  • 20
  • Thanks for your reply. I converted my data frame to list using df['new'].tolist(), creating a list object names df_list, and applied map(change_format,df_list), it didn't change anything in the list but added one more entry (a nan) at the bottom of the list. – Chris T. Aug 06 '17 at 09:17
  • Also, one cannot 'split' a list object. I think that's part of the reason why I cannot apply this function over my data frame column. – Chris T. Aug 06 '17 at 09:24
  • Hi, they still the same and an additional 'nan' was also added at the end of the list. – Chris T. Aug 06 '17 at 12:12
  • Hi, thanks again for your quick response. I used the your code on dataframe, but the output of df['modified_new'] returns this message – Chris T. Aug 06 '17 at 12:37
  • Hi, it still doesn't work. My jupyter editor returns 501 rows of '' and, again, one additional row (nan) was added to the original dataframe. I was trying to figure what went wrong in the last hour, but still couldn't figure out. – Chris T. Aug 06 '17 at 14:56
  • I am wondering if I lost something in the transition from the original text string data 'df = pd.Series(doc)' to 'df.str.findall(r' ')' which I used to extract date time element from text string, causing the error? If that step was wrongly executed, then creating a new pd.DataFrame() won't be of much help. – Chris T. Aug 06 '17 at 15:20
  • i've started a chat room, can you access it? its named 'date formatting' – Anirudh Bandi Aug 06 '17 at 15:49