I'm trying to parse dates from individual health records. Since the entries appear to be manual, the date formats are all over the place. My regex patterns are apparently not making the cut for several observations. Here's the list of tasks I need to accomplish along with the accompanying code. Dataframe has been subsetted to 15 observations for convenience.
- Parse dates:
#Create DF:
health_records = ['08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',
'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',
'28 Sep 2015 Primary Care Doctor:',
'06 Mar 1974 Primary Care Doctor:',
'none; but currently has appt with new HJH PCP Rachel Salas, MD on October. 11, 2013 Other Agency Involvement: No',
'.Came back to US on Jan 24 1986, saw Dr. Quackenbush at Beaufort Memorial Hospital. Checked VPA level and found it to be therapeutic and confirmed BPAD dx. Also, has a general physician exam and found to be in good general health, except for being slightly overwt',
'September. 15, 2011 Total time of visit (in minutes):',
'sLanguage based learning disorder, dyslexia. Placed on IEP in 1st grade through Westerbrook HS prior to transitioning to VFA in 8th grade. Graduated from VF Academy in May 2004. Attended 1.5 years college at Arcadia.Employment Currently employed: Yes',
') - Zoloft 100 mg daily: February, 2010 : self-discontinued due to side effects (unknown)',
'6/1998 Primary Care Doctor:',
'12/2008 Primary Care Doctor:',
'ran own business for 35 years, sold in 1985',
'011/14/83 Audit C Score Current:',
'safter evicted in February 1976, hospitalized at Pemberly for 1 mo.Hx of Outpatient Treatment: No',
'. Age 16, 1991, frontal impact. out for two weeks from sports.',
's Mr. Moss is a 27-year-old, Caucasian, engaged veteran of the Navy. He was previously scheduled for an intake at the Southton Sanitorium in January, 2013 but cancelled due to ongoing therapy (see Psychiatric History for more details). He presents to the current intake with primary complaints of sleep difficulties, depressive symptoms, and PTSD.']
import numpy as np
import pandas as pd
df = pd.DataFrame(health_records, columns=['records'])
#Date parsing: patten 1:
df['new'] = (df['records'].str.findall(r'\d{1,2}.\d{1,2}.\d{2,4}')
.astype(str).str.replace(r'\[|\]|\(|\)|,|\'', '').str.strip())
#Date parsing pattern 2:
df['new2'] = (df['records'].str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')
.astype(str).str.replace(r'\[|\]|\'|\,','').str.strip())
df['date'] = df['new']+df['new2']
and here is the output:
df['date']
0 08/11/78
1 7/11/77 16/0.83 4.9/36
2 28 Sep 2015
3 06 Mar 1974
4
5 24 1986
6
7 May 2004
8
9 6/1998
10 12/2008
11
12 011/14
13 February 1976
14
15
As you can see in some places the code works perfectly, but in complex sentences my pattern is not working or spitting out inaccurate results. Here is a list of all possible date combinations:
- 04/20/2009; 04/20/09; 4/20/09; 4/3/09;
- Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
- 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009 Mar 20th, 2009;
- Mar 21st, 2009; Mar 22nd, 2009;
- Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009
- 2009; 2010
- Clean dates
Next I tried to clean dates, using a solution provided here - while it should work, since my format is similar to one in that problem, but it doesn't.
#Clean dates to date format
df['clean_date'] = df.date.apply(
lambda x: pd.to_datetime(x).strftime('%m/%d/%Y'))
df['clean_date']
The above code does not work. Any help would be deeply appreciated. Thanks for your time!