In one of my datasets, there is a "test date" variable that is HORRIBLY messy. Seriously, this can not be understated. It is a character string, and the field from which it is pulled is open-text with little to no uniformity. Unfortunately, I have no control over how this data is collected because it is a secondary dataset pulled from a health record management system.
Examples of some of the formats used are listed below. Some dates are represented numerically, others are spelled out, and others are recorded as being some duration ago (from the appointment date field, which is luckily clean).
last year
November 2022
6/16/2022
21 mo ago?
a few months ago
2008
1/2022
~ 2 years ago
8.2022
7/2022
1 year ago at [name of location]
First, how would you first go about this issue? Are there certain formats you'd start with or prioritize over the others? Are there particular packages you'd recommend? The day portion of the date isn't needed; I'm hoping to have a month (mm) column and a year (yyyy) column, or a mm/yyyy column I can then separate.
Obviously, I know I'm not going to be able to capture all these entries; however, I would like to capture as many as feasibly possible because we have a funder requesting minimal unknowns/NAs. I've tried looking for answers on here, but I couldn't find anything that has addressed a date variable THIS messy. Any help is much appreciated!
What I've done so far: Subset the data into four different dataframes based on patterns I noticed in the formatting.
- entries that contain a numerical date (28% of raw df)
- entries that list a month and a year (10% of raw df)
- entries that contain "ago" (37% of raw df)
- catch-all "rest" category that doesn't fit any pattern (24% of raw df)
However, even within these groups, entries are not uniform (e.g. dates can be written as mm/dd/yyyy vs. mm/yyyy vs. mm/dd/yy vs. mm/yy, months can be abbreviated vs. spelled out, and some characters are superfluous like location), so I am unsure how best to proceed.