This is my code. It is almost functional, except for the 3 lambda functions in charge of replacing and reordering the captured parts of the string (day, month and year). I need help so that they receive the necessary parameters and make the replacements according to the input examples shown below in this question.
import re

def converter_to_date_format(input_text, date_format_write_type):
    year, month, days_intervale_or_day = "", "", ""

    # Choose the date restructuring structure according to the value passed as the 2nd argument of the function:
    if (date_format_write_type == 'dd-mm-aa'): date_restructuring_structure = days_intervale_or_day + "-" + month + "-" + year
    elif (date_format_write_type == 'mm-dd-aa'): date_restructuring_structure = month + "-" + days_intervale_or_day + "-" + year
    elif (date_format_write_type == 'aa-mm-dd'): date_restructuring_structure = year + "-" + month + "-" + days_intervale_or_day

    # Set the identification restrictions that decide whether or not to perform any of the replacements:

    # If it has a day interval "[first day -- second day]" (or a 2-digit day) and any number of year digits r"\d*"
    # (the square brackets of the interval are escaped so that they match literally)
    detection_regex_not_obligatory_preposition_01 = r"(?:\[\d{2} -- \d{2}\]|\d{2})" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d*"

    # If it does NOT have the day interval "[first day -- second day]" but it has a year restricted to 4 digits r"\d{4}"
    detection_regex_not_obligatory_preposition_02 = r"\d{2}" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d{4}"

    # If and only if it has neither a day interval nor a 4-digit year, restrict the change to substrings that contain all the prepositions, so these are no longer optional
    detection_regex_obligatory_preposition = r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d*"

    # Do the replacement if any of the identification constraints is met, according to the parameters passed to the corresponding lambda function
    input_text = re.sub(detection_regex_not_obligatory_preposition_01, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)
    input_text = re.sub(detection_regex_not_obligatory_preposition_02, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)
    input_text = re.sub(detection_regex_obligatory_preposition, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)

    return input_text  # --> re-structured date string
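To make the problem with the three lambdas concrete, here is a minimal sketch of what the inner re.sub call does for a single match. It only reuses names from the function above; matched_text is a hypothetical stand-in for what m.group() would return for example 1.

```python
import re

# Same construction as in the function: all three parts are still empty strings.
year, month, days_intervale_or_day = "", "", ""
date_restructuring_structure = year + "-" + month + "-" + days_intervale_or_day

# Hypothetical stand-in for m.group() on example 1.
matched_text = "14 de 09 de 2022"

# The empty pattern r"" matches at position 0, so the template is only
# prepended to the match; nothing is captured or reordered.
result = re.sub(r"", date_restructuring_structure, matched_text, count=1)

print(repr(date_restructuring_structure))  # '--'
print(repr(result))                        # '--14 de 09 de 2022'
```

So the template is built before any date has been read, and the lambdas never look at the day, month or year inside the match.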
Since the input sentences are in Spanish, I assume that dates always appear in the order "days-months-years". Depending on the value of the date_format_write_type parameter passed to converter_to_date_format(), the return format then changes to the desired one. This is an advantage, since the capture groups will always come in the same order: first the day (2 digits) or an interval of days ([2 digits -- 2 digits]), then the month (digits) and finally the year (digits).
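Because the input order is fixed, the reordering step by itself can be sketched without any regex, assuming the three parts have already been extracted as strings. FORMAT_TEMPLATES and restructure_date are hypothetical names used only for illustration; the three format keys are the ones from the question.

```python
# Hypothetical lookup table: one str.format template per output format.
FORMAT_TEMPLATES = {
    "dd-mm-aa": "{day}-{month}-{year}",
    "mm-dd-aa": "{month}-{day}-{year}",
    "aa-mm-dd": "{year}-{month}-{day}",
}

def restructure_date(day, month, year, date_format_write_type):
    """Reorder already-extracted parts; day may be '14' or '[01 -- 31]'."""
    template = FORMAT_TEMPLATES[date_format_write_type]
    return template.format(day=day, month=month, year=year)

print(restructure_date("14", "09", "2022", "aa-mm-dd"))          # 2022-09-14
print(restructure_date("[01 -- 31]", "10", "2022", "dd-mm-aa"))  # [01 -- 31]-10-2022
```

The point of the sketch is that the template must be filled after the parts are known, not before.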
Here are some examples that simulate the possible input cases the regex could receive:
#for simple dates
input_text = "14 de 09 de 2022" #example 1
input_text = "son 78 del 30 del 10 de 2021 del 2021" #example 2
input_text = "serian del 30 del 30 10 del 2021 2027" #example 3
input_text = "14 09 2022" #example 4
# If the year does not have 4 digits
input_text = "14 14 de 09 220 2250" #Do not modify! - example 5
input_text = "14 del 14 09 del 220 2250" #Yes, modify! - example 6
#For day intervals
input_text = "[01 -- 31] de 10 del 2022" #example 7
input_text = "5454 [01 -- 02] 10 20222 445" #example 8
#Choose the output format for standardized dates
#date_format_write_type = 'dd-mm-aa'
#date_format_write_type = 'mm-dd-aa'
date_format_write_type = 'aa-mm-dd'
input_text = converter_to_date_format(input_text, date_format_write_type)
print(repr(input_text)) # --> output
Correct outputs for each of these cases (here for the format 'aa-mm-dd'):
#for simple dates
"2022-09-14" #for example 1
"son 78 del 2021-10-30 del 2021" #for example 2
"serian del 30 del 30 10 del 2021 2027" #for example 3
"14 09 2022" #for example 4
# If the year does not have 4 digits
"14 14 de 09 220 2250" #Do not modify! - for example 5
"14 220-09-14 2250" #Yes, modify! - for example 6
#For day intervals
"2022-10-[01 -- 31]" #for example 7
"5454 20222-10-[01 -- 02] 445" #for example 8
How should I define the capture groups in the regexes, and use them inside the lambda functions, so that the replacements are made in the order determined by the value passed to the variable date_format_write_type?
And what should the structure of these lambda functions be, given that they must extract the capture groups and replace them with the desired substrings?
I mean these lines of the code:
re.sub(detection_regex_not_obligatory_preposition_01, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)
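For reference, one possible shape for such a lambda is sketched here. It uses named capture groups and covers only the narrow case "mandatory prepositions plus a 4-digit year"; the optional-preposition and variable-length-year rules are deliberately left out. GROUP_ORDER, PREPOSITION, DATE_REGEX and convert_dates_sketch are hypothetical names, and the regex is a simplified assumption rather than a drop-in replacement for the three detection regexes.

```python
import re

# Hypothetical lookup: which named group goes first, second and third.
GROUP_ORDER = {
    "dd-mm-aa": ("day", "month", "year"),
    "mm-dd-aa": ("month", "day", "year"),
    "aa-mm-dd": ("year", "month", "day"),
}

# Mandatory preposition between the parts: "de", "del" or "de el".
PREPOSITION = r"\s(?:del|de\sel|de)\s"

# Assumption: day is a 2-digit number or an interval "[dd -- dd]",
# month has 2 digits and year has exactly 4 digits.
DATE_REGEX = re.compile(
    r"(?P<day>\[\d{2} -- \d{2}\]|\b\d{2})" + PREPOSITION
    + r"(?P<month>\d{2})" + PREPOSITION
    + r"(?P<year>\d{4})\b"
)

def convert_dates_sketch(input_text, date_format_write_type):
    order = GROUP_ORDER[date_format_write_type]
    # The lambda receives the match object m and reads each named group from it.
    return DATE_REGEX.sub(lambda m: "-".join(m.group(name) for name in order), input_text)

print(convert_dates_sketch("14 de 09 de 2022", "aa-mm-dd"))           # 2022-09-14
print(convert_dates_sketch("[01 -- 31] de 10 del 2022", "aa-mm-dd"))  # 2022-10-[01 -- 31]
print(convert_dates_sketch("14 09 2022", "aa-mm-dd"))                 # 14 09 2022
```

The key difference from the original lambdas is that the output string is assembled inside the lambda from m.group(...), after the match is known, instead of from a template built out of empty strings before matching.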