
This is my code. It is almost functional, except for the 3 lambda functions responsible for replacing and reordering the captured groups of the string (day, month and year). I need help so that they receive the necessary parameters and make the replacements according to the input examples shown below.

import re

def converter_to_date_format(input_text, date_format_write_type):

    year, month, days_intervale_or_day = "", "", ""

    #Choose the output ordering according to the value passed as the 2nd argument of the function:
    if  (date_format_write_type == 'dd-mm-aa'):  date_restructuring_structure = days_intervale_or_day + "-" + month + "-" + year
    elif(date_format_write_type == 'mm-dd-aa'):  date_restructuring_structure = month + "-" + days_intervale_or_day + "-" + year
    elif(date_format_write_type == 'aa-mm-dd'):  date_restructuring_structure = year + "-" + month + "-" + days_intervale_or_day


    #Set the identification restrictions to decide whether or not to perform any of the replacements:

    #If it has the day interval "[first day -- second day]" and zero or more year digits, r"\d*"
    detection_regex_not_obligatory_preposition_01 = r"(?:[" + r"\d{2}" + " -- " + r"\d{2})" + "]" + r"|" + r"\d{2}" + r")" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d*"

    #If it does NOT have the day interval "[first day -- second day]" but it has a year restricted to 4 numeric digits, r"\d{4}"
    detection_regex_not_obligatory_preposition_02 = r"\d{2}" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de |)[\s|]" + r"\d{4}"

    #If and only if there is no day interval and no 4-digit year, restrict the change to substrings that contain all the indicated prepositions, so these are no longer optional
    detection_regex_obligatory_preposition = r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d{2}" + r"[\s|](?:del|de[\s|]el|de )[\s|]" + r"\d*"


    #Do the replacement if any of the identification constraints are met, and according to the parameters that will be sent to the corresponding lambda function
    input_text = re.sub(detection_regex_not_obligatory_preposition_01, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)

    input_text = re.sub(detection_regex_not_obligatory_preposition_02, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)

    input_text = re.sub(detection_regex_obligatory_preposition , lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)


    return input_text # --> re-structured date string

Since the input sentences are in Spanish, dates are assumed to always appear in day-month-year order. The return format then depends on the value of the date_format_write_type parameter passed to converter_to_date_format(). The fixed input order is an advantage: the capture groups will always be the day (digits) or an interval of days ([2 digits -- 2 digits]), then the month (digits), and finally the year (digits).
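One detail of the function above is worth illustrating: date_restructuring_structure is built while year, month and days_intervale_or_day are still empty strings, so it is always "--" by the time the lambdas run. The snippet below is only a sketch of that effect, plus a hypothetical helper named restructure() (not part of the original code) that builds the output from values supplied per match instead.

```python
import re

# What the original lambdas effectively do: the template was concatenated
# from empty strings, so it is always "--". The empty pattern r"" matches
# at position 0, so with count=1 the template is prepended to the match.
year, month, day = "", "", ""
date_restructuring_structure = year + "-" + month + "-" + day
print(repr(date_restructuring_structure))  # '--'
print(re.sub(r"", date_restructuring_structure, "14 de 09 de 2022", 1))  # --14 de 09 de 2022


# Hypothetical helper (an assumption, not in the original): build the
# output string from the three fields each time it is called.
def restructure(day, month, year, date_format_write_type):
    if date_format_write_type == "dd-mm-aa":
        return f"{day}-{month}-{year}"
    if date_format_write_type == "mm-dd-aa":
        return f"{month}-{day}-{year}"
    if date_format_write_type == "aa-mm-dd":
        return f"{year}-{month}-{day}"
    raise ValueError(f"unknown format: {date_format_write_type!r}")


print(restructure("14", "09", "2022", "aa-mm-dd"))  # 2022-09-14
```

Because the helper receives the fields as arguments, it can be called from inside a replacement function once the regex exposes day, month and year as capture groups.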

Here are some examples that simulate the possible input cases the regex could receive:

#for simple dates
input_text = "14 de 09 de 2022" #example 1
input_text = "son 78 del 30 del 10 de 2021 del 2021" #example 2
input_text = "serian del 30 del 30 10 del 2021 2027" #example 3
input_text = "14 09 2022" #example 4

# If the year does not have 4 numeric digits
input_text = "14 14 de 09 220 2250" #Not modify! - example 5
input_text = "14 del 14 09 del 220 2250" #Yes modify! - example 6

#For day intervals
input_text = "[01 -- 31] de 10 del 2022" #example 7
input_text = "5454 [01 -- 02] 10 20222 445" #example 8


#Choose the output format for standardized dates
#date_format_write_type = 'dd-mm-aa'
#date_format_write_type = 'mm-dd-aa'
date_format_write_type = 'aa-mm-dd'

input_text = converter_to_date_format(input_text, date_format_write_type)
print(repr(input_text))  # --> output

Correct outputs for each of these cases (in this case for the format 'aa-mm-dd'):

#for simple dates
"2022-09-14" #for example 1
"son 78 del 2021-10-30 del 2021" #for example 2
"serian del 30 del 30 10 del 2021 2027" #for example 3
"14 09 2022" #for example 4

# If the year does not have 4 numeric digits
"14 14 de 09 220 2250" #Not modify! - for example 5
"14 220-09-14 2250" #Yes modify! - for example 6

#For day intervals
"2022-10-[01 -- 31]" #for example 7
"5454 20222-10-[01 -- 02] 445" #for example 8

How should I define the capture groups in the regexes so that the lambda functions make the replacements in the order determined by the value passed to date_format_write_type?

And what should be the structure of the lambda functions that extract the capture groups and replace them with the desired substrings?

I mean these lines of the code: re.sub(detection_regex_not_obligatory_preposition_01, lambda m: re.sub(r"", date_restructuring_structure, m.group(), 1), input_text)
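To make the question concrete, here is a minimal sketch of one possible shape for such a replacement, using named capture groups and replacement templates. It is not a full solution: it assumes both prepositions are mandatory and the year has exactly 4 digits, so it only covers examples 1 to 5 and 7 (examples 6 and 8 would need additional rules). The names PREP, DATE_RE, TEMPLATES and convert_dates_sketch are hypothetical.

```python
import re

# Assumption: a preposition is required between the fields, and the year
# has exactly 4 digits. This is narrower than the three rules in the question.
PREP = r"(?:del|de el|de)"
DATE_RE = re.compile(
    r"(?<!\d)(?P<day>\[\d{2} -- \d{2}\]|\d{2})"  # a day, or a [dd -- dd] interval
    r"\s+" + PREP + r"\s+(?P<month>\d{2})"
    r"\s+" + PREP + r"\s+(?P<year>\d{4})(?!\d)"
)

# One replacement template per output format; \g<name> refers to a named group.
TEMPLATES = {
    "dd-mm-aa": r"\g<day>-\g<month>-\g<year>",
    "mm-dd-aa": r"\g<month>-\g<day>-\g<year>",
    "aa-mm-dd": r"\g<year>-\g<month>-\g<day>",
}


def convert_dates_sketch(input_text, date_format_write_type):
    """Reorder every matched date according to the chosen template."""
    return DATE_RE.sub(TEMPLATES[date_format_write_type], input_text)


print(convert_dates_sketch("14 de 09 de 2022", "aa-mm-dd"))
print(convert_dates_sketch("son 78 del 30 del 10 de 2021 del 2021", "aa-mm-dd"))
print(convert_dates_sketch("[01 -- 31] de 10 del 2022", "aa-mm-dd"))
```

If a callable is preferred over a template string, the equivalent would be lambda m: m.expand(TEMPLATES[date_format_write_type]), since re.sub() accepts either form as its replacement argument.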

Matt095
  • It's incomprehensible and unclear. You also need to provide a [mre] and debugging details. Anyway, do you want to reorder subgroups in a certain order? Why not just replace three times in a function, like ```'yy-mm-aa'.replace('yy', m.group(1))```? – relent95 Oct 21 '22 at 16:24
  • @relent95 The program does work; the only thing that needs to be changed is what is indicated in the 3 lines with the lambda functions. Above each line, I've added a comment about what that line of code is for. You can't do that because yy is just a placeholder that is meant to be replaced by the actual date values. yy-mm-dd is a generic abbreviation for years-months-days, and as indicated in the examples, those positions will hold numeric values. Please check the examples and comments in the question. – Matt095 Oct 21 '22 at 16:52
  • Your code is not MINIMAL. By replacing 3 times I meant something like ```f = date_format_write_type; f.replace('dd', m.group(0)).replace('mm', m.group(1)).replace('aa', m.group(2)) ```. – relent95 Oct 21 '22 at 17:17
  • @relent95 What you say cannot be done, because the fields are not those literal values, they are numbers. `"yy-mm-dd"` is just a template, e.g. `"el 10 del 09 de 2022 555"` should be converted to `"el 10-09-2022 555"`. If you find a shorter way it would help me a lot, but I did not include unnecessary code; the way I posed the problem, all of the code is necessary. – Matt095 Oct 21 '22 at 17:25
  • You should extract each 'dd', 'mm' and 'aa' field from the pattern, doing like ```m.group(0)``` not ```m.group()```. In my example, ```f``` is a template, but the result of ```f.replace(...).replace(...).replace(...)``` is a string with numbers like '10-09-2022', which is what you wanted. – relent95 Oct 21 '22 at 17:38
  • @relent95 I tried your code, but I don't understand it, and it gives me errors :S – Matt095 Oct 22 '22 at 03:38
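
The chained-replace idea discussed in the comments above could be sketched roughly as follows. This is an illustration under stated assumptions, not code posted by either commenter: SIMPLE_DATE_RE and fill_template are hypothetical names, and the pattern only handles a plain "dd de/del mm de/del yyyy" date. Note that m.group(0) and m.group() both return the whole match, so the individual fields start at m.group(1).

```python
import re

# Assumption: a deliberately simple pattern with three numbered groups.
SIMPLE_DATE_RE = re.compile(r"(\d{2}) (?:del|de) (\d{2}) (?:del|de) (\d{4})")


def fill_template(m, date_format_write_type):
    """Treat the format key itself as a template and substitute each field."""
    return (
        date_format_write_type
        .replace("dd", m.group(1))  # group 1: day
        .replace("mm", m.group(2))  # group 2: month
        .replace("aa", m.group(3))  # group 3: year
    )


text = "el 10 del 09 de 2022 555"
print(SIMPLE_DATE_RE.sub(lambda m: fill_template(m, "dd-mm-aa"), text))  # el 10-09-2022 555
```

The chained replace is safe here only because the substituted values are digits and can never contain the letters "mm" or "aa" themselves.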
