This is my code, with some example inputs that simulate the environment in which the program will run:

import re, datetime

#Example input cases
input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)


possible_year_num = r"\d+" #I need one or more numeric digits (never zero digits), so \d+ rather than \d*

current_year = datetime.datetime.today().strftime('%Y')

month_context_regex = r"[\s|]*(?:del[\s|]*mes|de[\s|]*el[\s|]*mes|de[\s|]*mes|del|de[\s|]*el|de)[\s|]*"
year_context_regex = r"[\s|]*(?:del[\s|]*año|de[\s|]*el[\s|]*año|de[\s|]*año|del[\s|]*ano|de[\s|]*el[\s|]*ano|de[\s|]*ano|del|de[\s|]*el|de)[\s|]*"

#I combine those modular regex expressions to build a regex that serves to identify the substring in which the replacement must be performed
identity_replacement_case_regex = r"\[\d{2}" + " -- " + r"\d{2}]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num + year_context_regex + current_year

#Only in these cases do I need to replace with re.sub(); for example 1 the output string should be '[26 -- 31] de 10 del 200'
replacement_without_current_year = r"\[\d{2}" + " -- " + r"\d{2}]" + month_context_regex + r"\d{2}" + year_context_regex + possible_year_num
input_text = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text)

print(repr(input_text))  # --> output

The correct outputs should look like this:

'[26 -- 31] de 10 del 200' #for example 1
'[26 -- 31] de 12 del 206' #for example 2
'[06 -- 11] del 09 del ano 2020' #for example 3
'[06 -- 06] del mes 09 del ano 20' #for example 4
'[16 -- 06] del mes 09 del 2022' #for example 5 (not modified)

How should I write the replacement argument of re.sub() to get these outputs?

I get this error when I try this replacement:

Traceback (most recent call last):
input_text_substring = re.sub(identity_replacement_case_regex, replacement_without_current_year, input_text_substring)
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 2
Matt095
  • What is `r"[\d{2} -- \d{2}]" `? Looks like some corrupt pattern since you are trying to use quantifiers in a character class. – Wiktor Stribiżew Oct 16 '22 at 19:55
  • These are time periods: `[01 - 09]` stands for the days `01`, `02`, `03`, `04`, `05`, `06`, `07`, `08` and `09`. I included that in the patterns because my system detects intervals of days that way – Matt095 Oct 16 '22 at 20:51
  • But `[\d{2} -- \d{2}]` = `[{}\d -]`. This makes no sense. – Wiktor Stribiżew Oct 16 '22 at 20:52
  • From day number `01` to day `09` is a list of days, all of them belonging to the indicated month. All the examples are standardized intervals, ranging from one day number to another (each day number is standardized to 2 digits, `\d{2}`; for example, `01-01-2022` to `03-01-2022` is `[01 -- 03] del 01 de 2022`, and in regex `[\d{2} -- \d{2}]\sde\s\d{2}\sde\s\d*`) – Matt095 Oct 16 '22 at 20:58
  • @WiktorStribiżew I tried a regex replacement but I get this error: `raise s.error('bad escape %s' % this, len(this)) re.error: bad escape \d at position 2`. Do you know what that error means? – Matt095 Oct 16 '22 at 22:08
  • @MatiasNicolasRodriguez I couldn't improve the given regex rule because it is too complex. However, I created a new regex rule which gives the expected output. You can see it [here](https://regex101.com/r/NRUEYO/1). If it is OK, I will add an answer to explain the rule. – Onur Uslu Oct 16 '22 at 22:53

1 Answer

Rule:

\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022

Demo: https://regex101.com/r/NRUEYO/1

Code:

import re

regex = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"

input_text = '[26 -- 31] de 10 del 200 de 2022' #example 1
input_text = '[26 -- 31] de 12 del 206 del 2022' #example 2
input_text = '[06 -- 11] del 09 del ano 2020 del 2022' #example 3
input_text = '[06 -- 06] del mes 09 del ano 20 del ano 2022' #example 4
input_text = '[16 -- 06] del mes 09 del 2022' #example 5 (must not be modified)

replace_text = ""

result = re.sub(regex, replace_text, input_text)

if result:  # skip printing if the whole string was removed
    print(result)
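
Because each assignment to input_text above overwrites the previous one, only example 5 is actually run. A minimal sketch for checking the rule against all five inputs at once could look like this (the names `examples` and `results` are only illustrative and not part of the original code):

```python
import re

# The rule from this answer, unchanged.
regex = r"\D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022"

# The five example inputs from the question.
examples = [
    '[26 -- 31] de 10 del 200 de 2022',
    '[26 -- 31] de 12 del 206 del 2022',
    '[06 -- 11] del 09 del ano 2020 del 2022',
    '[06 -- 06] del mes 09 del ano 20 del ano 2022',
    '[16 -- 06] del mes 09 del 2022',
]

# Remove the trailing "<words> 2022" wherever the rule matches.
results = [re.sub(regex, "", text) for text in examples]

for text, result in zip(examples, results):
    print(repr(text), '->', repr(result))
```

The first four inputs lose their trailing year, and the fifth is left untouched because its `2022` directly follows the two-digit month and a single short word.
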
  • \D => Any non-digit character
  • \d => Any digit character
  • \d{2} => Exactly two digit characters
  • \S => Any non-whitespace character
  • \S{3} => Exactly three non-whitespace characters
  • (?<!A)2022 => Negative lookbehind: 2022 must not be immediately preceded by "A"
  • (?<!\D\d{2} \S{3} )2022 => 2022 must not be immediately preceded by a two-digit number (itself preceded by a non-digit) followed by a three-character word, e.g. " 09 del "
  • (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => The same, for both a three-character and a two-character word, e.g. " 09 del " or " 09 de "
  • \D+(?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022 => Match all non-digit characters directly before the (?<!\D\d{2} \S{3} )(?<!\D\d{2} \S{2} )2022, so that they are removed together with it
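
As for the `bad escape \d` error in the question: the second argument of `re.sub()` is a replacement template, not a regex. It accepts group references such as `\1` or `\g<name>`, but an unknown escape made of a backslash and an ASCII letter, like `\d`, raises `re.error`. One possible alternative, sketched below, is to wrap the part that should be kept in a capture group and reference it in the replacement. This is only a sketch under some assumptions: the year is hard-coded as `"2022"` so the examples are reproducible (the question uses today's year), the context patterns are simplified to plain `\s*`, and the names `strip_year_regex` and `strip_trailing_year` are hypothetical.

```python
import re

# Assumption: fixed year for reproducibility; the question derives it from today's date.
current_year = "2022"

# Simplified context words (assumption: plain \s* instead of the question's [\s|]*).
month_context = r"\s*(?:del\s*mes|de\s*el\s*mes|de\s*mes|del|de\s*el|de)\s*"
year_context = r"\s*(?:del\s*a[ñn]o|de\s*el\s*a[ñn]o|de\s*a[ñn]o|del|de\s*el|de)\s*"

# Group 1 is the part to keep; the trailing "<context> <current year>" lies outside it.
strip_year_regex = (
    r"(\[\d{2} -- \d{2}\]" + month_context + r"\d{2}" + year_context + r"\d+)"
    + year_context + re.escape(current_year)
)


def strip_trailing_year(text):
    """Drop a redundant trailing current year, keeping everything in group 1."""
    # r"\1" is a group reference, which is valid in a replacement template.
    return re.sub(strip_year_regex, r"\1", text)


print(repr(strip_trailing_year('[26 -- 31] de 10 del 200 de 2022')))
print(repr(strip_trailing_year('[16 -- 06] del mes 09 del 2022')))
```

Compared with the lookbehind rule, this keeps the modular structure of the question's patterns, at the cost of a longer expression.
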
Onur Uslu