import re

input_text = 'el dia corrimos juntas hasta el 11° nivel de aquella montaña hasta el 2022_-_12_-_13' 
#input_text = 'desde el  corrimos juntas hasta el 11° nivel de aquella montaña y luego bajamos hasta la salida, hasta el 2022_-_12_-_01 21:00 hs caminamos juntas' #example 2


date_format = r"(?:\(|)\s*(\d*)_-_(\d{2})_-_(\d{2})\s*(?:\)|)"

#text in the middle associated with the date range...
#some_text = r"(?:(?!\.\s*?\n)[^;])*" #but cannot contain ";", ".\s*\n"
some_text = r"(?:(?!\.\s*)[^;])*" #but cannot contain ";", ".\s*"
#some_text = r"(?:[^.;])*" #but cannot contain ";", "."

identification_re_0 = r"(?:el dia|dia|el)\s*(?:del|de\s*el|de |)\s*(" + some_text + r")\s*(?:,\s*hasta|hasta|al|a )\s*(?:el|la|)\s*" + date_format

input_text = re.sub(identification_re_0,
                    lambda m: print(m[1]),
                    input_text, flags=re.IGNORECASE)  # flags by keyword; the 4th positional argument of re.sub is count

#print(repr(input_text)) # --> output

These are the incorrect outputs that I got:

'corrimos juntas hasta el 11° nivel de aquella montaña hast'
'corrimos juntas hasta el 11° nivel de aquella montaña y luego bajamos hasta la salida, hast'

And these are the correct outputs that these examples should produce:

'corrimos juntas hasta el 11° nivel de aquella montaña'
'corrimos juntas hasta el 11° nivel de aquella montaña y luego bajamos hasta la salida'

Why does the non-capturing group `(?:,\s*hasta|hasta|al|a )` appear to try its options backwards? Why does it adapt to the greedy behavior of the pattern that precedes it, in this case `(?:(?!\.\s*)[^;])*`?


Edit with a possible solution:

I have obtained fairly close results, except with example 3: I could not find a way to omit the trailing `()` when `some_text` captures nothing.

import re

input_text = 'desde el 2022_-_12_-_10 corrimos juntas hasta el 11° nivel de aquella montaña hasta el 2022_-_12_-_13' #example 1
#input_text = 'desde el 2022_-_11_-_10 18:30 pm corrimos juntas hasta el 11° nivel de aquella montaña y luego bajamos hasta la salida, hasta el 2022_-_12_-_01 21:00 hs caminamos juntas' #example 2
#input_text = 'desde el 2022_-_11_-_10 18:30 pm hasta el 2022_-_12_-_01 21:00 hs' #example 3

#text in the middle associated with the date range...
#some_text = r"(?:(?!\.\s*?\n)[^;])*" #but cannot contain ";", ".\s*\n"
some_text = r"(?:(?!\.\s*)[^;])*" #but cannot contain ";", ".\s*"
#some_text = r"(?:[^.;])*" #but cannot contain ";", "."

identificate_hours = r"(?:a\s*las|a\s*la|)\s*(?:\(|)\s*(\d{1,2}):(\d{1,2})\s*(?:(am)|(pm))\s*(?:\)|)" #requires 'am' or 'pm' to be given
identificate_hours = r"(?:a\s*las|a\s*la|)\s*(?:\(|)\s*(\d{1,2}):(\d{1,2})\s*(?:(am)|(pm)|)\s*(?:\)|)" #also accepts a missing 'am'/'pm'; this definition overrides the one above

date_format = r"(?:\(|)\s*(\d*)_-_(\d{2})_-_(\d{2})\s*(?:\)|)"

# (?:,\s*hasta|hasta|al|a )
some_text_limiters = [r",\s*hasta", r"hasta", r"al", r"a "]

for some_text_limiter in some_text_limiters:

    identification_re_0 = r"(?:(?<=\s)|^)(?:desde\s*el|desde|del|de\s*el|de\s*la|de |)\s*(?:día|dia|fecha|)\s*(?:del|de\s*el|de |)\s*" + date_format + r"\s*(?:" + identificate_hours + r"|)\s*(?:\)|)\s*(" + some_text + r")\s*" + some_text_limiter + r"\s*(?:el|la|)\s*(?:fecha|d[íi]a|)\s*(?:del|de\s*el|de|)\s*" + date_format + r"\s*(?:" + identificate_hours + r"|)\s*(?:\)|)"

    input_text = re.sub(identification_re_0,
                        lambda m: (f"({m[1]}_-_{m[2]}_-_({m[3]}({m[4] or '00'}:{m[5] or '00'} {m[6] or m[7] or 'am'})_--_{m[9]}_-_{m[10]}_-_({m[11]}({m[12] or '00'}:{m[13] or '00'} {m[14] or m[15] or 'am'})))({m[8]})").replace(" )", ")").replace("( ", "("),
                        input_text, flags=re.IGNORECASE)  # flags by keyword; the 4th positional argument of re.sub is count


print(repr(input_text))
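
For the example 3 detail mentioned above, one option is to decide inside the replacement function whether to emit the trailing parentheses. The following is only a minimal sketch under stated assumptions: `STAMP`, `PATTERN`, `format_range` and `rewrite` are hypothetical names, the pattern is a simplified stand-in that requires a literal `desde el` prefix, and the output format is deliberately simpler than the one produced by the code above. It also uses a lazy quantifier for the middle text instead of the loop over limiters.

```python
import re

# Simplified date with an optional "HH:MM" time and optional am/pm (assumption:
# this is a reduced stand-in for date_format + identificate_hours).
STAMP = r"(\d{4}_-_\d{2}_-_\d{2}(?:\s+\d{1,2}:\d{2}(?:\s*(?:am|pm))?)?)"

PATTERN = re.compile(
    r"desde\s+el\s+" + STAMP
    + r"\s*([^.;]*?)\s*"                            # lazy middle text, may be empty
    + r"(?:,\s*hasta|hasta|al|a )\s*(?:el|la|)\s*"  # same limiters as the question
    + STAMP,
    flags=re.IGNORECASE,
)

def format_range(m):
    """Build the replacement; add "(text)" only when some text was captured."""
    middle = m[2].strip(" ,")
    date_range = f"({m[1]}_--_{m[3]})"
    return f"{date_range}({middle})" if middle else date_range

def rewrite(text):
    return PATTERN.sub(format_range, text)

example_1 = ("desde el 2022_-_12_-_10 corrimos juntas hasta el 11° nivel "
             "de aquella montaña hasta el 2022_-_12_-_13")
example_3 = "desde el 2022_-_11_-_10 18:30 pm hasta el 2022_-_12_-_01 21:00 hs"

print(repr(rewrite(example_1)))
print(repr(rewrite(example_3)))
```

With example 3 the captured middle text is empty, so only the date range is emitted and no empty `()` is appended.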
Matt095
  • I don't think this is something you can set. Either the regexp engine does left-to-right or longest match, you can't change it. – Barmar Jan 09 '23 at 21:50
  • I don't understand why, in this case, instead of starting with `",\s*hasta"` it goes straight to the last option, `"a "`. I guess it's because regexes are greedy by default, so the `(?:,\s*hasta|hasta|al|a )` group adapts to the `(?:(?!\.\s*)[^;])*` pattern and lets it capture as many characters as possible. That is what makes this code malfunction. :S – Matt095 Jan 09 '23 at 22:09
  • See https://www.regular-expressions.info/alternation.html and the discussion of regex-directed and text-directed engines. – Barmar Jan 09 '23 at 22:10
  • If, as it says there, it were only down to the eager aspect of the regex, the program should already work, since *It stops searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters.* Is it possible to customize the regex so that it behaves eagerly rather than greedily? – Matt095 Jan 09 '23 at 22:15
  • No, that's what I said in my first comment. Either the engine is text-directed or regex-directed. text-directed engines return the longest match, regex-directed engines return the leftmost match. – Barmar Jan 09 '23 at 22:20
  • Thanks, in that case I think all that remains is to restructure the identification regex pattern to something equivalent that can be processed with this regex engine – Matt095 Jan 09 '23 at 22:27
  • I don't think you can do it in a single regexp. If you want to give precedence to certain alternatives that aren't longest, check for them first. Then try another regexp with the longer alternative, and so on. – Barmar Jan 09 '23 at 22:30
  • @Barmar I have been trying partial regexes, validating them one by one. However, I have not managed to combine the restrictions of both patterns to obtain this kind of output – Matt095 Jan 10 '23 at 00:09
  • @MatiasNicolasRodriguez is the `(?:el dia|dia|el)\s*(?:del|de\s*el|de |)\s` part of your regex meant to remove the text before `corrimos`? – Ramesh Jan 10 '23 at 04:44
  • @Ramesh There I have edited the question with the code that I put together to avoid that the word "hast" is captured incorrectly. Except for a detail with example 3, the code that I just added works more or less well – Matt095 Jan 10 '23 at 05:03
  • Your updated code is not working properly: I was getting output identical to the input. Based on your expected solution, you are trying to capture the text between two markers such as `el dia` and `hasta el 2022_-_12_-_13`. – Ramesh Jan 10 '23 at 05:18
  • In the question I simplified the code, but at the end I have added the complete code. It is the same, except that instead of only capturing what the `some_text` pattern matches and printing it with print(), the full code takes that captured text, `{m[8]}`, and places it in the replacement – Matt095 Jan 10 '23 at 05:27
  • I don't know if it will be the best solution, but now I have edited it and posted it as a possible answer to the question. Since I couldn't find a way to do the capture with a single regex, I decided to include a for loop that iterates the possibilities and not let (?: | |) take care of deciding the check order – Matt095 Jan 10 '23 at 05:31
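
The behaviour discussed in the comments above can be reproduced in isolation. This is a minimal sketch, assuming a simplified date pattern (`DATE`, `LIMITER`, `greedy` and `lazy` are illustrative names, not part of the original code). The alternation is always tried left to right, but only at the position that the preceding quantifier hands over. A greedy `*` consumes everything and backtracks from the right, so the first position where any alternative lets the rest succeed is inside the last `hasta`, where only `"a "` fits. A lazy `*?` grows from the left and reaches the whole word first.

```python
import re

text = ("el dia corrimos juntas hasta el 11° nivel de aquella montaña "
        "hasta el 2022_-_12_-_13")

DATE = r"\d{4}_-_\d{2}_-_\d{2}"          # simplified date pattern (assumption)
LIMITER = r"(?:,\s*hasta|hasta|al|a )"   # same alternation as in the question
TAIL = r"\s*" + LIMITER + r"\s*(?:el|la|)\s*" + DATE

# Greedy middle text: backtracks from the end of the string.
greedy = re.search(r"el dia\s*([^.;]*)" + TAIL, text)
# Lazy middle text: expands one character at a time from the start.
lazy = re.search(r"el dia\s*([^.;]*?)" + TAIL, text)

print(repr(greedy[1]))  # ends in 'hast': the limiter matched was "a "
print(repr(lazy[1]))    # ends in 'montaña': the limiter matched was "hasta"
```

The order of the alternatives only decides between options that can match at the same position; it does not decide which position is tried first.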

1 Answer

You can match the date strings, replace each of them with a placeholder symbol (make sure it does not otherwise occur in the text), and then extract the text between the placeholders.

import re

re_exp = r'((?:hasta el))?\s\d{4}\_\-\_\d{2}\_\-\_\d{2}\s?((?:\d{2}\:\d{2}\s(?:am|pm)?)?)'
input_text = 'desde el 2022_-_12_-_10 corrimos juntas hasta el 11° nivel de aquella montaña hasta el 2022_-_12_-_13'
input_text = 'desde el 2022_-_11_-_10 18:30 pm corrimos juntas hasta el 11° nivel de aquella montaña y ' \
             'luego bajamos hasta la salida, hasta el 2022_-_12_-_01 21:00 hs caminamos juntas'
input_text = "desde el 2022_-_11_-_10 18:30 pm hasta el 2022_-_12_-_01 21:00 hs"
data = re.sub(re_exp, "@*@", input_text)
text_btw_dates = [i.replace('@', '').strip().strip(".,") for i in data.split('*') if
                  i.startswith('@') and i.endswith('@') and len(i) > 1]
print(text_btw_dates)

>>> ['corrimos juntas hasta el 11° nivel de aquella montaña']
>>> ['corrimos juntas hasta el 11° nivel de aquella montaña y luego bajamos hasta la salida']
>>> ['']
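
A variation on the same idea, sketched here under assumptions rather than taken from the answer itself, is to split the text on the date pattern directly, which avoids having to pick a placeholder that never occurs in the input. `DATE_SPLIT` and `text_between_dates` are hypothetical names, and the pattern is a simplified guess at the date and time formats used in the examples.

```python
import re

# Hypothetical pattern: an optional ", hasta el" / "hasta el" prefix, a date,
# and an optional "HH:MM" time with optional am/pm. It has no capturing
# groups, so re.split returns only the text between the matches.
DATE_SPLIT = re.compile(
    r"(?:,?\s*hasta\s+el)?\s*\d{4}_-_\d{2}_-_\d{2}"
    r"(?:\s+\d{1,2}:\d{2}(?:\s*(?:am|pm))?)?",
    flags=re.IGNORECASE,
)

def text_between_dates(text):
    """Return the stripped text found between consecutive dates."""
    parts = DATE_SPLIT.split(text)
    # parts[0] precedes the first date and parts[-1] follows the last one.
    return [part.strip(" .,") for part in parts[1:-1]]

print(text_between_dates(
    "desde el 2022_-_12_-_10 corrimos juntas hasta el 11° nivel "
    "de aquella montaña hasta el 2022_-_12_-_13"
))
```

If the text contains fewer than two dates, the slice `parts[1:-1]` is empty and the function returns an empty list.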
Ramesh
  • Thank you very much for the help, enclosing the phrase inside symbols is a way of limiting it internally to extract it – Matt095 Jan 10 '23 at 08:47