import re

input_text = "((PL_ADVB)alrededor ((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña." #example input

place_reference = r"(?i:[\w,;.]\s*)+?"
list_all_adverbs_of_place = ["adentro", "dentro", "al rededor", "alrededor", "abajo", "hacía", "hacia", "por sobre", "sobre"]
list_limiting_elements = list_all_adverbs_of_place + ["vimos", "hemos visto", "encontramos", "hemos encontrado", "rápidamente", "rapidamente", "intensamente", "durante", "luego", "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

pattern = re.compile(rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*((?={'|'.join(re.escape(x) for x in list_limiting_elements)}))", flags = re.IGNORECASE)

input_text = re.sub(pattern,
                    lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}" if m[2] else f"((PL_ADVB){m[1]} NO_DATA){m[3]}",
                    input_text)

print(repr(input_text)) #--> output

How can I make the regex in the variable `pattern` match only when the text to capture is not inside a `((NOUN) ... )` block?

That is, it should prevent this string

((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz))

from becoming this other string:

((NOUN)(del auto rojizo, ((PL_ADVB)dentro de algo grande y completamente veloz)))

This is because `dentro de algo grande y completamente veloz` lies between `((NOUN)` and its closing `)`. The regex should skip a candidate match only when that match falls between these two delimiters.

The correct output would be:

'((PL_ADVB)alrededor ((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl) rápidamente ((PL_ADVB)abajo de una caja) ((PL_ADVB)por sobre ello) vimos una caña.'

As can be seen, the replacements were still made in the rest of the string, where the blocking pattern (here `((NOUN) ... )`) was not present.

Matt095
  • Regular expressions are generally very poor at determining if a match is "inside" something. – Barmar Feb 03 '23 at 21:43
  • @Barmar I was trying to place a simple negative lookahead `(?!s\b)`, or try to be more specific by establishing some concrete boundaries `\b(\(\(NOUN\))?` and `\b(\))?`, but the problem I was having is that I can't be sure where the match will be located within those boundaries, so I was having a hard time with that. Perhaps there is nothing better than this for this particular case. – Matt095 Feb 03 '23 at 21:52
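
Following up on the point that a regex alone is poor at knowing whether a match is "inside" something: one possible workaround is to locate the `((NOUN) ... )` blocks with a small parenthesis counter first, and then let `re.sub` skip any match that overlaps one of them. The sketch below is only an illustration under that assumption. `find_protected_spans` and `sub_outside` are hypothetical helper names, not part of the question's code, and the demo uses a deliberately simple pattern rather than the full `pattern` from the question.

```python
import re

def find_protected_spans(text, opener="((NOUN)"):
    """Return (start, end) index pairs of each balanced block that begins with `opener`."""
    spans = []
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            break
        depth = 0
        end = len(text)  # assumption: an unbalanced block is protected up to the end of the text
        for i in range(start, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        spans.append((start, end))
        pos = end
    return spans

def sub_outside(pattern, repl, text, opener="((NOUN)"):
    """Apply `repl` (a function taking a match) only to matches outside protected blocks."""
    spans = find_protected_spans(text, opener)

    def guarded(m):
        # Leave the match untouched if it overlaps any protected span.
        if any(m.start() < end and m.end() > start for start, end in spans):
            return m[0]
        return repl(m)

    return pattern.sub(guarded, text)

# Simplified demo: wrap "dentro" in brackets, except inside ((NOUN) ... ).
demo_text = "((NOUN)(auto, dentro de algo)) y dentro del baul"
demo_result = sub_outside(re.compile(r"dentro"), lambda m: f"[{m[0]}]", demo_text)
print(demo_result)  # ((NOUN)(auto, dentro de algo)) y [dentro] del baul
```

With the code from the question, the call would presumably look like `sub_outside(pattern, <the replacement lambda>, input_text)`. Whether that yields exactly the expected output depends on how `pattern` behaves around the block boundaries, which this sketch does not verify.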
