import re
input_text = "((PL_ADVB)alrededor ((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña." #example input
place_reference = r"(?i:[\w,;.]\s*)+?"
list_all_adverbs_of_place = ["adentro", "dentro", "al rededor", "alrededor", "abajo", "hacía", "hacia", "por sobre", "sobre"]
list_limiting_elements = list_all_adverbs_of_place + ["vimos", "hemos visto", "encontramos", "hemos encontrado", "rápidamente", "rapidamente", "intensamente", "durante", "luego", "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]
pattern = re.compile(rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*((?={'|'.join(re.escape(x) for x in list_limiting_elements)}))", flags = re.IGNORECASE)
input_text = re.sub(
    pattern,
    lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}" if m[2] else f"((PL_ADVB){m[1]} NO_DATA){m[3]}",
    input_text,
)
print(repr(input_text))  # --> output
How can I make the regex stored in the variable pattern capture text only when that text is not enclosed between ((NOUN) and its closing ), that is, not inside
((NOUN) "the captured text" )
This should prevent the string
((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz))
from being turned into
((NOUN)(del auto rojizo, ((PL_ADVB)dentro de algo grande y completamente veloz)))
The reason is that dentro de algo grande y completamente veloz sits between ((NOUN) and its closing ). The regex should refuse to capture a piece of text only when that text lies between these two delimiters; everywhere else it should behave as before.
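To make the "between these two delimiters" condition concrete, here is a minimal sketch that locates the blocking areas and checks whether a given position falls inside one. The names noun_block, blocked_spans and is_blocked are hypothetical, and the sketch assumes a block always has the shape ((NOUN)( ... )) with no nested )) before its real end.

```python
import re

text = "((PL_ADVB)alrededor ((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña."

# Assumption: a blocking area starts with "((NOUN)(" and ends at the first "))".
noun_block = re.compile(r"\(\(NOUN\)\(.*?\)\)")

# (start, end) offsets of every blocking area found in the text.
blocked_spans = [m.span() for m in noun_block.finditer(text)]

def is_blocked(pos):
    """Return True if the character offset pos lies inside a blocking area."""
    return any(start <= pos < end for start, end in blocked_spans)

inside = text.index("dentro de algo")    # occurrence inside ((NOUN)...)
outside = text.index("dentro del baúl")  # occurrence outside any block
print(is_blocked(inside), is_blocked(outside))  # True False
```

Only the first "dentro" is reported as blocked, which is the distinction the replacement has to respect.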
The correct output would be:
'((PL_ADVB)alrededor ((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl) rápidamente ((PL_ADVB)abajo de una caja) ((PL_ADVB)por sobre ello) vimos una caña.'
As can be seen, in the rest of the string, where the blocking pattern (in this case ((NOUN) blocking area )) is not present, the replacements were made.
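One possible way to get this behaviour, as a sketch rather than a definitive answer: put the blocking area first in an alternation, so the regex consumes a whole ((NOUN)...) block before the adverb branch can look inside it, and have the replacement function return such blocks unchanged. Several things here are assumptions and not part of the original code: the names noun_block and tag_place, the shape of the block (ending at the first "))"), making the adverb group mandatory so the callback never receives None, and moving the trailing whitespace of the place reference outside the closing parenthesis.

```python
import re

input_text = "((PL_ADVB)alrededor ((NOUN)(del auto rojizo, dentro de algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña."

place_reference = r"(?i:[\w,;.]\s*)+?"
list_all_adverbs_of_place = ["adentro", "dentro", "al rededor", "alrededor", "abajo", "hacía", "hacia", "por sobre", "sobre"]
list_limiting_elements = list_all_adverbs_of_place + ["vimos", "hemos visto", "encontramos", "hemos encontrado", "rápidamente", "rapidamente", "intensamente", "durante", "luego", "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

adverbs = "|".join(re.escape(x) for x in list_all_adverbs_of_place)
limits = "|".join(re.escape(x) for x in list_limiting_elements)

# Assumption: a blocking area starts with "((NOUN)(" and ends at the first "))".
noun_block = r"\(\(NOUN\)\(.*?\)\)"

pattern = re.compile(
    rf"(?P<block>{noun_block})"  # branch 1: swallow a whole blocking area
    rf"|(?:(?<=\s)|^)(?P<adv>{adverbs})(?P<place>\s+{place_reference})\s*(?={limits})",
    flags=re.IGNORECASE,
)

def tag_place(m):
    # A blocking area was matched: give it back untouched.
    if m.group("block") is not None:
        return m.group("block")
    place = m.group("place")
    body = place.rstrip()
    # Keep any trailing whitespace outside the closing parenthesis.
    return f"((PL_ADVB){m.group('adv')}{body}){place[len(body):]}"

result = pattern.sub(tag_place, input_text)
print(repr(result))
```

Because re.sub scans left to right and never revisits consumed text, everything inside the block is skipped as a unit, while the adverbs after it are still tagged. If blocks can contain nested "))" sequences, the noun_block expression would need to be adapted.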