import re

input_text = "En esta alejada ciudad por la tarde circulan muchos camiones con aquellos acoplados rojos, grandes y bastante pesados, llevándolos por esos trayectos bastante empedrados, polvorientos, y un tanto arenosos. Y incluso bastante desde lejos ya se les puede ver." #example string

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "llevándoles", "llevándole", "llevándolos", "llevándolo", "circularías", "circularía", "circulando", "circulan", "circula", "consiste", "consistían", "consistía", "consistió", "visualizar", "ver", "empolvarle", "empolvar", "verías", "vería", "vieron", "vió", "vio", "ver", "podrías" , "podría", "puede"]

exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
direct_subject_modifiers, noun_pattern = exclude + r"\w+" , exclude + r"\w+"

#modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)\s+|)"
modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"

enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors  + direct_subject_modifiers + "){2,}"

sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + enumeration_of_noun_modifiers


input_text = re.sub(sentence_capture_pattern, r"((NOUN)\g<0>)", input_text, flags=re.I|re.U)
print(repr(input_text)) # --> output

The goal is to capture the word (r"\w+") that comes before the enumeration_of_noun_modifiers pattern, together with everything matched by enumeration_of_noun_modifiers, and place it all inside single quotes (' '), so that each substring is restructured like this:

((NOUN='acoplados rojos, grandes y bastante pesados')aquellos)

((NOUN='trayectos bastante empedrados, polvorientos, y un tanto arenosos')esos)

Keep in mind that I have placed exclude in front of r"\w+" in both the direct_subject_modifiers pattern and the noun_pattern pattern. It is a negative lookahead that checks that the captured word does not match any element of list_verbs_in_this_input (in order to avoid false positives).
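To illustrate how the exclude lookahead behaves on its own, here is a minimal sketch. It uses a shortened, illustrative verb list (`verbs`, not the full list_verbs_in_this_input) and a made-up test string. It suggests that, when nothing forces the match to start at a word boundary, the lookahead only blocks a match that begins at the start of a listed verb, so the engine can still match the tail of that verb. Adding a leading `\b` is an assumption about the intended behaviour, not something the original code does.

```python
import re

# Illustrative subset of list_verbs_in_this_input (assumption: two verbs are enough to show the effect).
verbs = ["puede", "ver"]
exclude = rf"(?!\b(?:{'|'.join(verbs)})\b)"

text = "puede ver camiones"  # made-up test string

# Same construction as noun_pattern: the lookahead followed by \w+.
# The match may start in the middle of a verb, where the inner \b fails
# and the negative lookahead therefore succeeds.
unanchored = re.findall(exclude + r"\w+", text)
print(unanchored)  # ['uede', 'er', 'camiones']

# Variant with a leading \b (assumption): a match can only start at the
# beginning of a word, so listed verbs are skipped entirely.
anchored = re.findall(r"\b" + exclude + r"\w+", text)
print(anchored)    # ['camiones']
```

Whether this matters inside sentence_capture_pattern depends on what precedes each sub-pattern, since a preceding `\s+` already forces the word to start after whitespace.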

The output string that should be obtained after identifying and restructuring those substrings is the following:

'En esta alejada ciudad por la tarde circulan muchos camiones con ((NOUN='acoplados rojos, grandes y bastante pesados')aquellos), llevándolos por ((NOUN='trayectos bastante empedrados, polvorientos, y un tanto arenosos')esos). Y incluso bastante desde lejos ya se les puede ver.'

Why are these substrings not identified, and why doesn't my regex sentence_capture_pattern work?
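One way to narrow this down is to test the sub-patterns in isolation. The sketch below is only a diagnostic under simplifying assumptions: it copies modifier_connectors from the code above, uses shortened sample strings taken from input_text, and replaces noun_pattern and direct_subject_modifiers with a plain `\w+` (the exclude lookahead is left out).

```python
import re

# Copied from the question's code.
modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"

# 1) The connector must begin with "," or "y" and has no leading \s*,
#    so it cannot start at the space in front of "y".
with_space = re.match(modifier_connectors, " y bastante pesados")
without_space = re.match(modifier_connectors, "y bastante pesados")
print(with_space)                    # None
print(repr(without_space.group(0)))  # 'y bastante '

# 2) Two consecutive \w+ (noun, then first modifier) consume "bastante" as a plain word.
sample = "esos trayectos bastante empedrados, polvorientos"
pair = re.match(r"esos\s+(\w+)\s+(\w+)", sample)
print(pair.groups())                 # ('trayectos', 'bastante')

# The connector would then have to match at " empedrados", where there is no "," or "y".
after_pair = re.match(modifier_connectors, sample[pair.end():])
print(after_pair)                    # None
```

Both checks point at the first version of sentence_capture_pattern: the `{2,}` repetition cannot be satisfied for either phrase, so `re.sub` finds nothing to replace there.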


EDIT CODE:

This is the code after some modifications; even so, it still has some bugs.

import re

input_text = "En esta alejada ciudad por la tarde circulan muchos camiones con aquellos acoplados rojos, grandes y bastante pesados, llevándolos por esos trayectos bastante empedrados, polvorientos, y un tanto arenosos. Y incluso bastante desde lejos ya se les puede ver." #example string

list_verbs_in_this_input = ["serías", "serían", "sería", "ser", "es", "llevándoles", "llevándole", "llevándolos", "llevándolo", "circularías", "circularía", "circulando", "circulan", "circula", "consiste", "consistían", "consistía", "consistió", "visualizar", "ver", "empolvarle", "empolvar", "verías", "vería", "vieron", "vió", "vio", "ver", "podrías" , "podría", "puede"]

exclude = rf"(?!\b(?:{'|'.join(list_verbs_in_this_input)})\b)"
direct_subject_modifiers, noun_pattern = exclude + r"\w+" , exclude + r"\w+"

# accepts the word "bastante" on its own, independently of whether "(m[áa]s|menos)" follows it
modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"

#enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors  + direct_subject_modifiers + "){2,}"
enumeration_of_noun_modifiers = direct_subject_modifiers + "(?:" + modifier_connectors  + direct_subject_modifiers + ")*"


#sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + enumeration_of_noun_modifiers
sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + modifier_connectors + direct_subject_modifiers + r"\s+(?:" + enumeration_of_noun_modifiers + r"|)"

# ((NOUN)'    ')
input_text = re.sub(sentence_capture_pattern, r"((NOUN)'\g<0>')", input_text, flags=re.I|re.U)
print(repr(input_text)) # --> output
Matt095
  • I don't know your language, but your regex will only consider "bastante" where it occurs in your regex, when it is followed by either "mas" or "menos". This is not the case in "bastante pesados" nor in "bastante empedrados", so I don't know what should happen... – trincot Jan 28 '23 at 13:52
  • Secondly, the two `\w+` parts of the regex would match with `trayectos bastante`, so I don't see how you would expect to continue the match. This time I won't be able to post an answer for your query... it is not clear to me how you expect that result. – trincot Jan 28 '23 at 14:08
  • @trincot You are right, I think I should add the option `"bastante" ` (on its own, without `"mas"` or `"menos"`) to the pattern in `modifier_connectors` – Matt095 Jan 28 '23 at 14:16
  • @trincot I have edited the question contemplating the case where the word `"bastante"` is alone `modifier_connectors = r"(?:(?:,\s*|)y|(?:,\s*|)y|,)\s*(?:(?:(?:a[úu]n|todav[íi]a|incluso)\s+|)(?:(?:de\s*gran|bastante|un\s*tanto|un\s*poco|)\s*(?:m[áa]s|menos)|bastante)\s+|)"` , still the code keeps failing. – Matt095 Jan 28 '23 at 14:22
  • Yes, there are several issues I found that would partially fix the situation, but my second comment above remains a problem where I don't know what is supposed to happen. – trincot Jan 28 '23 at 14:23
  • This capture structure would look something like this: `"esos"` + `noun_pattern` = `"trayectos"` + `enumeration_of_noun_modifiers` = `"bastante empedrados, polvorientos, y un tanto arenosos"`, where the connectors would be `"bastante"` , `","` , `",y"` – Matt095 Jan 28 '23 at 14:25
  • Yes, but the second `\w+` will already match `bastante` and so it will not be matched by the `enumeration_of_noun_modifiers`. – trincot Jan 28 '23 at 14:28
  • I hope you can see that this kind of parsing is just impossible to maintain. You should really not pursue this type of coding. It is a project that is doomed to fail. – trincot Jan 28 '23 at 14:34
  • Something that occurred to me is to modify the pattern so that the noun is followed by one adjective with no connector in between, and only after that adjective are the intermediate connectors required, since from there on it is an enumeration: `sentence_capture_pattern = r"(?:aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\s+" + noun_pattern + r"\s+" + r"(?:" + modifier_connectors + direct_subject_modifiers + r")" + enumeration_of_noun_modifiers` – Matt095 Jan 28 '23 at 14:35
  • @trincot What I could do in that case is stop requiring 2 enumerated items for a match; that is, I should replace `"){2,}"` with `")*"` and make the pattern stored in enumeration_of_noun_modifiers optional, that is, write it as `r"(?:" + enumeration_of_noun_modifiers + r"|)"` – Matt095 Jan 28 '23 at 14:39
  • Sorry, I have already given up on this question, realising that this is just not the right way to parse language. – trincot Jan 28 '23 at 14:42
  • @trincot Really thank you very much for all the help, anyway I'll be doing a couple more tries. I have edited the question again, attaching a version of the code at the end of this question, taking into account the considerations that we discussed in comments. – Matt095 Jan 28 '23 at 14:51
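
For reference, here is a minimal sketch of how the pattern might be restructured along the lines discussed in these comments. It is not the author's code and it only targets this one example sentence. The names `determiner`, `word`, `intensifier`, `modifier` and `connector` are hypothetical, the verb list is shortened to verbs that occur in input_text, and three changes are assumptions: `\b` anchors around the determiner and each word, "bastante" / "un tanto" / "un poco" treated as an optional prefix of a modifier rather than as part of the connector, and connectors that allow whitespace before "y".

```python
import re

input_text = ("En esta alejada ciudad por la tarde circulan muchos camiones con "
              "aquellos acoplados rojos, grandes y bastante pesados, llevándolos por "
              "esos trayectos bastante empedrados, polvorientos, y un tanto arenosos. "
              "Y incluso bastante desde lejos ya se les puede ver.")

# Shortened verb list (assumption: only verbs present in this input are listed).
verbs = ["llevándolos", "circulan", "puede", "ver"]

# Group 1: the determiner, anchored on both sides so "los" inside "llevándolos" is ignored.
determiner = r"\b(aquellas|aquellos|aquella|aquel|los|las|el|la|esos|esas|este|ese|otros|otras|otro|otra)\b"
# A whole word that is not one of the listed verbs.
word = rf"\b(?!(?:{'|'.join(verbs)})\b)\w+"
# Optional degree word that belongs to the following modifier.
intensifier = r"(?:(?:bastante|un\s+tanto|un\s+poco)\s+)?"
modifier = intensifier + word
# ", y " / ", " / " y " between modifiers.
connector = r"(?:\s*,\s*y\s+|\s*,\s*|\s+y\s+)"

# Group 2: noun, first modifier, then at least one connector + modifier.
pattern = determiner + r"\s+(" + word + r"\s+" + modifier + "(?:" + connector + modifier + ")+)"

# Move the determiner behind the quoted noun phrase.
result = re.sub(pattern, r"((NOUN='\g<2>')\g<1>)", input_text, flags=re.I)
print(result)
```

On this input the two noun phrases should come out in the form shown in the question. The approach remains fragile for other sentences, which is consistent with the maintainability concern raised above.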

0 Answers