
I have been trying several ways to split these strings at the separators described by the separator_symbols pattern, but only when the text between two separators contains a substring that matches the pattern "\(\(VERB\)\s*\w+(?:\s+\w+)*\)" and also contains 3 other words besides that match. (By word I mean a sequence of uppercase and/or lowercase letters that is separated from the rest of the text by at least one whitespace character.)

import re

def this_substring_has_a_verb_substring(substring):
    pattern = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
    return re.search(pattern, substring) is not None


#example 1
input_string = 'El árbol ((VERB)es) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos'
#example 2
input_string = 'hay que ((VERB) correr), ((VERB)saltar), ((VERB)volar) y ((VERB)caminar) para llegar a ese lugar'


separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'

In order to divide these strings and obtain the outputs that are at the end of this question, I have tried 2 ways to achieve it, although I found limitations in both.

  • As a first option, I tried to build a very generic pattern that matches the text enclosed between two matches of the separator_symbols pattern, or bounded by the start or end of the original string.
#OPTION 1
# "\(\(VERB\)\s*\w+(?:\s+\w+)*\)" #((VERB)asfdgfg)
# "((?:(?:\w+))?){3}" # 3 words

#this capture pattern accepts far too many possibilities, which makes it useless for validating the content between separators
captured_sentence_part = r"(.)*"

substrings = re.findall(separator_symbols + captured_sentence_part + r'(?:' + separator_symbols + r'|$)', input_string)
  • As a second option, I tried to use re.split() to break the input string into a list of candidate substrings, and then split() and len() to count the words in each candidate. All of this is done in auxiliary variables, since an if still has to confirm that each candidate meets both conditions.

But the problem with option 2 is that it is too complicated to join the split results back together into a list like the ones shown at the end of this question.

#OPTION 2
pattern = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
sub_sentences_list = re.split(pattern, input_string)

number_of_words = 3  # assumed value: number of words required besides the verb

for i_sub_input_text in sub_sentences_list:
    words = i_sub_input_text.split()
    word_count = len(words)

    #conditions validation
    if word_count > int(number_of_words) + 1 and this_substring_has_a_verb_substring(i_sub_input_text):
        print("extraction!")
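
One possible way to put the split results of option 2 back together is sketched below, under some assumptions. The names VERB_CAPSULE, SPLIT_POINT, MIN_OTHER_WORDS, is_valid_clause and split_into_clauses are hypothetical, and the threshold of 3 other words is my reading of the condition described above. The word that opens the next clause is moved into a lookahead so that it is not consumed by the separator, and a piece that fails the two conditions is merged with the piece that follows it (or with the previous clause if it is the last one) instead of being emitted on its own.

```python
import re

# Verb capsule, e.g. ((VERB)son) or ((VERB) correr)
VERB_CAPSULE = re.compile(r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)")

# Same separators as separator_symbols, but the word that opens the next
# clause sits in a lookahead, so it stays in the text after the cut.
SPLIT_POINT = re.compile(
    r"(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?=[A-Z]|l[oa]s|la|[eé]l)"
)

MIN_OTHER_WORDS = 3  # assumption: words required besides the verb capsule


def is_valid_clause(text):
    """True if text has a verb capsule plus at least MIN_OTHER_WORDS other words."""
    if VERB_CAPSULE.search(text) is None:
        return False
    # Count runs of letters outside the capsules (punctuation is ignored).
    other_words = re.findall(r"[^\W\d_]+", VERB_CAPSULE.sub(" ", text))
    return len(other_words) >= MIN_OTHER_WORDS


def split_into_clauses(input_string):
    # Candidate cuts fall right after each separator, so the punctuation
    # stays attached to the clause on its left.
    cuts = [m.end() for m in SPLIT_POINT.finditer(input_string)]
    cuts.append(len(input_string))

    spans = []
    start = 0
    for end in cuts:
        # Only close a clause once the accumulated text meets both conditions.
        if is_valid_clause(input_string[start:end]):
            spans.append((start, end))
            start = end

    # Leftover text that never became valid is attached to the last clause.
    if start < len(input_string):
        if spans:
            spans[-1] = (spans[-1][0], len(input_string))
        else:
            spans.append((start, len(input_string)))

    return [input_string[a:b].strip() for a, b in spans]


example_1 = 'El árbol ((VERB)es) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos'
example_2 = 'hay que ((VERB) correr), ((VERB)saltar), ((VERB)volar) y ((VERB)caminar) para llegar a ese lugar'

for clause in split_into_clauses(example_1):
    print(repr(clause))
print(split_into_clauses(example_2))
```

Working with index spans rather than joined strings keeps the original spacing intact when pieces are merged. The lookahead is the main design choice here: with the original non-capturing group, re.split() would drop the article (los, las, la, el) that starts each clause.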

The outputs in each of the examples should look like these lists with the divisions of the original string:

#for example 1:
['El árbol ((VERB)es) grande,', 
'las hojas ((VERB)son) doradas y ((VERB)son) secas,', 
'los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos']

#for example 2:
['hay que ((VERB) correr), ((VERB)saltar), ((VERB)volar) y ((VERB)caminar) para llegar a ese lugar']

Note that in example 2 the string was not split, since it did not meet the condition of having a ((VERB) ) sequence plus 3 other words between matches of separator_symbols.

Which of these methods would you recommend? And how should I fix it to get these lists as output?

Matt095
  • is there a distinction between verbs written like `((VERB correr))` vs `((VERB)son)` or is that a typo – Alexander Feb 13 '23 at 06:29
  • @Alexander It's a typographical error, I'll fix it there – Matt095 Feb 13 '23 at 06:31
  • I fixed the output; the verb capsule pattern is always `((VERB)\s*something)`, and I use the regex pattern `r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"` to identify such "standardized" sequences – Matt095 Feb 13 '23 at 06:35
