I have been trying several ways to split these strings at the separators defined in the separator_symbols variable, but only when the content between two separators contains a substring that matches the pattern "\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
and also contains 3 other words besides that match (by "word" I mean a sequence of uppercase and/or lowercase letters that is separated from the rest of the text by at least one whitespace character).
import re
def this_substring_has_a_verb_substring(substring):
    pattern = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
    return re.search(pattern, substring) is not None
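As a quick sanity check of what this helper accepts, here is a small sketch. The sample strings and the name VERB_TAG are made up for illustration; the pattern itself is the one from the function above.

```python
import re

# Same pattern as in this_substring_has_a_verb_substring
VERB_TAG = re.compile(r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)")

# Hypothetical sample strings, chosen only to illustrate the pattern
samples = [
    "las hojas ((VERB)son) doradas",   # tag with no space after (VERB)
    "hay que ((VERB) correr)",         # tag with a space after (VERB)
    "El árbol ((VERB es)) grande",     # parentheses in a different order
    "sin verbo",                       # no tag at all
]

results = {s: VERB_TAG.search(s) is not None for s in samples}
for sample, found in results.items():
    print(found, "<-", sample)
```

Only the first two samples match, because the pattern requires the literal text ((VERB) before the word.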
#example 1
input_string = 'El árbol ((VERB)es) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos'
#example 2
input_string = 'hay que ((VERB) correr), ((VERB)saltar), ((VERB)volar) y ((VERB)caminar) para llegar a ese lugar'
separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
To split these strings and obtain the outputs shown at the end of this question, I have tried 2 approaches, although I ran into limitations with both.
- As a first option, I tried to build a very generic pattern that matches the text enclosed between two matches of the pattern stored in separator_symbols, or bounded by the start or end of the original string.
#OPTION 1
# "\(\(VERB\)\s*\w+(?:\s+\w+)*\)" #((VERB)asfdgfg)
# "((?:(?:\w+))?){3}" # 3 words
# the capture pattern should not be this permissive, since in many cases it would be useless for validating the content
captured_sentence_part = r"(.)*"
substrings = re.findall(separator_symbols + captured_sentence_part + r'(?:' + separator_symbols + r'|$)', input_string)
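One detail of option 1 that may matter: with a quantifier placed outside the group, as in (.)*, the group only keeps the last character it matched, so re.findall returns single characters rather than the enclosed text. A minimal sketch of that behaviour, where the << and >> delimiters are hypothetical stand-ins for the separators:

```python
import re

# Hypothetical delimiters standing in for the real separators
sample = "<<abc>>"

# Quantifier outside the group: the group is overwritten on every repetition
last_char_only = re.findall(r"<<(.)*>>", sample)

# Quantifier inside the group: the group captures the whole enclosed span
whole_span = re.findall(r"<<(.*?)>>", sample)

print(last_char_only)
print(whole_span)
```

This is only meant to isolate the capture behaviour; it does not address how permissive the pattern is.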
- As a second option, I tried to use the split() method to separate the words into a list, and then the len() function to count the items in each candidate piece of the original input string. All of this is done in an auxiliary variable, since an if still has to confirm that the piece meets both conditions.
But the problem with option 2 is that it is too complicated to join the split results back together into a list like the ones shown at the end of the question.
#OPTION 2
pattern = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
number_of_words = 3  # minimum number of words besides the ((VERB)...) tag
sub_sentences_list = re.split(pattern, input_string)
for i_sub_input_text in sub_sentences_list:
    words = i_sub_input_text.split()
    word_count = len(words)
    # conditions validation: a tag such as ((VERB)son) counts as one whitespace-separated token, hence the + 1
    if word_count >= number_of_words + 1 and this_substring_has_a_verb_substring(i_sub_input_text):
        print("extraction!")
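Part of the difficulty with option 2 seems to be that re.split removes everything the separator matches, including the article or capital letter that starts the next clause. A possible workaround, sketched here on a shortened input, is to move that part into a lookahead so it is checked but not consumed. The names punctuation_part, clause_start and lookahead_separator are mine, introduced only for this illustration.

```python
import re

# Shortened, hypothetical input based on example 1
text = 'El árbol ((VERB)es) grande, las hojas ((VERB)son) doradas'

# Original separator: the trailing group is consumed by the split
separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
consumed = re.split(separator_symbols, text)

# Same two parts, but the clause start is only looked at, not consumed
punctuation_part = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)'
clause_start = r'(?:[A-Z]|l[oa]s|la|[eé]l)'
lookahead_separator = punctuation_part + r'(?=' + clause_start + r')'
kept = re.split(lookahead_separator, text)

print(consumed)  # the second piece has lost "las"
print(kept)      # the second piece still starts with "las"
```

The punctuation itself is still dropped by this split, so it does not yet reproduce the trailing commas of the expected output.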
The outputs in each of the examples should look like these lists with the divisions of the original string:
#for example 1:
['El árbol ((VERB)es) grande,',
'las hojas ((VERB)son) doradas y ((VERB)son) secas,',
'los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos']
#for example 2:
['hay que ((VERB) correr), ((VERB)saltar), ((VERB)volar) y ((VERB)caminar) para llegar a ese lugar']
Note that in example 2 the string was not split, since no content between two separator_symbols matches contains both a ((VERB) ) sequence and 3 other words.
What method would be the most recommended? And how should I fix it to get this list as output?
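One possible direction, offered as a sketch rather than a definitive answer: find the candidate split points first, then only cut where the accumulated piece contains a ((VERB)...) tag and at least 3 other words. Everything below is an assumption on my part: SPLIT_POINT is an adapted version of separator_symbols (clause start moved into a lookahead, word boundaries added), the helper names qualifies and split_clauses are hypothetical, and "other words" is read as runs of letters outside the tag.

```python
import re

# Same tag pattern as in this_substring_has_a_verb_substring
VERB_TAG = re.compile(r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)")

# Assumption: adapted separator, not the original separator_symbols.
# The punctuation (and an optional "y") is consumed; the start of the
# next clause is only checked by the lookahead.
SPLIT_POINT = re.compile(
    r"(?:[,;.]?\s*\by\s+|[,;]\s*)"
    r"(?=[A-Z]|l[oa]s\b|la\b|[eé]l\b)"
)

def qualifies(piece, min_words=3):
    """True if piece has a verb tag plus at least min_words other words."""
    if VERB_TAG.search(piece) is None:
        return False
    remainder = VERB_TAG.sub(" ", piece)
    other_words = re.findall(r"[^\W\d_]+", remainder)  # runs of letters
    return len(other_words) >= min_words

def split_clauses(text, min_words=3):
    # 1) cut at every candidate split point, keeping punctuation on the left
    pieces, start = [], 0
    for match in SPLIT_POINT.finditer(text):
        pieces.append(text[start:match.end()])
        start = match.end()
    pieces.append(text[start:])

    # 2) glue pieces together until the accumulated text qualifies
    result, buffer = [], ""
    for piece in pieces:
        buffer += piece
        if qualifies(buffer, min_words):
            result.append(buffer.strip())
            buffer = ""

    # 3) a non-qualifying tail is merged back into the previous piece
    if buffer.strip():
        if result:
            result[-1] = result[-1] + " " + buffer.strip()
        else:
            result.append(buffer.strip())
    return result

example_1 = ('El árbol ((VERB)es) grande, las hojas ((VERB)son) doradas y '
             '((VERB)son) secas, los juegos del parque ((VERB)estan) algo '
             'oxidados y ((VERB)es) peligroso subirse a ellos')
example_2 = ('hay que ((VERB) correr), ((VERB)saltar), ((VERB)volar) y '
             '((VERB)caminar) para llegar a ese lugar')

print(split_clauses(example_1))
print(split_clauses(example_2))
```

With these assumptions, example 1 comes out as three pieces with their trailing commas and example 2 comes out as a single unsplit string. When the separator is a "y", this sketch leaves the "y" at the end of the previous piece, which may or may not be what is wanted.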