import re
word = ""
input_text = "Creo que July no se trata de un nombre" #example 1, should match with the Case 00
#input_text = "Creo que July Moore no se trata de un nombre" #example 2, should not match any case
#input_text = "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre" #example 3, should match with the Case 01
#input_text = "July Moore no se trata de un nombre" #example 4, should match with the Case 01
name_capture_pattern_00 = r"((?:\w+))?" # does not tolerate whitespace in middle
#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"(^[A-Z](?:\w\s*)+)" # tolerates that there are spaces but forces it to be a word that begins with a capital letter
#Case 00
regex_pattern_00 = name_capture_pattern_00 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
#Taking the regex pattern(case 00 or case 01), it will search the string and then try to extract the substring of interest using capturing groups.
n0 = re.search(regex_pattern_00, input_text)
if n0 and word == "":
word, = n0.groups()
word = word.strip()
print(repr(word)) # --> print the substring that I captured with the capturing group
n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
word, = n1.groups()
word = word.strip()
print(repr(word)) # --> print the substring that I captured with the capturing group
If in front of the pattern there is a .\s*
, a ,\s*
, a ;\s*
, or if it is simply the beginning of the input string, then use this capture pattern name_capture_pattern_01 = r"((?:\w\s*)+)?"
, but if that is not the case, use this other capture pattern name_capture_pattern_00 = r"((?:\w+))?"
I think that in case 00 you should add something like this at the beginning of the pattern (?:(?<=\s)|^)
That way you would get these 2 possible resulting patterns after concatenate, where perhaps an or
condition |
can be set inside the search pattern:
In Case 00
...
(?:\.|\;|\,)
or the start of the string
+
((?:\w\s*)+)?
+
r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
In other case (Case 01
)...
((?:\w+))??
+
r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
But in both cases (Case 00
or Case 01
, depending on what the program identifies) it should match the pattern and extract the capturing group to store it in the variable called as word
.
And the correct output for each of these cases would be the capture group that should be obtained and printed in each of these examples:
'July' #for the example 1
'' #for the example 2
'July Moore' #for the example 3
'July Moore' #for the example 4
EDIT CODE:
This code, although it appears that the regex patterns are well established, fails by returning as output only the last part of the name, in this case "Moore"
, and not the full name "July Moore"
import re
#Here are 2 examples where you can see this "capture error"
input_text = "HghD djkf ; July Moore no se trata de un nombre"
input_text = "July Moore no se trata de un nombre"
word = ""
#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)"
#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
word, = n1.groups()
word = word.strip()
print(repr(word))
In both examples, since it complies with starting with (?:^|[.;,]\s*)
and starting with a capital letter like this pattern ([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)
, it should print the full name in the console July Moore
. It's quite curious but placing this pattern makes it impossible for me to capture a complete name under these conditions established by the search pattern.