Set regex pattern that concatenates one capture group or another depending on whether or not the input string starts with certain symbols

Question

import re
word = ""

input_text = "Creo que July no se trata de un nombre" #example 1, should match with the Case 00
#input_text = "Creo que July Moore no se trata de un nombre" #example 2, should not match any case
#input_text = "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre" #example 3, should match with the Case 01
#input_text = "July Moore no se trata de un nombre" #example 4, should match with the Case 01

name_capture_pattern_00 = r"((?:\w+))?"         # does not tolerate whitespace in middle

#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"(^[A-Z](?:\w\s*)+)"      # tolerates that there are spaces but forces it to be a word that begins with a capital letter

#Case 00
regex_pattern_00 = name_capture_pattern_00 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

#Taking the regex pattern(case 00 or case 01), it will search the string and then try to extract the substring of interest using capturing groups.

n0 = re.search(regex_pattern_00, input_text)
if n0 and word == "":
    word, = n0.groups()
    word = word.strip()

print(repr(word)) # --> print the substring that I captured with the capturing group

n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()

print(repr(word)) # --> print the substring that I captured with the capturing group

If the pattern is preceded by .\s* , by ,\s* , or by ;\s* , or if it sits at the very beginning of the input string, then this capture pattern should be used: name_capture_pattern_01 = r"((?:\w\s*)+)?". Otherwise, this other capture pattern should be used: name_capture_pattern_00 = r"((?:\w+))?"

I think that in Case 00 something like (?:(?<=\s)|^) should be added at the beginning of the pattern.

That way, concatenation would yield these 2 possible resulting patterns, where perhaps an or condition | could be set inside the search pattern:

In Case 00...

In other case (Case 01)...

((?:\w+))?? + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

But in both cases (Case 00 or Case 01, depending on what the program identifies) it should match the pattern, extract the capturing group, and store it in the variable called word.

And the correct output for each of these cases would be the capture group that should be obtained and printed in each of these examples:

'July'         #for the example 1
''             #for the example 2
'July Moore'   #for the example 3
'July Moore'   #for the example 4

EDIT CODE:

Although the regex patterns appear to be well formed, this code fails: it outputs only the last part of the name, in this case "Moore", and not the full name "July Moore".

import re

#Here are 2 examples where you can see this "capture error"
#input_text = "HghD djkf ; July Moore no se trata de un nombre"
input_text = "July Moore no se trata de un nombre"

word = ""

#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)"

#Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()

print(repr(word))

In both examples, the name is preceded by something that satisfies (?:^|[.;,]\s*) and starts with a capital letter as required by ([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*), so the program should print the full name July Moore to the console. Curiously, with this pattern in place I cannot capture a complete name under the conditions established by the search pattern.
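The comments under this question trace the "Moore" output to the Case 00 check having run before the Case 01 check in the full script. Below is a minimal sketch of that effect. The helper first_capture is hypothetical (it is not in the original code), and unlike the original snippet it skips a match whose capture group is empty instead of calling strip() on it.

```python
import re

# Shared tail of both patterns, copied from the question.
TAIL = r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

CASE_00 = r"((?:\w+))?" + TAIL
CASE_01 = r"(?:^|[.;,]\s*)([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)" + TAIL


def first_capture(patterns, text):
    """Hypothetical helper: return the capture of the first pattern that yields one."""
    for pattern in patterns:
        m = re.search(pattern, text)
        if m and m.group(1):
            return m.group(1).strip()
    return ""


text = "July Moore no se trata de un nombre"
print(repr(first_capture([CASE_00, CASE_01], text)))  # 'Moore'
print(repr(first_capture([CASE_01, CASE_00], text)))  # 'July Moore'
```

Because Case 00 accepts any single word directly before "no", it matches "Moore" on its own, and the `word == ""` guard then prevents Case 01 from replacing it.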

You'll have to differentiate between what is preceding your target strings. In your case either `que ` or `(?:^|[.]\s)` where the character class could hold any valid character preceding your target. [Example](https://regex101.com/r/UGH7MO/1). — oriberu, Jan 15 '23 at 00:59
@oriberu based on that pattern that you sent, I think I should set this `(?:^|[.;,]\s*)` in front of this pattern `name_capture_pattern_00 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"`. You use the caret sign `^` to indicate the beginning of the string, right? — Matt095, Jan 15 '23 at 01:11
@oriberu I wasn't trying to catch the `"no"` or the `"?"`, I was just trying to make that part of the pattern case-insensitive, and I didn't know that I could just use `r"(?i)something(?-1)"` when it was a pattern with a single option, instead of having to resort to a multi-option construct like `r"(?i:option_a|option_b|option_c)"`. I have used the caret `^` in the **new code that I have placed in my updated question**. I have made an edit to try to simplify it and implement the caret `r"(?:^|[.;,]\s*)"`, but the program still fails to identify the correct word to extract — Matt095, Jan 15 '23 at 10:23
Something else I noticed, you use `\s*` when you probably mean `\s+`, unless you actually do want to match concatenated words (without spaces between them), too. — oriberu, Jan 15 '23 at 11:14
@oriberu It is important to distinguish between upper and lower case, since in the capture groups, names are being captured, however in the rest of the pattern the upper and lower case were established as optional. The thing about putting `\s*` is because I assume the possibility that some users omit (erroneously) some spaces when entering data. Although now that I notice it, I think I should put something like this `(?i)(?:a|b|c)(?-i)` to establish that only in that specific pattern (setting a start and an end) it should ignore the presence of upper and lower case letters — Matt095, Jan 15 '23 at 11:50
If you are using `\w+` to match names, case insensitivity is already active anyway, since `\w` is usually implemented as a variation of `[a-zA-Z0-9_]`. If you don't want that, you'll have to be more specific with the words to match, e.g. `[A-Z][a-z]+` uppercase letter followed by lowercase letters, etc. — oriberu, Jan 15 '23 at 11:58
@oriberu Why does it fail when I try to fetch names that start with a capital letter and may (or may not) have spaces in between, using `r"(^[A-Z](?:\w\s*)+)"`? For example, with the name `"Karina Bela"`, this capturing group `r"(?:^|[.;,]\s*)(^[A-Z](?:\w\s*)+)"` should capture both parts of the name, but it doesn't work. — Matt095, Jan 15 '23 at 12:19
You don't need the caret inside the capturing group, but that _should_ actually work, although I would write it as something like `(?:^|[.;,]\s*)([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)` to make sure you only get words starting with an uppercase letter. Check the rest of your code for other errors. I would really encourage you to use a resource like regex101 where you can space out and test your expressions. If you enter your own example from your last comment there, you'll see it works. — oriberu, Jan 15 '23 at 12:26
@oriberu It's really quite curious but even removing that, and leaving the pattern `(?:^|[.;,]\s*)([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)`, the program still doesn't work. At the end of the question I have added an edit, with a code that shows the error in the capture group. I feel that it is very likely that the pattern is not well defined or that the capture of one of the patterns interferes with the capture group that is after it. — Matt095, Jan 15 '23 at 12:40
Sorry, I got confused because I tried to validate the other condition n0 first and then the condition n1, but the pattern works perfectly. Really thank you very much for the help. And although it works anyway, that last part would look like this `(?i)\s*no\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)(?-i)` — Matt095, Jan 15 '23 at 13:13
When I run your "edit" code, it outputs `July Moore`, not just `Moore` as you claim. — trincot, Jan 15 '23 at 13:39
@trincot Yes, it was my mistake, because inside the script where I tested it was the validation of the other condition — Matt095, Jan 15 '23 at 13:45
@oriberu Yes, I have considered that; I think the best way to solve it is the way I left it in the edit to the question. Otherwise there would be problems detecting the capital letters of the name. If the capital letters in the names were not important, I think the regex could have been simplified further — Matt095, Jan 15 '23 at 13:46
Can we summarise the requirement as follows: a *name* (with proper case) consisting of two or more words should only match when it occurs at the start of a sentence. A name consisting of just one word can match wherever it occurs in the sentence (start or not). In both cases a text variant of "no se trata de un nombre" should follow. — trincot, Jan 15 '23 at 14:37
It seems that Python's `re` module now actually supports [non-capturing groups with inline modifiers](https://docs.python.org/3/library/re.html#index-17), which I'm pretty sure it didn't use to. I'll remove those of my comments that will be wrong in light of that knowledge. Sorry for the confusion. — oriberu, Jan 15 '23 at 15:13
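To illustrate the scoped inline flags discussed in the comments above, here is a small sketch, assuming Python 3.6 or later. The helper name compiles is hypothetical. The scoped form (?i:...) limits case-insensitivity to the group it wraps, while a standalone (?-i) for switching a flag off mid-pattern (as proposed in some comments) is rejected by Python's re with an error.

```python
import re

# Only the parts inside (?i:...) ignore case; the captured name stays case-sensitive.
scoped = re.compile(r"([A-Z][a-z]+)\s*(?i:no)\s*(?i:es)\s*un\s*nombre")

upper_tail = scoped.search("July NO ES un nombre")   # matches: "NO ES" is inside scoped groups
lower_name = scoped.search("july no es un nombre")   # no match: the name must start uppercase


def compiles(pattern):
    """Hypothetical helper: report whether re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(upper_tail.group(1))            # July
print(lower_name)                     # None
print(compiles(r"(?i)no(?-i)"))       # False
```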

score 1 · Accepted Answer · answered Jan 15 '23 at 14:48

If I understood correctly, you want to exclude cases where both of the following are true:

The name consists of more than one word; AND
The name does not occur at the start of a sentence

You could use just one regex and then inspect the match to decide whether the above condition occurs.

Here is a script I tested with:

import re

texts = [
    # Name is NOT at start of sentence, Name has SINGLE word: 
    "Creo que July no se trata de un nombre", 
    # Name is NOT at start of sentence, Name has MULTIPLE words: 
    "Creo que July Moore no se trata de un nombre", 
    # Name is at START of sentence, Name has MULTIPLE words: 
    "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre", 
    "July Moore Donald no se trata de un nombre",
    # Name is at START of sentence, Name has SINGLE word: 
    "July no se trata de un nombre",
]

for input_text in texts:
    regex = r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))\s*un\s*nombre"
    
    print("input:", input_text)
    for match in re.finditer(regex, input_text):
        word = ""
        # match[1] is not None => match is at start of a sentence.
        # match[3] is not None => match has name with more than one word.
        if match[1] is not None or not match[3]:
            word = match[2]
        print("    match:", repr(word) if word else "(no match)")
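For reuse, the same check can be wrapped in a function that returns the captured name or an empty string, which is the output shape the question asks for. This is only a sketch: the names NAME_RE and extract_name are illustrative and not part of the tested script, and the function returns the first accepted match rather than all of them.

```python
import re

# Same regex as in the script above, compiled once.
NAME_RE = re.compile(
    r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)"
    r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))\s*un\s*nombre"
)


def extract_name(text):
    """Return the first accepted name in text, or '' if none qualifies."""
    for m in NAME_RE.finditer(text):
        # m[1] is not None => the match is at the start of a sentence.
        # m[3] is falsy    => the name has only one word.
        if m[1] is not None or not m[3]:
            return m[2]
    return ""


examples = [
    "Creo que July no se trata de un nombre",        # example 1
    "Creo que July Moore no se trata de un nombre",  # example 2
    "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre",  # example 3
    "July Moore no se trata de un nombre",           # example 4
]
for text in examples:
    print(repr(extract_name(text)))
# 'July'
# ''
# 'July Moore'
# 'July Moore'
```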

Notes:

I used finditer because, in theory, there might be more than one match in an input string.
The use of \s* instead of \s+ is odd, but in comments you indicated that this is intended as you want to capture cases where some space separation is left out.
Names can look more complex than just [A-Z][a-z]+. Some names include hyphens, apostrophes or other characters, not to mention letters from other alphabets. The letter following a hyphen might be upper or lower case... etc.
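The last note can be made concrete with a small sketch. The pattern below is only an illustrative assumption, not part of the tested script: it treats a name word as runs of letters (roughly, any Unicode letter) joined by single hyphens or apostrophes, and checks the initial capital in Python code because the re module has no dedicated "uppercase letter" class. The names LETTER, NAME_WORD_RE and is_name_word are hypothetical.

```python
import re

SIMPLE_RE = re.compile(r"[A-Z][a-z]+")          # the pattern used in the script above

LETTER = r"[^\W\d_]"                            # a word character that is not a digit or underscore
NAME_WORD_RE = re.compile(rf"{LETTER}+(?:[-']{LETTER}+)*")


def is_name_word(token):
    """Hypothetical check: letters with optional -/' separators, starting uppercase."""
    return bool(NAME_WORD_RE.fullmatch(token)) and token[0].isupper()


for token in ["July", "Anne-Marie", "O'Brien", "José", "july", "R2D2"]:
    print(token, bool(SIMPLE_RE.fullmatch(token)), is_name_word(token))
# July True True
# Anne-Marie False True
# O'Brien False True
# José False True
# july False False
# R2D2 False False
```

Whether such a broader pattern is appropriate depends on the input data, since it will also accept capitalized ordinary words.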
