How to build this regex so that it extracts a word that starts with a capital letter if and only if it appears after a previous pattern?

Question

I need a regex that extracts all the names in a sentence. For this purpose, a name is any word that starts with a capital letter and meets certain conditions on what precedes it within the sentence. The extraction must follow the pattern described below, and it must also capture the content before and after each name, so that this content can be printed next to the name extracted from that sequence.

This is the pseudo-regex pattern that I need:

the beginning of the input sentence or (,|;|.|y)

associated_sense_1: "some character string (alphanumeric)" or "nothing"

(con |juntos a |junto a |en compania de )

identified_person: "some word that starts with a capital letter (the name that I must extract)", which ends when the regex finds one or more spaces

associated_sense_2: "some character string (alphanumeric)" or "nothing"

the end of the input sentence or (,|;|.|y |con |juntos a |junto a |en compania de )

The (,|;|.|y) tokens are just connectors between people. They are used to build the regex pattern, but they carry no information beyond indicating which names belong together, so they can be removed afterwards with a .replace( , "")
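One detail of the pseudo-pattern above is worth checking before it is turned into a real regex: an unescaped `.` matches any character, so `(,|;|.|y)` would match everywhere. A minimal sketch of the difference, using made-up sample strings (the variable names are only illustrative):

```python
import re

# Unescaped ".": every character matches, so nothing is left between the splits.
loose = re.split(r",|;|.", "a,b")

# A character class (or an escaped "\.") matches only the literal punctuation.
strict = re.split(r"[,;.]", "a,b;c.d")

print(loose)   # ['', '', '', '']
print(strict)  # ['a', 'b', 'c', 'd']
```

Inside a character class such as `[,;.]` the dot is already literal, which is why no backslash is needed there.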

With this regex I need to extract these 3 string groups:

associated_sense_1

identified_person

associated_sense_2


associated_sense = associated_sense_1 + " " + associated_sense_2

This is the proto-code:

import re

#Example 1
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
#Example 2
#sense = "Adrian ya esta en la parada; y alli probablemente esten Lucy y May en la parada esperandonos"

person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[A-Z][^A-Z]*"
#person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[^A-Z]*"


for identified_person in re.split(person_identify_pattern, sense):
    identified_person = identified_person.strip()
    if identified_person:
        try:
            print(f"Write '{associated_sense}' to {identified_person}.txt")
        except:
            associated_sense = identified_person

The wrong output I get...

Write 'puede ser peligroso ir solas, quizas sea mejor ir' to con.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Melisa.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to ,.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Lucy en la parada.txt
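One likely contributor to lines such as `con.txt` and `,.txt` is documented behavior of `re.split`: when the pattern contains a capturing group, the captured text is returned as part of the result list. In addition, the pattern above consumes the name itself (`[A-Z][^A-Z]*`), so the names are discarded rather than returned. A small sketch on a shortened, made-up sentence:

```python
import re

text = "ir con Adrian y Lucy"

# Capturing group: the connectors themselves show up in the result.
with_capture = re.split(r"(con |y )", text)

# Non-capturing group: only the text between connectors is returned.
without_capture = re.split(r"(?:con |y )", text)

print(with_capture)     # ['ir ', 'con ', 'Adrian ', 'y ', 'Lucy']
print(without_capture)  # ['ir ', 'Adrian ', 'Lucy']
```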

Correct output for example 1:

Write 'quizas sea mejor ir con' to Adrian.txt
Write 'y seguro que luego podemos esperar por en la parada' to Melisa.txt
Write 'y seguro que luego podemos esperar por en la parada' to Marcos.txt
Write 'y seguro que luego podemos esperar por en la parada' to Lucy.txt

Correct output for example 2:

Write 'ya esta en la parada' to Adrian.txt
Write 'alli probablemente esten en la parada esperandonos' to Lucy.txt
Write 'alli probablemente esten en la parada esperandonos' to May.txt

I also tried this other regex, but the code still does not work:

import re

sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"

person_identify_pattern = r"\s*(?:,|;|.|y |con |juntos a |junto a |en compania de |)\s*((?:\w\s*)+)\s*(?<=con|por|a, | y )\s*([A-Z].*?\b)\s*((?:\w\s*)+)\s*(?:,|;|.|y |con |juntos a |junto a |en compania de )\s*"

for m in re.split(person_identify_pattern, sense):
    m = m.strip()
    if m:
        try:
            print(f"Write '{content}' to {m}.txt")
        except:
            content = m

But I keep getting this wrong output:

Write 'puede ser peligroso ir solas' to quizas sea mejor ir con Adrian y seguro que luego podemos esperar por.txt
Write 'puede ser peligroso ir solas' to Melisa,.txt
Write 'puede ser peligroso ir solas' to Marcos y Lucy en la parad.txt

`ya esta en la parada` doesn't appear in the text for Example 1. Is the first line of the correct output for Example 1 right? – sniperd, Aug 03 '22 at 14:41
@sniperd Sorry, that was my mistake. I copied the line from example 2, but for example 1 it should have been **'quizas sea mejor ir con'** instead. I have now edited the question with the correct output for example 1. – , Aug 03 '22 at 15:15

score 0 · Accepted Answer · answered Aug 04 '22 at 09:06

import re

sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
if match := re.findall(r"(?<=con|por|a, | y )\s*([A-Z].*?\b)", sense):
    print(match)

Result: ['Adrian', 'Melisa', 'Marcos', 'Lucy']
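Two caveats about the lookbehind in this answer. First, Python's standard `re` module only accepts fixed-width lookbehinds, so longer connectors from the question such as `junto a ` cannot simply be added to the alternation. Second, the `a, ` alternative only matches `Marcos` because the previous name, `Melisa`, happens to end in "a". A possible alternative, shown here only as a sketch, is to consume the connector in a non-capturing group and capture the name (the connector list below is an assumption drawn from the question):

```python
import re

sense = ("puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro "
         "que luego podemos esperar por Melisa, Marcos y Lucy en la parada")

# Alternatives of different widths are rejected inside a lookbehind.
error_message = None
try:
    re.compile(r"(?<=con |junto a )[A-Z]\w*")
except re.error as exc:
    error_message = str(exc)
print(error_message)

# Consume the connector instead of looking behind, and capture only the name.
pattern = re.compile(r"(?:\b(?:con|por|junto a|en compania de) |, | y )([A-Z]\w*)")
names = pattern.findall(sense)
print(names)  # ['Adrian', 'Melisa', 'Marcos', 'Lucy']
```

Because the connector is consumed rather than asserted, its alternatives may have any length.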

RedApple


I have updated the question, combining your regex with mine, but I still have problems extracting the content before and after the names (words without spaces that start with a capital letter) – Aug 04 '22 at 09:44
This is called a lookbehind assertion (`(?<=...)`). Open https://docs.python.org/3/library/re.html and use ctrl+F to search for `?<=` – RedApple Aug 04 '22 at 10:35
You provided the wrong string. I tried your description again and got the following result:

```python
import re

sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
if match := re.findall(r"(.*?)(?!con|por|a, | y )(?<=con|por|a, | y )\s*([A-Z].*?\b)", sense):
    print(match)
```

It prints `[('puede ser peligroso ir solas, quizas sea mejor ir con', 'Adrian'), (' y seguro que luego podemos esperar por', 'Melisa'), (', ', 'Marcos'), (' y ', 'Lucy')]` – RedApple Aug 04 '22 at 10:42
I have tried something similar, and then decomposed the resulting list into pairs of elements. The problem is that **Melisa**, **Marcos** and **Lucy** are 3 names of people, and all of them must be assigned the same text, in this case *"y seguro que luego podemos esperar por"*. Signs like **,** or **y** are just person connectors that are part of the recognition regex pattern and must be removed with a replace( ,"") after the recognition. That is why those connectors don't appear in the final result. – Aug 04 '22 at 10:59
I need this code to print this list: `[('quizas sea mejor ir con', 'Adrian'), (' y seguro que luego podemos esperar por', 'Melisa'), (' y seguro que luego podemos esperar por', 'Marcos'), (' y seguro que luego podemos esperar por', 'Lucy')]` – Aug 04 '22 at 11:00
The pattern indicates that the regex starts extracting after **(,|;|.|y)** or at the beginning of the input sentence, and that the match must end with a word that begins with a capital letter (which will be the name of the associated person). A further condition is that more than one person can be associated with the same context if the names are separated by **commas** and the last person is preceded by a **y**. Keep in mind that the original sentence translates into English as **"it can be dangerous to go alone, maybe it's better to go with Adrian and surely then we can wait for Melisa, Marcos and Lucy at the stop"** – Aug 04 '22 at 11:07
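The rules described in this thread can be sketched in two steps instead of one large regex: first find each group of names joined by commas or `y`, then trim the text on either side of the group. This is only a sketch under several assumptions: `NAME_GROUP`, `AFTER_STOP` and `extract_people` are hypothetical names; any capitalized word is treated as a name (so a capitalized sentence start would be misdetected); no connector is required before a name, since the examples include `por` and a sentence-initial name; and a leading `y` is dropped only when the clause starts right after punctuation, which is inferred from the two expected outputs in the question.

```python
import re

NAME = r"\b[A-Z]\w*"
# One or more names joined by "," or "y", e.g. "Melisa, Marcos y Lucy".
NAME_GROUP = re.compile(rf"{NAME}(?:\s*(?:,|\by\b)\s*{NAME})*")
# Punctuation that closes a clause.
CLAUSE_END = re.compile(r"[,;.]")
# Where the text after a name group stops: punctuation or a connector word.
AFTER_STOP = re.compile(r"[,;.]|\b(?:y|con|juntos a|junto a|en compania de)\b")


def extract_people(sentence):
    """Return (name, associated_sense) pairs for every name in the sentence."""
    groups = list(NAME_GROUP.finditer(sentence))
    results = []
    for i, group in enumerate(groups):
        prev_end = groups[i - 1].end() if i > 0 else 0
        next_start = groups[i + 1].start() if i + 1 < len(groups) else len(sentence)

        # associated_sense_1: text before the group, back to the last punctuation.
        parts = CLAUSE_END.split(sentence[prev_end:group.start()])
        sense_1 = parts[-1].strip()
        if len(parts) > 1:
            # The clause starts right after punctuation: drop a leading "y".
            sense_1 = re.sub(r"^y\s+", "", sense_1)

        # associated_sense_2: text after the group, up to the first stop token.
        after = sentence[group.end():next_start]
        sense_2 = AFTER_STOP.split(after, maxsplit=1)[0].strip()

        sense = " ".join(part for part in (sense_1, sense_2) if part)
        # Every name in the group shares the same associated sense.
        for name in re.findall(NAME, group.group()):
            results.append((name, sense))
    return results


example_1 = ("puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro "
             "que luego podemos esperar por Melisa, Marcos y Lucy en la parada")
example_2 = ("Adrian ya esta en la parada; y alli probablemente esten Lucy y May "
             "en la parada esperandonos")

for example in (example_1, example_2):
    for name, context in extract_people(example):
        print(f"Write '{context}' to {name}.txt")
```

On the two example sentences this prints the lines listed under "Correct output" in the question. Other inputs may need different stop tokens or a stricter definition of a name.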
