import re

sentences_list = ["El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]

aux_list = []
for i_input_text in sentences_list:

    #separator_symbols = r'(?:(?:,|;|\.|\s+)\s*y\s+|,\s*|;\s*)'
    separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
    
    pattern = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
    
    # Split the sentence using separator_symbols
    frases = re.split(separator_symbols, i_input_text)
    
    aux_frases_list = []
    # Search for the pattern in each split fragment
    for i_frase in frases:
        verbos = re.findall(pattern, i_frase)
        if verbos:
            #print(f"Frase: {i_frase}")
            #print(f"Verbos encontrados: {verbos}")
            aux_frases_list.append(i_frase)
    aux_list = aux_list + aux_frases_list
    
sentences_list = aux_list
print(sentences_list)

How can I make these splits without the text matched by (?:[A-Z]|l[oa]s|la|[eé]l) being removed from the start of the following fragment?

Using this code I am getting this wrong output:

['El coche ((VERB) es) rojo', ' bicicleta ((VERB)está) allí', ' monopatín ((VERB)ha sido pintado) de color rojo', ' camión también ((VERB)funciona) con cargas pesadas', ' hojas ((VERB)son) doradas y ((VERB)son) secas', ' juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos']

Curiously, the fragment "El árbol ((VERB es)) grande" disappeared from the final list entirely, although it should be there.

Instead, I expect to get this list of strings:

["El coche ((VERB) es) rojo", "la bicicleta ((VERB)está) allí", "el monopatín ((VERB)ha sido pintado) de color rojo", "el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande", "las hojas ((VERB)son) doradas y ((VERB)son) secas", "los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]
Matt095
  • Did you mean to use extend() instead of append()? – B Remmelzwaal Feb 13 '23 at 21:25
  • I don't know whether using `append()` is conceptually correct, but it seems to work for the result I got there. Would anything change if I used `extend()`? In any case, I think the text matched by `(?:[A-Z]|l[oa]s|la|[eé]l)` is lost during `re.split()` – Matt095 Feb 13 '23 at 21:30
  • _Append_ adds an item as a single element. E.g. `[1, 2, 3].append([4, 5])` gives `[1, 2, 3, [4, 5]]`. _Extend_ adds the elements of the container to the list. E.g. `[1, 2, 3].extend([4, 5])` gives `[1, 2, 3, 4, 5]`. – B Remmelzwaal Feb 13 '23 at 21:35
  • Mmm... So in this way I could avoid resorting to the auxiliary lists to later add them `aux_list = aux_list + aux_frases_list` ? And still get the same result – Matt095 Feb 13 '23 at 21:37
  • Haven't read into your code too much, why not try it? – B Remmelzwaal Feb 13 '23 at 21:39
  • @BRemmelzwaal It is a way of reducing the code, but even so the regex is responsible for the fact that instead of splitting this string `"El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí"` into this list of strings `["El coche ((VERB) es) rojo", "la bicicleta ((VERB)está) allí"]` , it splits it incorrectly like this `[["El coche ((VERB) es) rojo", " bicicleta ((VERB)está) allí"]]`. – Matt095 Feb 13 '23 at 21:44
  • if I use `aux_frases_list.extend(i_frase)` I get this output even worse than the previous one `['E', 'l', ' ', 'c', 'o', 'c', 'h', 'e', ' ', '(', '(', 'V', 'E', 'R', 'B', ')', ' ', 'e', 's', ')', ' ', 'r', 'o', 'j', 'o', ' ', 'b', 'i', 'c', 'i', 'c', 'l', 'e', 't', 'a', .....` That's why I used `append()` – Matt095 Feb 13 '23 at 21:47
  • I see, I would use append instead of the +. – B Remmelzwaal Feb 13 '23 at 22:01
  • the problem with this is the regex – Matt095 Feb 13 '23 at 22:22
  • There is only 1 sentence in the sentences list. Were you expecting it to all work the first time ? – sln Feb 13 '23 at 23:37
  • @sln There are **2 strings (or 2 sentences)** in `sentences_list` variable, and that are `"El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas"`, and `"El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"` – Matt095 Feb 13 '23 at 23:48
  • You lose the split characters in the resulting arrays. The split regex consumes the splitters. Which side do you want to keep them on, left or right? `(?<=)` or `(?=)`. Should be fairly easy. – sln Feb 14 '23 at 00:09
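
The `append()` / `extend()` behaviour discussed in the comments above can be checked with a minimal sketch. The variable names here are illustrative only and do not come from the question's code. It also shows why `extend()` with a single string produced the character-by-character output: a `str` is itself an iterable of characters.

```python
base = [1, 2, 3]

appended = list(base)
appended.append([4, 5])   # the list [4, 5] is added as ONE element

extended = list(base)
extended.extend([4, 5])   # each element of [4, 5] is added separately

chars = []
chars.extend("rojo")      # a str is iterable, so it is added character by character

fragments = []
fragments.append("El coche ((VERB) es) rojo")  # append keeps the string whole

# extend() with a list of strings replaces `aux_list = aux_list + aux_frases_list`
all_fragments = []
all_fragments.extend(fragments)

print(appended)       # [1, 2, 3, [4, 5]]
print(extended)       # [1, 2, 3, 4, 5]
print(chars)          # ['r', 'o', 'j', 'o']
print(all_fragments)  # ['El coche ((VERB) es) rojo']
```

So `append()` is the right call for a single fragment, while `extend()` only makes sense when merging one list of fragments into another.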

1 Answer

My guess is that the splitter regex should be this:

(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)

https://regex101.com/r/jpWfvq/1

 (?: [,.;]? \s* y \s+ | [,;] \s* )   # consumed
 (?=                                 # not consumed
    [A-Z] 
  | l
    (?: [ao] s | a )
  | [eé] l
 )

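This splits either on `y` ("and", optionally preceded by punctuation) or on a comma or semicolon, but only at a boundary where the qualifying text follows. That text sits in a lookahead group, so it is checked without being consumed and stays at the start of the next fragment. As a bonus, the separator also swallows the whitespace that would otherwise lead each fragment.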
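
As a rough sketch of how this could be plugged into the question's loop (the helper name `split_clauses` and the `relaxed_tag` pattern are my own assumptions, not part of the original code): the lookahead keeps the article at the start of each fragment. Separately, the "El árbol" fragment is not lost by the split at all. It is dropped by the filter, because the question's tag pattern requires `((VERB)` while that sentence is written `((VERB es))`. If that spelling is intentional, one possible fix is to make the closing parenthesis after `VERB` optional.

```python
import re

sentences_list = [
    "El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas",
    "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos",
]

# The separator is consumed; the lookahead (?=...) only checks what follows.
separator = r"(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)"

# Tag pattern from the question: requires the exact "((VERB)" spelling.
strict_tag = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
# Assumption: also accept "((VERB es))" by making the ")" after VERB optional.
relaxed_tag = r"\(\(VERB\)?\s*\w+(?:\s+\w+)*\)"


def split_clauses(sentences, tag_pattern):
    """Split each sentence and keep only fragments containing a VERB tag."""
    result = []
    for sentence in sentences:
        for fragment in re.split(separator, sentence):
            if re.search(tag_pattern, fragment):
                result.append(fragment)
    return result


strict_result = split_clauses(sentences_list, strict_tag)    # 6 fragments, "El árbol ..." filtered out
relaxed_result = split_clauses(sentences_list, relaxed_tag)  # 7 fragments, matches the expected list
print(strict_result)
print(relaxed_result)
```

With `relaxed_tag`, the output matches the list the question asks for. With the original pattern, the articles are preserved but "El árbol ((VERB es)) grande" is still filtered out.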

sln