Deleting all text starting with a specific string in a pandas series

Question

I've got the following df called "places"

                   place_name
0                 "Palais et bâtiments officiels[modifier | modifier le code]"
1                 "Lieux de culte renommés[modifier | modifier le code]"
2                 "Vestiges gallo-romains[modifier | modifier le code]"

As you can see, every value in places["place_name"] ends with the same substring, [modifier | modifier le code], and I would like to delete it.

I tried the following two techniques

places["place_name"] = places["place_name"].apply(lambda x: re.sub("\\[modifier \\| modifier le code\\]", "", x))

places["places_name"] = places["place_name"].str.replace("[modifier | modifier le code]", "", regex=False)

Neither of these works. I think the problem is that the substring I am trying to delete is attached directly to the preceding text (note that there is no space before it), so the code does not recognise it as a separate string. I have also tried the split() method, but I run into the same issue because there is no space before the substring I am trying to delete.

Final output should be

                   place_name
0                 "Palais et bâtiments officiels"
1                 "Lieux de culte renommés"
2                 "Vestiges gallo-romains"

I have looked for other solutions but can't find any. I know there are a lot of questions about strings, but I can't find a specific solution for this.

Mayank Porwal · Accepted Answer · 2020-11-08T12:48:34.160

3

You should use Series.str.split:

places["place_name"] = places["place_name"].str.split('\\[modifier').str[0]

Basically, split each string on '[modifier' and pick the first value ([0]).
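For anyone who wants to try the split approach end to end, here is a minimal sketch. The DataFrame is rebuilt from the three rows in the question; the `n=1` argument and the `cleaned` variable name are my additions, not part of the answer above. The pattern is longer than one character, so pandas should treat it as a regular expression, which is why the bracket is escaped.

```python
import pandas as pd

# Sample data copied from the question.
places = pd.DataFrame({
    "place_name": [
        "Palais et bâtiments officiels[modifier | modifier le code]",
        "Lieux de culte renommés[modifier | modifier le code]",
        "Vestiges gallo-romains[modifier | modifier le code]",
    ]
})

# Split each value on "[modifier" (escaped, because the pattern is
# interpreted as a regex) and keep only the part before it.
# n=1 is an optional addition: stop after the first split.
cleaned = places["place_name"].str.split(r"\[modifier", n=1).str[0]

places["place_name"] = cleaned
print(places)
```

Each row becomes a list of pieces after the split, and `.str[0]` picks the first piece of every list.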

edited Nov 08 '20 at 12:48

answered Nov 08 '20 at 12:36


@aramis You may use `"[modifier"` with `rsplit` since it does not use a regex and you only have one `[modifier` anyway in the string, see my answer with a couple more solutions. – Wiktor Stribiżew Nov 08 '20 at 13:31

Wiktor Stribiżew · Answer 2 · 2020-11-08T13:28:58.217

I suggest

Removing all starting from 0+ whitespaces and [modifier:

places["place_name"].str.replace(r'\s*\[modifier.*', '', regex=True)

Here, \s* matches zero or more whitespace characters, \[ matches a literal [, and modifier.* matches modifier followed by zero or more characters other than line break characters, as many as possible.

See this regex demo.
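A small sketch of how this pattern behaves on slightly messier input. The second and third rows are hypothetical (a space before the bracket, and no marker at all); they are not from the question. I pass `regex=True` explicitly because, as far as I know, pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

s = pd.Series([
    "Palais et bâtiments officiels[modifier | modifier le code]",
    # Hypothetical row: whitespace before the marker, consumed by \s*
    "Lieux de culte renommés [modifier | modifier le code]",
    # Hypothetical row: no marker, so nothing matches and it is left as is
    "Vestiges gallo-romains",
])

# regex=True makes pandas interpret the pattern as a regular expression.
cleaned = s.str.replace(r"\s*\[modifier.*", "", regex=True)
print(cleaned.tolist())
```

Rows without the marker pass through unchanged, so the call should be safe to run on a column that is only partly affected.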

Extracting all text from the beginning of the string till the first [:

places["place_name"] = places["place_name"].str.extract(r'^([^][]+)', expand=False)

See the regex demo. Details:

^ - start of string
([^][]+) - Capturing group 1 (Series.str.extract requires a capturing group to return any value): one or more chars other than ] and [.

Pandas test:

>>> import pandas as pd
>>> places = pd.DataFrame({'place_name':["Palais et bâtiments officiels[modifier | modifier le code]","Lieux de culte renommés[modifier | modifier le code]","Vestiges gallo-romains[modifier | modifier le code]"]})
>>> places["place_name"] = places["place_name"].str.extract(r'^([^][]+)', expand=False)
>>> places
                      place_name
0  Palais et bâtiments officiels
1        Lieux de culte renommés
2         Vestiges gallo-romains

>>> places["place_name"].str.replace(r'\s*\[modifier.*', '', regex=True)
0    Palais et bâtiments officiels
1          Lieux de culte renommés
2           Vestiges gallo-romains

And if you prefer split, you may use Series.str.rsplit that uses a literal string, not a regex:

>>> places["place_name"].str.rsplit('[modifier').str[0]
0    Palais et bâtiments officiels
1          Lieux de culte renommés
2           Vestiges gallo-romains
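A short sketch of the same rsplit idea on a column where not every row carries the marker. The second row is hypothetical, and `n=1` is my addition to stop after the first split from the right. Since rsplit takes the separator literally, the bracket needs no escaping.

```python
import pandas as pd

s = pd.Series([
    "Palais et bâtiments officiels[modifier | modifier le code]",
    "Vestiges gallo-romains",  # hypothetical row without the marker
])

# rsplit treats "[modifier" as plain text; a row without it yields a
# one-element list, so .str[0] returns the original value unchanged.
cleaned = s.str.rsplit("[modifier", n=1).str[0]
print(cleaned.tolist())
```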

thank you very much for the expansive answer, this is great for this task but also expanded my knowledge of regex — aramis, Nov 08 '20 at 21:55
