2

I've got the following df called "places"

                   place_name
0                 "Palais et bâtiments officiels[modifier | modifier le code]"
1                 "Lieux de culte renommés[modifier | modifier le code]"
2                 "Vestiges gallo-romains[modifier | modifier le code]"

As you can see, the same substring [modifier | modifier le code] appears in every value of places["place_name"], and I would like to delete it.

I tried the following two techniques

places["place_name"] = places["place_name"].apply(lambda x: re.sub("\\[modifier \\| modifier le code\\]", "", x))

places["places_name"] = places["place_name"].str.replace("[modifier | modifier le code]", "", regex=False) 

Neither of these works. I think the problem is that the substring I am trying to delete is attached directly to the preceding text (note that there is no space before it), so the code does not recognise it as a separate string. I have also tried the split() method, but I run into the same issue because there is no space before the substring I am trying to delete.

Final output should be

                   place_name
0                 "Palais et bâtiments officiels"
1                 "Lieux de culte renommés"
2                 "Vestiges gallo-romains"

I have looked for other solutions but can't find any. I know there are a lot of questions about strings, but I can't find a specific solution for this.

Wiktor Stribiżew
aramis

2 Answers

3

You should use Series.str.split:

places["place_name"] = places["place_name"].str.split('\\[modifier').str[0]

Basically, split each string on '[modifier' and pick the first element ([0]).
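A minimal sketch of the split approach on the sample data from the question. The DataFrame is rebuilt here only so the snippet runs on its own; the pattern is written as a raw string, which is equivalent to the escaped form above.

```python
import pandas as pd

places = pd.DataFrame({
    "place_name": [
        "Palais et bâtiments officiels[modifier | modifier le code]",
        "Lieux de culte renommés[modifier | modifier le code]",
        "Vestiges gallo-romains[modifier | modifier le code]",
    ]
})

# By default a multi-character pattern is treated as a regex, so "[" is escaped.
parts = places["place_name"].str.split(r"\[modifier")

# Each row is now a list; .str[0] keeps the text before the marker.
places["place_name"] = parts.str[0]

cleaned = places["place_name"].tolist()
print(cleaned)
```

The split does not depend on a space before '[modifier': it cuts wherever the pattern occurs inside the string.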

Mayank Porwal
  • @aramis You may use `"[modifier"` with `rsplit`, since it does not use a regex and you only have one `[modifier` in the string anyway; see my answer for a couple more solutions. – Wiktor Stribiżew Nov 08 '20 at 13:31
1

I suggest

  1. Removing everything from any whitespace (0 or more) followed by [modifier to the end of the string:
places["place_name"].str.replace(r'\s*\[modifier.*', '', regex=True)

Here, \s* matches 0+ whitespaces, \[ matches [ and modifier.* matches modifier and then any 0+ chars other than line break chars, as many as possible.

See this regex demo.

  2. Extracting all text from the beginning of the string till the first [:
places["place_name"] = places["place_name"].str.extract(r'^([^][]+)', expand=False)

See the regex demo. Details:

  • ^ - start of string
  • ([^][]+) - Capturing group 1 (Series.str.extract requires a capturing group to return any value): one or more chars other than ] and [.
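To see what the two patterns above match, here is a small sketch using only the standard re module rather than pandas. The second and third sample strings are hypothetical variants (a space before the bracket, and no suffix at all) added to show how the patterns differ; they are not from the question.

```python
import re

# Pattern 1: optional whitespace, a literal "[", "modifier", then the rest of the line.
strip_suffix = re.compile(r"\s*\[modifier.*")
# Pattern 2: from the start, one or more characters that are neither "]" nor "[".
leading_text = re.compile(r"^([^][]+)")

samples = [
    "Palais et bâtiments officiels[modifier | modifier le code]",
    "Vestiges gallo-romains [modifier | modifier le code]",  # hypothetical: space before "["
    "Lieux de culte renommés",                               # hypothetical: no suffix at all
]

removed = [strip_suffix.sub("", s) for s in samples]
extracted = [leading_text.match(s).group(1) for s in samples]

print(removed)
print(extracted)
```

In the hypothetical second sample, the extraction pattern keeps the trailing space, because a space is neither ] nor [, while the removal pattern consumes it through \s*. If such input is possible, a .str.strip() after the extraction would be one way to handle it.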

Pandas test:

>>> import pandas as pd
>>> places = pd.DataFrame({'place_name':["Palais et bâtiments officiels[modifier | modifier le code]","Lieux de culte renommés[modifier | modifier le code]","Vestiges gallo-romains[modifier | modifier le code]"]})
>>> places["place_name"] = places["place_name"].str.extract(r'^([^][]+)', expand=False)
>>> places
                      place_name
0  Palais et bâtiments officiels
1        Lieux de culte renommés
2         Vestiges gallo-romains

>>> places["place_name"].str.replace(r'\s*\[modifier.*', '', regex=True)
0    Palais et bâtiments officiels
1          Lieux de culte renommés
2           Vestiges gallo-romains

And if you prefer split, you may use Series.str.rsplit, which takes a literal string, not a regex:

>>> places["place_name"].str.rsplit('[modifier').str[0]
0    Palais et bâtiments officiels
1          Lieux de culte renommés
2           Vestiges gallo-romains
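For completeness, a sketch of a purely literal replacement, which needs no escaping at all. It assumes the suffix is always exactly "[modifier | modifier le code]" and writes the result back to the existing place_name column (the second attempt in the question assigned to places["places_name"], which is a different column name).

```python
import pandas as pd

places = pd.DataFrame({
    "place_name": [
        "Palais et bâtiments officiels[modifier | modifier le code]",
        "Lieux de culte renommés[modifier | modifier le code]",
        "Vestiges gallo-romains[modifier | modifier le code]",
    ]
})

suffix = "[modifier | modifier le code]"

# regex=False treats the pattern as plain text, so "[", "|" and "]" need no escaping.
places["place_name"] = places["place_name"].str.replace(suffix, "", regex=False)

cleaned = places["place_name"].tolist()
print(cleaned)
```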
Wiktor Stribiżew
  • 1
    thank you very much for the expansive answer, this is great for this task but also expanded my knowledge of regex – aramis Nov 08 '20 at 21:55