Trying to get a substring using regex in Python / pandas

Question

I know this may seem stupid but I've been looking everywhere and trying with regex and split in vain. My script never works for all type of string I have on my data set.

I have this column that contains raw data that look like (three cases):

20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel
20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX
20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah

So what I want to do is to get the strings in the middle after CC and right before "-". I wrote a script that did work for the 2nd case but never the other two :

1st case: Auchy-Les-Mines
2nd case: LeMarine
3rd case: Bonjour le monde

Here is the regex that I used but never works for all cases: regex = r"\s\b.*-."

Thanks in advance !

`str.extract(r'-CC\s*(.*?)\s+-')`? See https://regex101.com/r/2LagAy/1 — Wiktor Stribiżew, Feb 12 '19 at 12:57
@WiktorStribiżew you don't need to put in the whitespace at the beginning, a better approach to that regex would be "-CC.*?\s-" — Pablo Alvarez, Feb 12 '19 at 13:04
@PabloAlvarez See [my answer](https://stackoverflow.com/a/54650702/3832970) where I explain why it is there — Wiktor Stribiżew, Feb 12 '19 at 13:05
You can use split method and select 2nd index, if your data has this structure. — Farshid Shekari, Feb 12 '19 at 13:08

Wiktor Stribiżew · Accepted Answer · 2022-09-20T07:02:31.993

6

You my use

df['Col'].str.extract(r'-CC\s+(.*?)\s+-')

If there can be line breaks between the two delimiters, add the s/dotall flag or use [\w\W]/[\s\S]/[\d\D] instead of a .:

df['Col'].str.extract(r'(?s)-CC\s+(.*?)\s+-')
#                       ^^^^
df['Col'].str.extract(r'-CC\s+([\w\W]*?)\s+-')
#                              ^^^^^^

See the regex demo.

Pattern details

-CC - a literal substring
\s+ - 1+ whitespaces
(.*?) - Group 1 (this value will be returned by .str.extract): any 0+ chars other than newline, as few as possible
\s+ - 1+ whitespaces (+ is important here)
- - a hyphen

The fact that there are \s+ patterns on both ends of (.*?) will make sure the result is already stripped from whitespace regardless of how many whitespaces there were.

edited Sep 20 '22 at 07:02

answered Feb 12 '19 at 13:01

Wiktor Stribiżew

607,720
39
448
563

1

It works perfectly! Thank you so much and thank especially for explaining the patter details! U just saved my day! thanks :) ! – helloworld Feb 12 '19 at 13:11
Already done :) ! Otherwise, there was one more problem haha! I have some rows that looks like: 20180223-FB-BOOST-XXX-CC Vitrolles - 110 - St Valentin where I only want to retrieve "Vitrolles" I get this output instead using the regex u suggested: Vitrolles - 110 so I fixed it using the following regex: (r'-CC\s+(.*?)\s+-[^1-9]') (Apparently, it didn't go green cuz they said I need more reputation to be able to vote :( ) – helloworld Feb 12 '19 at 13:34
@helloworld See the latest update, no need for `[^1-9]`. Please consider clicking the grey tick mark on the left to accept the solution. Do not pay attention at the `1` on the left, it does not mean the answer is accepted. – Wiktor Stribiżew Feb 12 '19 at 13:42
Same problem! It reutns numbers which is not the the results expected! [^1-9] still does the job tho ! Wdu think? – helloworld Feb 12 '19 at 13:47
@helloworld Ok, that is already fine-tuning that only you know how to perform. So, there should be no digit after `-`, right? So, it seems `r'-CC\s+(.*?)\s+-'` does what you need already, but you may also use `r'-CC\s+(.*?)\s+-(?!\s*\d)'` to avoid matching `-` that is followed with 0+ whitespaces and a digit. – Wiktor Stribiżew Feb 12 '19 at 13:53

score -1 · Answer 2 · answered Feb 12 '19 at 13:02

-1

You can do it rather simple with:

result = df.raw_data.str.extract(r'-CC (.*) -')

answered Feb 12 '19 at 13:02

ibarrond

6,617
4
26
45

Trying to get a substring using regex in Python / pandas

2 Answers2