2

I know this may seem stupid but I've been looking everywhere and trying with regex and split in vain. My script never works for all type of string I have on my data set.

I have this column that contains raw data that look like (three cases):

20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel
20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX
20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah

So what I want to do is to get the strings in the middle after CC and right before "-". I wrote a script that did work for the 2nd case but never the other two :

1st case: Auchy-Les-Mines
2nd case: LeMarine
3rd case: Bonjour le monde

Here is the regex that I used but never works for all cases: regex = r"\s\b.*-."

Thanks in advance !

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
helloworld
  • 416
  • 2
  • 6
  • 15

2 Answers2

6

You my use

df['Col'].str.extract(r'-CC\s+(.*?)\s+-')

If there can be line breaks between the two delimiters, add the s/dotall flag or use [\w\W]/[\s\S]/[\d\D] instead of a .:

df['Col'].str.extract(r'(?s)-CC\s+(.*?)\s+-')
#                       ^^^^
df['Col'].str.extract(r'-CC\s+([\w\W]*?)\s+-')
#                              ^^^^^^

See the regex demo.

Pattern details

  • -CC - a literal substring
  • \s+ - 1+ whitespaces
  • (.*?) - Group 1 (this value will be returned by .str.extract): any 0+ chars other than newline, as few as possible
  • \s+ - 1+ whitespaces (+ is important here)
  • - - a hyphen

The fact that there are \s+ patterns on both ends of (.*?) will make sure the result is already stripped from whitespace regardless of how many whitespaces there were.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    It works perfectly! Thank you so much and thank especially for explaining the patter details! U just saved my day! thanks :) ! – helloworld Feb 12 '19 at 13:11
  • Already done :) ! Otherwise, there was one more problem haha! I have some rows that looks like: 20180223-FB-BOOST-XXX-CC Vitrolles - 110 - St Valentin where I only want to retrieve "Vitrolles" I get this output instead using the regex u suggested: Vitrolles - 110 so I fixed it using the following regex: (r'-CC\s+(.*?)\s+-[^1-9]') (Apparently, it didn't go green cuz they said I need more reputation to be able to vote :( ) – helloworld Feb 12 '19 at 13:34
  • @helloworld See the latest update, no need for `[^1-9]`. Please consider clicking the grey tick mark on the left to accept the solution. Do not pay attention at the `1` on the left, it does not mean the answer is accepted. – Wiktor Stribiżew Feb 12 '19 at 13:42
  • Same problem! It reutns numbers which is not the the results expected! [^1-9] still does the job tho ! Wdu think? – helloworld Feb 12 '19 at 13:47
  • @helloworld Ok, that is already fine-tuning that only you know how to perform. So, there should be no digit after `-`, right? So, it seems `r'-CC\s+(.*?)\s+-'` does what you need already, but you may also use `r'-CC\s+(.*?)\s+-(?!\s*\d)'` to avoid matching `-` that is followed with 0+ whitespaces and a digit. – Wiktor Stribiżew Feb 12 '19 at 13:53
-1

You can do it rather simple with:

result = df.raw_data.str.extract(r'-CC (.*) -')
ibarrond
  • 6,617
  • 4
  • 26
  • 45