
Please help, regex has blown my mind.

I am cleaning data in a pandas DataFrame (Python 3).

I have tried many regex combinations found on the web for matching digits, but none work for my case. I can't figure out how to write my own regex for the pattern: 2 digits, space, "to", space, 2 digits (for example, 26 to 40).

My challenge is to extract the number of petals from the pandas column BLOOM (scraped data). Petals are frequently specified as "dd to dd petals". I know that 2 digits in regex are \d\d or \d{2}, but how do I incorporate the "to" separator? It would also be good to have a condition that the pattern is followed by the word "petals".

Surely I am not the first person who needs a Python regex for the pattern \d\d to \d\d.

Edit:

I realised that my question without a sample dataframe is a bit confusing. So here is a sample dataframe.

import pandas as pd 
import re

# initialize list of lists 
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
    ['Evghenya', 'Orange-pink.  75 petals.  Large, very double bloom form.  Blooms in flushes throughout the season.'], 
    ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
    ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM']) 

# print dataframe. 
df 
The smell of roses

3 Answers


This worked for me:

import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)

Output:

['26 to 40 petals', '16 to 43 petals']

As you have stated, \d{2} finds 2-digit numbers. \sto\s finds the word 'to' with a single whitespace character on each side, then \d{2} matches the second 2-digit number, followed by a whitespace character (\s) and the word 'petals'.
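To run the same pattern over a pandas column instead of a single string, something like the following should work. This is only a sketch: the two strings are shortened from the question's sample data, and the names bloom and matches are made up for the illustration.

```python
import pandas as pd

# Two BLOOM-style strings shortened from the question's sample data.
bloom = pd.Series([
    'Mild fragrance.  35 to 40 petals.  Average diameter 2.5".',
    'Orange-pink.  75 petals.  Large, very double bloom form.',
])

# Same pattern as above: two digits, "to" between single whitespace
# characters, two digits, one whitespace character, the word "petals".
pattern = r"\d{2}\sto\s\d{2}\spetals"

# Series.str.findall returns, for each row, a list of all matches.
matches = bloom.str.findall(pattern)
print(matches.tolist())
# [['35 to 40 petals'], []]
```

The second row only has "75 petals", so it gives an empty list rather than a match.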

footfalcon

You can use

df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

See the regex demo

Details

  • (?<!\d) - a negative lookbehind making sure there is no digit immediately on the left of the current location
  • (\d{2}\s+to\s+\d{2}) - Group 1 (the actual return of str.extract):
    • \d{2} - two digits
    • \s+to\s+ - 1+ whitespaces, the literal string to, 1+ whitespaces
    • \d{2} - two digits
  • \s*petal - 0+ whitespaces followed by the literal string petal (so both petal and petals are accepted).
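A small check of what the lookbehind changes, as a sketch. The test strings are invented for the illustration (they are not from the question's data), and the variable names are arbitrary.

```python
import pandas as pd

s = pd.Series([
    '20 to 25 petals.',    # plain two-digit range
    '135 to 40 petals.',   # invented: three-digit number before "to"
    '35   to   40petals',  # invented: irregular spacing
    '75 petals.',          # no range at all
])

# With the negative lookbehind: a match may not start in the middle of a longer number.
with_lookbehind = s.str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

# Without it: "35 to 40" is picked out of "135 to 40".
without_lookbehind = s.str.extract(r'(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

print(with_lookbehind.tolist())
print(without_lookbehind.tolist())
```

With the lookbehind, the second row comes back as missing (NaN); without it, the same row yields 35 to 40. Rows that contain no range are NaN in both cases.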
Wiktor Stribiżew

Posting an answer to show how I solved the petals data extraction from column BLOOM. I had to use multiple regexes to get all the data that I wanted. This question only covered one of the regexes I used.

Sample dataframe looks like this when printed:

[Image: printed sample dataframe]

I created those columns before I ran into the issue that led to this post. My initial approach was to get all the data in the brackets.

# copying content of column BLOOM inside the first brackets into new column PETALS
df['PETALS'] = df['BLOOM'].str.extract(r'(\(.*?)\)', expand=False).str.strip()
# drop the leading "(" that the capture group keeps; regex=False treats "(" as a literal character
df['PETALS'] = df['PETALS'].str.replace("(", "", regex=False)

# copying content of column BLOOM inside all brackets into new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall(r'(\(.*?)\)')
df[['NAME','BLOOM','PETALS', 'ALL_PETALS_BRACKETS']]

[Image: output of the code above, showing columns NAME, BLOOM, PETALS and ALL_PETALS_BRACKETS]

I later realised that this way I was only getting petal values for some rows. Petals can be specified in column BLOOM in more than one way. Another common pattern is "2 digits to 2 digits". There is also the pattern "2 digits petals.".
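As an aside, the three forms could probably also be caught with one pattern instead of several. This is not what I ended up using (my code follows below), just a sketch: the strings are shortened from the sample data, and the names bloom, pattern and all_counts are made up for the illustration.

```python
import pandas as pd

# Strings shortened from the question's sample data.
bloom = pd.Series([
    'Orange-pink.  75 petals.  Large, very double bloom form.',
    'Mild fragrance.  35 to 40 petals.  Medium, double (17-25 petals), full (26-40 petals).',
    'Mild fragrance.  Large, very double, in small clusters.',
])

# One pattern for "dd petals", "dd to dd petals" and "dd-dd petals":
#   (?<!\d)                    do not start in the middle of a longer number
#   \d{2}                      two digits
#   (?:\s*(?:to|-)\s*\d{2})?   optionally "to" or "-" followed by two more digits
#   (?=\s*petals)              only when the word "petals" follows (not captured)
pattern = r'(?<!\d)\d{2}(?:\s*(?:to|-)\s*\d{2})?(?=\s*petals)'

all_counts = bloom.str.findall(pattern)
print(all_counts.tolist())
# [['75'], ['35 to 40', '17-25', '26-40'], []]
```

Because the word "petals" sits in a lookahead, only the numbers end up in the result, and a row without any petal count gives an empty list.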

# solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

# my modification that worked on the main df and not only on the test one.
# now let's copy the part of column BLOOM that matches the regex pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()

# I also came across cases where the pattern is two digits followed by the word "petals"
# now let's copy the part of column BLOOM that matches the regex pattern two digits followed by the word "petals"
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)', expand=False).str.strip()
df

[Image: the dataframe with the new columns PETALS_Wiktor_S, PETALS5 and PETALS6]

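Since I was after the pattern "2 digits petals." I had to make the regex look for a literal dot by escaping it as \. in r'(\d{2}\s+petals+\.)'. (The + in front of it only repeats the preceding s, so petals\. matches the same text here.) If the regex is written as r'(\d{2}\s+petals.)', the unescaped . matches any character, so it also grabs cases where the word petals is followed by a bracket, as in "(26-40 petals)".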
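The difference between the escaped and the unescaped dot can be checked on two strings shortened from the sample data. This is a sketch with made-up variable names, and it uses petals\. without the extra +.

```python
import pandas as pd

s = pd.Series([
    'Orange-pink.  75 petals.  Large, very double bloom form.',
    'Medium-large, full (26-40 petals), borne mostly solitary.',
])

# Escaped dot: only "petals" followed by a literal "." is accepted.
escaped = s.str.extract(r'(\d{2}\s+petals\.)', expand=False)

# Unescaped dot: "." matches any character, including the closing bracket.
unescaped = s.str.extract(r'(\d{2}\s+petals.)', expand=False)

print(escaped.tolist())
print(unescaped.tolist())
```

The escaped version returns 75 petals. for the first row and a missing value for the second; the unescaped version also returns 40 petals) for the second row.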

The smell of roses