pandas: extract specific text before or after hyphen, that ends in given substrings

Question

I am very new to pandas and have a data frame similar to the below

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

  id                                               mill
0  1  Company A Palm Oil Mill – Special Company A of...
1  2                      Company X POM – Company X Ltd
2  3                DDDD Mill – Company New and Old Ltd
3  4                       Company Not Special – R Mill
4  5                 Greatest Company – Great World POM

What I would like to get from the above data frame is something like the below:

Is there an easy way to extract those substrings into the same column? The mill name can sometimes be before and other times after the '-', but it will almost always end with 'Palm Oil Mill', 'POM' or 'Mill'.

What determines whether you keep what's before or after the hyphen? The second line, especially, isn't obvious. — Prune, Apr 02 '18 at 22:16
*"extract those substrings"* is not a clear problem statement. You mean *"split on the hyphen (if any), and return the substring ending in 'Mill' or 'POM'"* — smci, Apr 02 '18 at 22:34
@smci, yes, apologies for not making that clear in the post title. I have updated it now. — Funkeh-Monkeh, Apr 02 '18 at 22:37

Anton vBR · Answer 1 · 2018-04-02T22:33:47.683

Previous solution: You could use .str.split() and do this: df.mill = df.mill.str.split(' –').str[0].

Update: Seeing that you have a few constraints, you could build your own function (called func below) and put any logic you want inside it. It splits each string on ' – ', loops through the parts, and returns the first part that ends in 'mill' or 'pom' (case-insensitive). If no part matches, the string is returned unchanged.

Otherwise I recommend Wen's solution.

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

def func(x):
    # Split the string on the en dash separator
    parts = x.split(' – ')

    # No separator found: return the value unchanged
    if len(parts) < 2:
        return x

    # Otherwise return the first part that ends in 'mill' or 'pom'
    for part in parts:
        if part.lower().endswith(('mill', 'pom')):
            return part

    # Nothing found: return the original string
    return x

df.mill = df.mill.apply(func)

print(df)

Returns:

  id                     mill
0  1  Company A Palm Oil Mill
1  2            Company X POM
2  3                DDDD Mill
3  4                   R Mill
4  5          Great World POM
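
The sample data separates the parts with an en dash ('–'), while the title speaks of a hyphen. If real data mixes both, a regular expression can split on either one. The sketch below is only an illustration under that assumption: the names DASH and pick_mill and the 'Semi-Special Mill' row are made up for the example. Requiring whitespace on both sides of the dash keeps hyphenated names in one piece.

```python
import re

import pandas as pd

# Hypothetical separator: a hyphen or an en dash with whitespace on both sides.
DASH = re.compile(r'\s+[-–]\s+')

def pick_mill(text):
    """Return the first part ending in 'mill' or 'pom', else the original text."""
    for part in DASH.split(text):
        part = part.strip()
        if part.lower().endswith(('mill', 'pom')):
            return part
    return text

df = pd.DataFrame({'mill': ["Company X POM - Company X Ltd",       # plain hyphen
                            "Greatest Company – Great World POM",  # en dash
                            "Semi-Special Mill – ABC Ltd"]})       # hyphen inside a name
df['name'] = df.mill.apply(pick_mill)
```

A hyphen without surrounding spaces, as in 'Semi-Special', is not treated as a separator here, so the name survives intact.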

@Wen Yes I also saw this after posting and joined the discussion. I'm editing. — Anton vBR, Apr 02 '18 at 22:19

score 1 · Accepted Answer · answered Apr 02 '18 at 22:23

IIUC, you can use str.contains with the keywords Palm Oil Mill, POM and Mill:

s = df.mill.str.split(' – ', expand=True)

df['Name']=s[s.apply(lambda x : x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(1)
df
Out[230]: 
  id                                               mill  \
0  1  Company A Palm Oil Mill – Special Company A of...   
1  2                      Company X POM – Company X Ltd   
2  3                DDDD Mill – Company New and Old Ltd   
3  4                       Company Not Special – R Mill   
4  5                 Greatest Company – Great World POM   
                      Name  
0  Company A Palm Oil Mill  
1            Company X POM  
2                DDDD Mill  
3                   R Mill  
4          Great World POM
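
One thing to watch with str.contains: it matches a keyword anywhere in a part, not only at the end, so a part such as 'Miller Trading Ltd' would also count as a match. A possible variant, sketched here with an invented third row to show the difference, anchors the pattern with $ and then takes the first matching part per row. The intermediate names parts and ends_with_keyword are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "3"],
                   'mill': ["Company X POM – Company X Ltd",
                            "Company Not Special – R Mill",
                            "Miller Trading Ltd – DDDD Mill"]})  # invented edge case

# One column per part of the string.
parts = df.mill.str.split(' – ', expand=True)

# True only where a part ENDS in a keyword ('Palm Oil Mill' already ends in 'Mill').
ends_with_keyword = parts.apply(
    lambda col: col.str.contains(r'(?:Mill|POM)$', na=False))

# Blank out non-matching parts, then pull the first remaining part into column 0.
df['Name'] = parts.where(ends_with_keyword).bfill(axis=1).iloc[:, 0]
```

Rows where no part ends in a keyword come out as missing values in Name, which makes them easy to spot afterwards.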

score 1 · Answer 3 · answered Apr 02 '18 at 22:43

You want to split on the hyphen (if any), and return the substring ending in 'Mill' or 'POM':

def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'"""
    for subs in s.split('–'):
        subs = subs.strip(' ')
        if subs.endswith('Mill') or subs.endswith('POM'):
            return subs

    return None # parsing error. Could raise Exception instead

df.mill.apply(extract_mill_name)

0    Company A Palm Oil Mill
1              Company X POM
2                  DDDD Mill
3                     R Mill
4            Great World POM
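
Because the function returns None when no part matches, the unparsed rows can be listed after the apply. This is a small usage sketch: the function is repeated so the snippet runs on its own, and the third row and the column name 'name' are made up for the example.

```python
import pandas as pd

def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'"""
    for subs in s.split('–'):
        subs = subs.strip()
        if subs.endswith(('Mill', 'POM')):
            return subs
    return None  # no part matched

df = pd.DataFrame({'mill': ["Company X POM – Company X Ltd",
                            "Company Not Special – R Mill",
                            "No Keyword Here – Some Ltd"]})  # invented failing row

# Store the result in a new column so the original text is kept.
df['name'] = df.mill.apply(extract_mill_name)

# Rows where parsing failed show up as missing values.
unparsed = df[df['name'].isna()]
```

Inspecting unparsed first is one way to decide whether the keyword list needs extending before the original column is overwritten.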
