I am very new to pandas and have a data frame similar to the one below:

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

  id                                               mill
0  1  Company A Palm Oil Mill – Special Company A of...
1  2                      Company X POM – Company X Ltd
2  3                DDDD Mill – Company New and Old Ltd
3  4                       Company Not Special – R Mill
4  5                 Greatest Company – Great World POM

What I would like to get from the above data frame is something like the below:

[image: desired output table]

Is there an easy way to extract those substrings into the same column? The mill name is sometimes before and sometimes after the '–', but it will almost always end with Palm Oil Mill, POM or Mill.

smci
Funkeh-Monkeh
  • This will involve NLTK – BENY Apr 02 '18 at 22:14
  • What determines whether you keep what's before or after the hyphen? The second line, especially, isn't obvious. – Prune Apr 02 '18 at 22:16
  • *"extract those substrings"* is not a clear problem statement. You mean *"split on the hyphen (if any), and return the substring ending in 'Mill' or 'POM'"* – smci Apr 02 '18 at 22:34
  • @smci, yes, apologies for not making that clear in the post title. Have updated it now – Funkeh-Monkeh Apr 02 '18 at 22:37

3 Answers

Previous solution: You could use .str.split() and keep only the part before the dash: df.mill = df.mill.str.split(' –').str[0]. This only works when the mill name comes first.

Update: Since you have a few constraints, you could write your own function (called func below) and put whatever logic you need inside it. It splits each string on the dash, loops through the parts, and returns the first part that ends with "Mill" or "POM" (case-insensitive).

Otherwise, I recommend Wen's solution.

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

def func(x):
    # Split the string on the en dash separator
    parts = x.split(' – ')

    # No separator found: return the value unchanged
    if len(parts) < 2:
        return x

    # Return the first part that ends with a mill keyword (case-insensitive)
    for part in parts:
        if part.lower().endswith(('mill', 'pom')):
            return part

    # Nothing matched: return the original string
    return x

df.mill = df.mill.apply(func)

print(df)

Returns:

  id                     mill
0  1  Company A Palm Oil Mill
1  2            Company X POM
2  3                DDDD Mill
3  4                   R Mill
4  5          Great World POM
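
To see how the fallback branches behave, here is a minimal sketch of a variant of the logic above that returns the original string when nothing matches. The name pick_mill_part, its parameters, and the two inputs marked as made-up are assumptions for illustration, not part of the original data.

```python
def pick_mill_part(text, sep=" – ", keywords=("mill", "pom")):
    """Return the first part ending in a keyword, else the original text."""
    for part in text.split(sep):
        if part.lower().endswith(keywords):
            return part
    return text  # no part matched: keep the input unchanged


samples = [
    "Company Not Special – R Mill",  # name after the dash
    "Standalone Mill",               # no dash at all (made-up input)
    "Company Q Ltd – Holdings",      # no keyword anywhere (made-up input)
]
results = [pick_mill_part(s) for s in samples]
print(results)
```

Applied to the data frame, this would be df.mill.apply(pick_mill_part). Whether an unmatched row should keep its original text or become missing is a design choice that depends on how you want to spot parsing failures later.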
Anton vBR

IIUC, you can use str.contains with the keywords Palm Oil Mill, POM and Mill:

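# Split each value into its parts, one column per part
s = df.mill.str.split(' – ', expand=True)

# Keep only the parts containing a keyword, blank out the rest, then join each row
df['Name'] = s[s.apply(lambda x: x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(axis=1)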
df
Out[230]: 
  id                                               mill  \
0  1  Company A Palm Oil Mill – Special Company A of...   
1  2                      Company X POM – Company X Ltd   
2  3                DDDD Mill – Company New and Old Ltd   
3  4                       Company Not Special – R Mill   
4  5                 Greatest Company – Great World POM   
                      Name  
0  Company A Palm Oil Mill  
1            Company X POM  
2                DDDD Mill  
3                   R Mill  
4          Great World POM  
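
One thing to keep in mind: str.contains matches a keyword anywhere in a part, so a part such as "Miller Holdings" (a made-up example) would be selected too. A possible alternative, sketched here with Series.str.extract, anchors the keyword to the end of a dash-delimited part. The regex pattern and the column name name are assumptions, and only three of the original rows are repeated to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({
    "mill": [
        "Company A Palm Oil Mill – Special Company A of CC Ltd",
        "Company Not Special – R Mill",
        "Greatest Company – Great World POM",
    ]
})

# Capture a dash-delimited part that ends in "Mill" or "POM".
# [^–]*? keeps the match inside a single part; the trailing
# (?:–|$) requires the keyword to sit at the end of that part.
pattern = r"(?:^|–)\s*([^–]*?(?:Mill|POM))\s*(?:–|$)"
df["name"] = df["mill"].str.extract(pattern, expand=False)
print(df["name"].tolist())
```

Rows where no part ends in a keyword come back as NaN, which makes them easy to find with df["name"].isna().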
BENY

You want to split on the hyphen (if any), and return the substring ending in 'Mill' or 'POM':

def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'"""
    for subs in s.split('–'):
        subs = subs.strip(' ')
        if subs.endswith('Mill') or subs.endswith('POM'):
            return subs

    return None # parsing error. Could raise Exception instead

df.mill.apply(extract_mill_name)

0    Company A Palm Oil Mill
1              Company X POM
2                  DDDD Mill
3                     R Mill
4            Great World POM
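
The comment in the function mentions raising an exception instead of returning None. A minimal sketch of that variant follows; the name extract_mill_name_strict, the choice of ValueError, and the unparseable input are assumptions for illustration.

```python
def extract_mill_name_strict(s):
    """Like extract_mill_name, but raise instead of returning None."""
    for subs in s.split("–"):
        subs = subs.strip()
        if subs.endswith(("Mill", "POM")):
            return subs
    raise ValueError(f"no part ending in 'Mill' or 'POM': {s!r}")


print(extract_mill_name_strict("Company Not Special – R Mill"))

try:
    extract_mill_name_strict("Company Q Ltd – Holdings")  # made-up input
except ValueError as err:
    print("parse error:", err)
```

With df.mill.apply(extract_mill_name_strict), the first unparseable row stops the run with an error, whereas the None-returning version lets you continue and inspect the failures afterwards.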
smci