3

I have a dataframe like this.

df = pd.DataFrame({
    "Name" : ["ABC LLC Ram corp", "IJK Inc"],
    "id" : [101, 102]
 })

    Name                id
0 ABC LLC Ram corp      101
1 IJK Inc               102

I am trying to split the Name series into multiple rows based on my separator. I am able to split but unable to retain the separators too.

separators = ["inc","corp","llc"]

My expected output is,

Name       id
ABC LLC    101
RAM corp   101
IJK Inc    102

Please help, thanks.

Pyd
  • 6,017
  • 18
  • 52
  • 109

2 Answers2

4

You can use str.findall to find all the occurrence of matching regex pattern in column Name, then assign these matching occurrences to the column Name and explode the dataframe on Name:

pat = fr"(?i)(.*?(?:{'|'.join(separators)}))"
df.assign(Name=df['Name'].str.findall(pat)).explode('Name')

Regex details:

  • (?i) : Case insensitive flag
  • ( : Start of capturing group
  • .*? : Matches any character except line terminators between zero and unlimited times, as few times as possible (lazy match).
  • (?: : start of a non capturing group
  • {'|'.join(separators)}: f-string expression which evaluates to inc|corp|llc
  • ) : End of non-capturing group
  • ) : End of capturing group

        Name   id
0    ABC LLC  101
0   Ram corp  101
1    IJK Inc  102
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
3

A bit verbose approach , by replacing the spaces after the words with comma and then split:

d = dict(zip([f'{i} ' for i in separators],[f'{i},' for i in separators]))
#{'inc ': 'inc,', 'corp ': 'corp,', 'llc ': 'llc,'}

out = (df.assign(Name=df['Name'].str.lower()
       .replace(d,regex=True).str.title().str.split(",")).explode("Name"))

print(out)

       Name   id
0   Abc Llc  101
0  Ram Corp  101
1   Ijk Inc  102
anky
  • 74,114
  • 11
  • 41
  • 70
  • Thanks for the answer :) @Anky, but Subham Answer does not change to title case, so that matches my expected output, not sure interms of performance – Pyd Mar 06 '21 at 14:15