How to use multiple separators in a pandas Series and split into multiple rows

Question

I have a dataframe like this.

import pandas as pd

df = pd.DataFrame({
    "Name": ["ABC LLC Ram corp", "IJK Inc"],
    "id": [101, 102]
})

    Name                id
0 ABC LLC Ram corp      101
1 IJK Inc               102

I am trying to split the Name column into multiple rows based on a list of separators. I can split on them, but I cannot keep the separators in the result.

separators = ["inc","corp","llc"]

My expected output is:

Name       id
ABC LLC    101
Ram corp   101
IJK Inc    102

Please help, thanks.

Shubham Sharma · Accepted Answer · 2021-03-06T14:27:24.747

4

You can use str.findall to find all occurrences of a matching regex pattern in column Name, then assign the resulting lists back to Name and explode the dataframe on Name:

pat = fr"(?i)(.*?(?:{'|'.join(separators)}))"
df.assign(Name=df['Name'].str.findall(pat)).explode('Name')

Regex details:

(?i) : Case insensitive flag
( : Start of capturing group
.*? : Matches any character except line terminators between zero and unlimited times, as few times as possible (lazy match).
(?: : Start of non-capturing group
{'|'.join(separators)}: f-string expression which evaluates to inc|corp|llc
) : End of non-capturing group
) : End of capturing group

        Name   id
0    ABC LLC  101
0   Ram corp  101
1    IJK Inc  102
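
A self-contained version of the same idea is sketched below. It is not part of the original answer: it assumes you may want to escape the separators with re.escape, use \b so that a separator is not matched inside a longer word, and skip the whitespace between chunks with a leading \s* (the pattern above appears to keep the space in front of "Ram corp"). The reset_index call is optional.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Name": ["ABC LLC Ram corp", "IJK Inc"],
    "id": [101, 102],
})
separators = ["inc", "corp", "llc"]

# Escape each separator so regex metacharacters are matched literally.
alternation = "|".join(re.escape(s) for s in separators)

# \s* sits outside the capturing group, so the space between chunks is dropped.
# \b keeps a separator from matching in the middle of a longer word.
pat = rf"(?i)\s*(.*?\b(?:{alternation})\b)"

out = (
    df.assign(Name=df["Name"].str.findall(pat))  # one list of chunks per row
      .explode("Name")                           # one row per chunk
      .reset_index(drop=True)
)
print(out)
```

With a single capturing group, findall returns only the captured text, which is why the whitespace matched by \s* does not end up in the result.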

edited Mar 06 '21 at 14:27

answered Mar 06 '21 at 14:06


Thank you @subham, can you explain the regex if possible? – Pyd Mar 06 '21 at 14:07

@pyd Sure, give me a minute. – Shubham Sharma Mar 06 '21 at 14:08

@pyd For now you can check [`this link`](https://regex101.com/r/RYMw7h/1) to see the regex pattern in action. In the meantime I will edit the answer to include the details as well. – Shubham Sharma Mar 06 '21 at 14:18
Hi @subham, how can we use a variable for the column name in df.assign, like `a="Name";df.assign(a,....)`? – Pyd Mar 08 '21 at 08:19
Hi @pyd I guess we can simply do `df.assign(a=<value>)`, where `<value>` could be a `scalar`, a `series` or even a `list` with the same length as the dataframe. – Shubham Sharma Mar 08 '21 at 08:22
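
If the goal is to use the string stored in the variable `a` as the column name, a keyword argument written as `a=` will not do that, since it creates or overwrites a column literally named "a". A minimal sketch of one common workaround, dictionary unpacking, is shown here; the str.upper() call is only a stand-in value for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ABC LLC Ram corp", "IJK Inc"],
    "id": [101, 102],
})

a = "Name"  # column name held in a variable

# Unpacking a dict passes the variable's value ("Name") as the keyword,
# so the existing Name column is replaced instead of adding a column "a".
out = df.assign(**{a: df[a].str.upper()})
print(out)
```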
Hi @Subham, in some cases the name will not end with a separator, can we do anything in that case? – Pyd Mar 09 '21 at 11:15
@pyd Let's discuss [`here`](https://chat.stackoverflow.com/rooms/228691/ds5) – Shubham Sharma Mar 09 '21 at 12:15
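
The chat discussion is not reproduced in this thread, so the following is only one possible way to handle names that do not end with a separator, not necessarily the solution reached there. The sketch assumes trailing text should become its own row, and uses hypothetical sample data ("Foo") to show it: the pattern gets a second alternative that captures whatever is left once no further separator can be found.

```python
import re

import pandas as pd

# Hypothetical data: the first name has trailing text after the last separator.
df = pd.DataFrame({
    "Name": ["ABC LLC Ram corp Foo", "IJK Inc"],
    "id": [101, 102],
})
separators = ["inc", "corp", "llc"]
alternation = "|".join(re.escape(s) for s in separators)

# First alternative: a chunk that ends with a separator.
# Second alternative: the remaining text, starting at a non-space character.
pat = rf"(?i)\s*(.*?\b(?:{alternation})\b|\S.*)"

out = (
    df.assign(Name=df["Name"].str.findall(pat))
      .explode("Name")
      .reset_index(drop=True)
)
print(out)
```

The order of the alternatives matters: the regex engine tries the separator-terminated chunk first and falls back to the leftover text only when that fails.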

score 3 · Answer 2 · answered Mar 06 '21 at 14:12


A slightly more verbose approach: replace the space after each separator with a comma, then split on the comma:

d = dict(zip([f'{i} ' for i in separators],[f'{i},' for i in separators]))
#{'inc ': 'inc,', 'corp ': 'corp,', 'llc ': 'llc,'}

out = (df.assign(Name=df['Name'].str.lower()
       .replace(d,regex=True).str.title().str.split(",")).explode("Name"))

print(out)

       Name   id
0   Abc Llc  101
0  Ram Corp  101
1   Ijk Inc  102
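
Because this approach lowercases and then title-cases the names, the original casing is lost. A variant that keeps the casing is sketched below. It is not from the original answer, and it assumes the names themselves contain no commas. It uses a case-insensitive regex with a backreference, so the separator is written back exactly as it was matched.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Name": ["ABC LLC Ram corp", "IJK Inc"],
    "id": [101, 102],
})
separators = ["inc", "corp", "llc"]
alternation = "|".join(re.escape(s) for s in separators)

# Replace the whitespace that follows a separator with a comma.
# \1 re-inserts the matched separator unchanged, so its case is preserved.
marked = df["Name"].str.replace(
    rf"(?i)\b({alternation})\s+", r"\1,", regex=True
)

out = (
    df.assign(Name=marked.str.split(","))
      .explode("Name")
      .reset_index(drop=True)
)
print(out)
```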

answered Mar 06 '21 at 14:12

anky


Thanks for the answer :) @Anky, but Shubham's answer does not change the text to title case, so it matches my expected output. I am not sure how the two compare in terms of performance. – Pyd Mar 06 '21 at 14:15
