0

Is it possible to split strings from a dataframe column based on a list of words?

For example: There is a dataframe with a column Company, each record includes the company name, a legal form, and sometimes additional information after the legal form like 'electronics'.

Company
XYZ ltd electronics
ABC ABC inc iron
AB XY Z inc
CD EF GHI JK llc incident

I have list with 1500 worldwide legal form for companies (inc, ltd, ...). I would like to split the string in the dataframe column, based on this legal form list for example:

['gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab', 'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc', 'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd', 'og', 'gen']

In other words, to separate everything before and after the words in the list to new columns. This is the desired output:

Company Legal form Addition
XYZ ltd electronics
ABC ABC inc iron
AB XY Z inc
CD EF GHI JK llc incident

Note that "inc" appears in the middle, at the end, and also part of a word in the various company name examples.

Adrian Mole
  • 49,934
  • 160
  • 51
  • 83
hazen23
  • 29
  • 4

2 Answers2

1

Assuming you are just trying to string split after spaces you could try something like this:

import re

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']}, columns=['Company'])
df['Addition'] = df['Company'].apply(lambda x: re.split('(ltd|inc|llc)', x))
print(df)

                    Company                        Addition
0       XYZ ltd electronics       [XYZ , ltd,  electronics]
1          ABC ABC inc iron          [ABC ABC , inc,  iron]
2               AB XY Z inc               [AB XY Z , inc, ]
3  CD EF GHI JK llc chicago  [CD EF GHI JK , llc,  chicago]
blackraven
  • 5,284
  • 7
  • 19
  • 45
user
  • 36
  • 4
1

You could use regular expression (Regex) to filter out the legal form. Each legal form is in this format: \slegalform\s

\s means the legal form is preceded by and ended with a whitespace. Because I have appended all company names with a white space, so the legal form can be at the end as well. The data is processed in lowercase, then converted back to Title Case. So try this:

import pandas as pd
import re

legal_forms = '(\sgmbh\s|\sltd\s|\sinc\s|\ssrl\s|\sspa\s|\sco\s|\ssa\s|\sag\s|\skg\s|\sab\s|\sspol\s|\ssasu\s|\ssas\s|\spvt\s|\ssarl\s|\sgmbh\s&\sco\skg\s|\sllc\s|\silc\s|\scorp\s|\sltda\s|\scoltd\s|\sse\s|\sas\s|\ssp\szoo\s|\splc\s|\spvtltd\s|\sog\s|\sgen\s)'
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc incident']}, columns=['Company'])
df['Coy']= df['Company'].apply(lambda x: [e.strip() for e in re.split(legal_forms, x.lower()+' ')])
print(df)

This will create a list for each company name, separated by the legal form

                     Company                            Coy
0        XYZ ltd electronics        [xyz, ltd, electronics]
1           ABC ABC inc iron           [abc abc, inc, iron]
2                AB XY Z inc               [ab xy z, inc, ]
3  CD EF GHI JK llc incident  [cd ef ghi jk, llc, incident]

After that you can split them into 3 separate columns:

df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
for col in df1.columns:
    df1[col] = df1[col].str.title()
print(df1)

Output:

        Company Legal form     Addition
0           Xyz        Ltd  Electronics
1       Abc Abc        Inc         Iron
2       Ab Xy Z        Inc             
3  Cd Ef Ghi Jk        Llc     Incident
blackraven
  • 5,284
  • 7
  • 19
  • 45
  • I edited the company names. If the company names would contain only one word this should work of course but in my case it‘s not that easy – hazen23 Aug 05 '22 at 15:25
  • sure.. I've edited the answer, this time using regular expression, pls have a look :-) – blackraven Aug 05 '22 at 15:39
  • Thank you again for your anwser but the list contains over 1500 legal world wide forms which would be very exhausting to enter them manually – hazen23 Aug 05 '22 at 15:47
  • would you copy-paste the list of legal forms here, so that I can format them into the code – blackraven Aug 06 '22 at 03:16
  • Unfortunately I can not insert so many words here, so I take only an excerpt and can then apply it to the whole list itself. – hazen23 Aug 06 '22 at 11:32
  • 'gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab', 'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc', 'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd', 'og', 'gen' – hazen23 Aug 06 '22 at 11:33
  • sure.. I've edited the answer again. Please note that white space is `\s` in regex, and these legal forms are separated by `|` which means 'or' – blackraven Aug 06 '22 at 14:14
  • hi @hazen23 is the code able to resolve your question? – blackraven Aug 11 '22 at 07:53
  • Hi @perpetualstudent, I can work with that. I was hoping that I wouldn't have to manually type out the legal forms. Nevertheless it works - Thank you very much! – hazen23 Aug 11 '22 at 07:58
  • do you have the list of legal forms in pdf or text format? I can help you create the python list using some automation – blackraven Aug 11 '22 at 09:07
  • I can help if you allow. But if you're trying it yourself, you could try text replace by Control+H in a text editor – blackraven Aug 11 '22 at 11:33
  • you could share the list via dropbox or googleDrive, etc – blackraven Aug 12 '22 at 19:51