I am currently trying to do some natural language processing for company names.
First, I wrote the regex -\s+\w+('\w+|\s+\w) to remove all the text after a hyphen when the hyphen is followed by whitespace.
Next, I remove all punctuation with [.,/#!$%\^&*;:{}=-_`''"<>|~()]
Third, I remove the company suffix with (Reg|Ltd|PLC|NV|LTD|LLC|INC|LLP|US)
Lastly, some names have carriage returns at the start and at the end of the string; I resolve this with "\r*\n*
I would like to put all of these regex pieces together as I am running this in Alteryx & Python.
Please note: some company names contain a hyphen with no whitespace after it. I need to keep these hyphens and make sure the punctuation removal does not strip them.
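One thing that may matter for this requirement: inside a character class, `=-_` is read as a range from `=` to `_`, and in ASCII that range includes the uppercase letters A to Z. The sketch below uses Python's `re` module with a deliberately tiny class to show the effect; the variable names are only illustrative.

```python
import re

# "=-_" inside [...] is a range from "=" (0x3D) to "_" (0x5F),
# which spans the uppercase letters A-Z. The hyphen itself is NOT in the range.
range_class = re.compile(r"[=-_]")

# Listing "=" and "_" with no hyphen between them matches only those two characters.
literal_class = re.compile(r"[=_]")

stripped_by_range = range_class.sub("", "AUTO-CHLOR")
stripped_by_literal = literal_class.sub("", "AUTO-CHLOR")

print(repr(stripped_by_range))    # only the hyphen survives
print(repr(stripped_by_literal))  # the name is untouched
```

If the hyphen is meant to be kept, leaving it out of the punctuation class entirely (rather than placing it between two other characters) avoids both problems.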
How can I combine all of these pieces? And am I going about this correctly? In the end, after the string clean-up, I will be joining this data to another client list to pull back specific information.
This is why all front-ends should NEVER contain a free text field especially for companies.
How do I go about combining these into one pattern, or is it better practice to separate each pattern?
Before
MY COMPANY X,Y,Z, TENNESSEE CORPORATION L.L.C.
MY COMPANY HOLDINGS, LP. (there is a carriage return after the LP.)
ABN FGDF - NEW YORK - UNITED STATES
COLLEGE-INRIA
ABCDE - UNITED STATES
MANAGEMENT MANAGERS - UNITED STATES
INVESTMENT MANAGEMENT CORPORATION - CANADA
AUTO-CHLOR
After
MY COMPANY XYZ TENNESSEE CORPORATION
MY COMPANY HOLDINGS
ABN FGDF
COLLEGE-INRIA
ABCDE
MANAGEMENT MANAGERS
INVESTMENT MANAGEMENT CORPORATION
AUTO-CHLOR
Note that COLLEGE-INRIA stayed as it was because there is no whitespace between the hyphen and the next character.
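For reference, here is a minimal Python sketch of the clean-up as a sequence of separate substitutions rather than one combined pattern, checked against the Before/After rows above. Several things in it are assumptions and not part of the patterns quoted earlier: the function name `clean_company_name` is hypothetical, the hyphen pattern is simplified to "hyphen followed by whitespace, then the rest of the string", the punctuation class leaves the hyphen out, `LP` is added to the suffix list so that the second row comes out as expected, and the suffix is matched case-insensitively and only at the end of the name.

```python
import re

# Step 1: a hyphen followed by whitespace, plus everything after it.
# "COLLEGE-INRIA" does not match because no whitespace follows the hyphen.
HYPHEN_TAIL = re.compile(r"\s*-\s+.*$")

# Step 2: punctuation to strip. The hyphen is deliberately absent (assumption).
PUNCTUATION = re.compile(r"""[.,/#!$%^&*;:{}=_`'"<>|~()]""")

# Step 3: a trailing company suffix. "LP" is an addition to the original list.
SUFFIX = re.compile(r"\s+(?:REG|LTD|PLC|NV|LLC|INC|LLP|LP|US)$", re.IGNORECASE)


def clean_company_name(name: str) -> str:
    """Apply the clean-up steps in a fixed order and return the result."""
    name = name.strip()                # leading/trailing \r, \n and spaces
    name = HYPHEN_TAIL.sub("", name)   # drop " - UNITED STATES" style tails
    name = PUNCTUATION.sub("", name)   # "L.L.C." becomes "LLC" here
    name = SUFFIX.sub("", name)        # so the suffix step can now see "LLC"
    return re.sub(r"\s+", " ", name).strip()


samples = [
    "MY COMPANY X,Y,Z, TENNESSEE CORPORATION L.L.C.",
    "MY COMPANY HOLDINGS, LP.\r\n",
    "ABN FGDF - NEW YORK - UNITED STATES",
    "COLLEGE-INRIA",
    "AUTO-CHLOR",
]
for raw in samples:
    print(repr(clean_company_name(raw)))
```

The order matters in this sketch: the hyphen tail is removed before punctuation, and punctuation before the suffix, so that dotted forms such as L.L.C. are reduced to LLC first. Keeping the steps separate also makes it easier to see which rule changed a given name when a row fails to join later.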