I have a matching problem that I've tried to solve, but have not found a way to do so. I'm new to python, so there might well be simple methods for doing this. I've searched the questions, but haven't found anything that quite gets what I need.
PURPOSE
I need to identify a "type" of person (to be defined below), and if they belong to one of a list of groups, note the group. The result should be a single row in a dataframe that contains an observation of the person's name, their type, and their affiliation. The data all exist as strings, and there are multiple parts of the strings I will have to parse out.
DATA
My data currently look like this:
lastname person1 person2 person3
smith matt smith, company a tom jones, company b karen cook, company c
jones matt smith, company a tom jones, company b karen cook, company c
cook matt smith, company a tom jones, company b karen cook, company c
Also, I have a list of companies
CompanyName
company a
company b
company d
The steps are, roughly (in pseudocod-ish), for every row in the dataframe
If lastname is in person1:
if person1 company is in list of CompanyNames:
df['persontype'] = 1 #where 1 indicates being in a listed company
df['affiliation'] = company name #the matched item from CompanyName list
else if lastname is in person2:
if person2 company is in list of CompanyNames:
df['personatype'] = 1
df['affiliation'] = company name
else if lastname is in person3:
if person3 company name is in list of CompanyNames:
df['persontype'] = 1
df['affiliation'] = company name
else:
df['persontype'] = 0 #indicating a person in an unlisted company
df['affiliation'] = 'Unlisted'
I know that's terrible to read, but I'm just starting out. (All comments welcome.)
The issues are:
- The lastname is short, so the match to the list of persons can be a simple
is 'a' in 'b'
, but I expect the company names within the person1...N strings to have typos or idiosyncrasies (think "C-SPAN" and "CSPAN" as possible versions of one name). - The number of columns for persons is static. That is, it is fixed (though many entries will be empty, so that while the number of columns is fixed at 100, some rows will only have 1 or 2 entries).
- The formatting of the person1...N strings is not consistent. That is, there may be multiple commas, periods, hyphens, etc. Thus, splitting on a comma a getting everything after it won't always give the company name.
UPDATE
Thanks to great questions, I realize I've not been detailed enough on the format of the data. The entries under the person1..N
columns are highly variable. Examples of actual data from those columns below:
person1
Hill, Daniel O., Assistant Administrator for Technology, U.S. Small Business Administration
person2
R&D Credit Coalition, R. Randall Capps
This is why I didn't try using .split()
methods right away. The order of the tokens are not uniform, some have a position/job as well as the name of the organization, and the names are not always in "phone book" order.
DESIRED RESULT
What I would like to end with is a dataframe with this (in addition to all the other columns)
lastname ... persontype affiliation
smith 1 company a
jones 1 company b
cook 0 None
Thanks a TON for any pointers, guidance, etc. on this. I've been working with the fuzzywuzzy module, but haven't had any success.
NOTE: I'm not providing sample code here because the failure isn't a single line or function. Anything I provide would have too many ancillary problems to be worth your time to look at.