0

I have a matching problem that I've tried to solve, but have not found a way to do so. I'm new to python, so there might well be simple methods for doing this. I've searched the questions, but haven't found anything that quite gets what I need.

PURPOSE

I need to identify a "type" of person (to be defined below), and if they belong to one of a list of groups, note the group. The result should be a single row in a dataframe that contains an observation of the person's name, their type, and their affiliation. The data all exist as strings, and there are multiple parts of the strings I will have to parse out.

DATA

My data currently look like this:

lastname       person1                person2                person3
smith          matt smith, company a  tom jones, company b   karen cook, company c
jones          matt smith, company a  tom jones, company b   karen cook, company c
cook           matt smith, company a  tom jones, company b   karen cook, company c

Also, I have a list of companies

CompanyName
company a
company b
company d

The steps are, roughly (in pseudocod-ish), for every row in the dataframe

If lastname is in person1:
   if person1 company is in list of CompanyNames:
       df['persontype'] = 1 #where 1 indicates being in a listed company
       df['affiliation'] = company name #the matched item from CompanyName list
   else if lastname is in person2: 
       if person2 company is in list of CompanyNames: 
            df['personatype'] = 1
            df['affiliation'] = company name
       else if lastname is in person3: 
            if person3 company name is in list of CompanyNames: 
                df['persontype'] = 1
                df['affiliation'] = company name
            else: 
                df['persontype'] = 0 #indicating a person in an unlisted company
                df['affiliation'] = 'Unlisted'

I know that's terrible to read, but I'm just starting out. (All comments welcome.)

The issues are:

  1. The lastname is short, so the match to the list of persons can be a simple is 'a' in 'b', but I expect the company names within the person1...N strings to have typos or idiosyncrasies (think "C-SPAN" and "CSPAN" as possible versions of one name).
  2. The number of columns for persons is static. That is, it is fixed (though many entries will be empty, so that while the number of columns is fixed at 100, some rows will only have 1 or 2 entries).
  3. The formatting of the person1...N strings is not consistent. That is, there may be multiple commas, periods, hyphens, etc. Thus, splitting on a comma a getting everything after it won't always give the company name.

UPDATE

Thanks to great questions, I realize I've not been detailed enough on the format of the data. The entries under the person1..N columns are highly variable. Examples of actual data from those columns below:

person1
Hill, Daniel O., Assistant Administrator for Technology, U.S. Small Business Administration
person2
R&D Credit Coalition, R. Randall Capps

This is why I didn't try using .split() methods right away. The order of the tokens are not uniform, some have a position/job as well as the name of the organization, and the names are not always in "phone book" order.

DESIRED RESULT

What I would like to end with is a dataframe with this (in addition to all the other columns)

lastname  ...   persontype     affiliation 
smith           1              company a
jones           1              company b
cook            0              None

Thanks a TON for any pointers, guidance, etc. on this. I've been working with the fuzzywuzzy module, but haven't had any success.

NOTE: I'm not providing sample code here because the failure isn't a single line or function. Anything I provide would have too many ancillary problems to be worth your time to look at.

Savage Henry
  • 1,990
  • 3
  • 21
  • 29
  • Are the person1-3 columns always the same for every row? This is an extremely inefficient way to store the data – JohnE Sep 16 '15 at 17:59
  • Thanks for the question. No, they're not always the same. There are repeats, but that's because the data covers a number of years where the groups change membership. – Savage Henry Sep 16 '15 at 18:26
  • OK, then I think it would be better if the sample data showed those cases. – JohnE Sep 16 '15 at 19:31

1 Answers1

0

Data format

The data format provided isn't clear about association or delimiters, so I am going to restate what I think you meant. (in a python syntactic structure)

Your data:

lastname       person1                person2                person3
smith          matt smith, company a  tom jones, company b   karen cook, company c
jones          matt smith, company a  tom jones, company b   karen cook, company c
cook           matt smith, company a  tom jones, company b   karen cook, company c

Also, I have a list of companies

CompanyName
company a
company b
company d

What I think it means:

lastname = ['smith','jones','cook']
companies = ['company a','company b']
affiliations = {'matt smith':'company a','tom jones':'company b','karen cook':'company c'}

How I think it should be formatted

lastname = ['smith','jones','cook']
companies = ['company a','company b']
names = {'smith':'matt smith','jones':'tom jones','cook':'karen cook'}
affiliations = {'matt smith':'company a','tom jones':'company b','karen cook':'company c'}

Desired result

The idea is to provide the following output:

lastname  ...   persontype     affiliation 
smith           1              company a
jones           1              company b
cook            0              None

How to get the desired result

Note: This is using the format that I think would be best.

print('lastname\tpersontype\taffiliation\n')
for last in lastname:
    if last in names:
        full = names[last]
        if full in affiliations:
            comp = affiliations[full]
            if comp in companies:
                print(full+'\t1\t'+comp)
            else:
                print(full+'\t0\tNone')
        else:
            print(full+' is not listed in affiliations.')
    else:
        print(last+' has no matches in names.')
    print('\n')

I don't know if you'll be able to create the data format that I think you should use. Though if your data is in a plain text format with the shape you provide, it should be fairly straightforward to read the file and process the lines.

If you want to print it back out to a file, that would be doable as well, study the open() and write() built in functions.

I hope this is helpful. Enjoy!

Alea Kootz
  • 913
  • 4
  • 11
  • Thanks a TON for this; your clarification of the format is hugely helpful. The only thing I think I failed to communicate that is different, perhaps, from what you note is that the entire entry under `person1...N` is a string. And it's a string that might not be formatted the same way each time (could be "matt smith director company a" or "matt smith, dir, company a" or other, but always with "company a"), which is why I was looking into fuzzy matching. So I'm not sure how to get that into the dict format without using the list of companies to search the string. Thanks again! – Savage Henry Sep 16 '15 at 19:08
  • 1
    split() will return a list of words from a string (removes punctuation). Use like: `strings = 'this, is. a string'` then `words = strings.split()`. You can then use list comprehension or some other technique to extract the information to populate your lists and dictionaries. – Alea Kootz Sep 16 '15 at 19:15
  • For reading from the file, look at the `readlines()` function. It will provide easy to handle strings for each row of the file. Then you could use `split()`to break that row into columns. – Alea Kootz Sep 16 '15 at 19:31
  • Understood. The issue, at least for a newbie like me, is that the order of items after the spit won't be uniform. For example, here are two actual entries: `Person1: Cove, Thomas J., Vice President of Government Relations, Sporting Goods Manufacturers Association` `Person2: R&D Credit Coalition, R. Randall Capps` You can see the issue: the company name is not always second, the person is not always listed last-name first, etc. Sorry this wasn't in the original question. Figured the answers would stick to fuzzy matching. Thanks again! – Savage Henry Sep 16 '15 at 20:12
  • So those strings are lines within the file? And there are no dedicated delimiters (commas tabs or something) that separate names and companies? – Alea Kootz Sep 16 '15 at 20:58
  • I have it as a dataframe, so that `df.loc[i, 'person1'] == Cove, Thomas J., Vice President of Government Relations, Sporting Goods Manufacturers Association` and `df.loc[i, 'person2'] == R&D Credit Coalition, R. Randall Capps`. Those entries have commas, but not all of them do. That is, it could be just as likely that `df.loc[i, 'person1'] == Cove Thomas J Vice President of Government Relations Sporting Goods Manufacturers Association`. The only thing that is definitely common is that the order of the tokens for the person's job are common. – Savage Henry Sep 17 '15 at 10:50
  • That's a challenging problem then, without a predictable data structure, writing a regex or a set of regex's to sort out the names and companies reliably will be quite complex. – Alea Kootz Sep 17 '15 at 15:17
  • 1
    @SavageHenry you are going to be better off if you just make the sample data more realistic. This is a huge comment section on account of the fact that your sample data is way simpler than your actual data. It's much better to just put this in the question itself rather than in a bunch of followup comments or just hoping people can read your mind. Nobody can possibly guess what your data looks like -- you have to explicitly show them. – JohnE Sep 18 '15 at 02:04
  • Yup. Bad etiquette. Closed it out. – Savage Henry Sep 18 '15 at 02:05
  • @SavageHenry It's not bad etiquette, I'm just trying to help you generate a good answer. I don't think you need to close it out, you still have 5 days and have invested 100 points (non-refundable). It's your choice, but I'd just spend a little time on the sample data and wait a little longer for an answer. You have elaborated in the comments, which is OK, it's just much easier to read if you integrate the comments into the original question. – JohnE Sep 18 '15 at 02:07