6

I have a dataframe, based on the strings in a column named "originator" I would like to check if the string has a word that resides in another list. If the string has a word that resides in the said list, update column originator_prediction to "org".

Is there a better way to do this? I did it the following way but its slow.

for row in df['ORIGINATOR'][1:]:
    string = str(row)
    splits = string.split()
    for word in splits:
        if word in COMMON_ORG_UNIGRAMS_LIST:
            df['ORGINATOR_PREDICTION'] = 'Org'
        else:
            continue

df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

Concretely, if you look at row 2 in our dataframe "APPLE INC" should have an originator_prediction = 'ORG' not person.

The reason being, we looped through our common org unigrams list and the word INC was in there.

mikelowry
  • 1,307
  • 4
  • 21
  • 43

3 Answers3

3

Your code won't give the correct result because after every check, with df['ORGINATOR_PREDICTION'] = 'Org', you are assigning all the rows in that column that value. That will result in all the rows within that column to have the value Org. Also, I don't get why you have added [1:] in the loop. It does not pick the column name if that's what you were trying to avoid. I have made some changes to your code, it works as desired

org_or_person_list = []
for row in df['ORIGINATOR']:
    splits = row.split()
    org_or_person_list.append('Org' if set(splits) & set(COMMON_ORG_UNIGRAMS_LIST) else 'Person')

df['ORGINATOR_PREDICTION'] = org_or_person_list

Output:

    ORIGINATOR  ORGINATOR_PREDICTION
0   JOHN DOE    Person
1   APPLE INC   Org
2   MIKE LOWRY  Person
callmeanythingyouwant
  • 1,789
  • 4
  • 15
  • 40
  • Thank you for modifying my original code, is there a more efficient way to write this? It takes a long time to run. I have roughly 800k rows for context. – mikelowry Sep 28 '20 at 20:03
  • I have edited it. It's more efficient now. Also, I have removed `string = str(row)` as each row was already a string but you can add it back if other rows are of a different datatype – callmeanythingyouwant Sep 28 '20 at 20:45
  • While i think this works, and is much more efficient, i think i need to go back to the drawing board and figure out how to run this through 800k rows. It never finishes... I'm going to try string splitting a head of time and creating a column with all the string splits. Maybe use some numpy arrays to speed up the process.. – mikelowry Sep 28 '20 at 21:47
  • I could help if you can provide me with the data. A tip I can give is that if the org unigrams (INC, LLP, etc) are always at the end, then you only need to read the last word of `splits` and discard the rest. – callmeanythingyouwant Sep 29 '20 at 02:51
  • @mikelowry Hi. I have edited my code further. The repeated assignment to each row of dataframe is a time intensive process. I appended the results to a list and assigned it to the column all in one go. (I haven't used the `splits` advice from my previous comment.) – callmeanythingyouwant Sep 29 '20 at 03:12
2

Try this, using the .str, string accessor, with the contains method. We can create a regex using join for the list of strings:

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)), 'ORGINATOR_PREDICTION'] = 'Org'

Output:

   ORIGINATOR ORGINATOR_PREDICTION
0    JOHN DOE               Person
1   APPLE INC                  Org
2  MIKE LOWRY               Person

Full code:

df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)),'ORGINATOR_PREDICTION'] = 'Org'

print(df)
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
0

Alternative solution:

df  = pd.DataFrame({
    'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
    'ORIGINATOR_PREDICTION': ['Person', 'Person','Person']
})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

df.loc[df['ORIGINATOR'].apply(lambda x: len(set(x.split(' ')) & set(COMMON_ORG_UNIGRAMS_LIST)) > 0), 'ORIGINATOR_PREDICTION'] = 'Org'
    
Alexandra Dudkina
  • 4,302
  • 3
  • 15
  • 27