Updating dataframe value based on list

Question

I have a dataframe, based on the strings in a column named "originator" I would like to check if the string has a word that resides in another list. If the string has a word that resides in the said list, update column originator_prediction to "org".

Is there a better way to do this? I did it the following way but its slow.

for row in df['ORIGINATOR'][1:]:
    string = str(row)
    splits = string.split()
    for word in splits:
        if word in COMMON_ORG_UNIGRAMS_LIST:
            df['ORGINATOR_PREDICTION'] = 'Org'
        else:
            continue


df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

Concretely, if you look at row 2 in our dataframe "APPLE INC" should have an originator_prediction = 'ORG' not person.

The reason being, we looped through our common org unigrams list and the word INC was in there.

I would say yes, can you post some sample data and expected output? — Scott Boston, Sep 28 '20 at 19:31

callmeanythingyouwant · Answer 1 · 2020-09-29T03:00:36.323

3

Your code won't give the correct result because after every check, with df['ORGINATOR_PREDICTION'] = 'Org', you are assigning all the rows in that column that value. That will result in all the rows within that column to have the value Org. Also, I don't get why you have added [1:] in the loop. It does not pick the column name if that's what you were trying to avoid. I have made some changes to your code, it works as desired

org_or_person_list = []
for row in df['ORIGINATOR']:
    splits = row.split()
    org_or_person_list.append('Org' if set(splits) & set(COMMON_ORG_UNIGRAMS_LIST) else 'Person')

df['ORGINATOR_PREDICTION'] = org_or_person_list

Output:

    ORIGINATOR  ORGINATOR_PREDICTION
0   JOHN DOE    Person
1   APPLE INC   Org
2   MIKE LOWRY  Person

edited Sep 29 '20 at 03:00

answered Sep 28 '20 at 19:54

callmeanythingyouwant

1,789
4
15
40

Thank you for modifying my original code, is there a more efficient way to write this? It takes a long time to run. I have roughly 800k rows for context. – mikelowry Sep 28 '20 at 20:03
I have edited it. It's more efficient now. Also, I have removed `string = str(row)` as each row was already a string but you can add it back if other rows are of a different datatype – callmeanythingyouwant Sep 28 '20 at 20:45
While i think this works, and is much more efficient, i think i need to go back to the drawing board and figure out how to run this through 800k rows. It never finishes... I'm going to try string splitting a head of time and creating a column with all the string splits. Maybe use some numpy arrays to speed up the process.. – mikelowry Sep 28 '20 at 21:47
I could help if you can provide me with the data. A tip I can give is that if the org unigrams (INC, LLP, etc) are always at the end, then you only need to read the last word of `splits` and discard the rest. – callmeanythingyouwant Sep 29 '20 at 02:51
@mikelowry Hi. I have edited my code further. The repeated assignment to each row of dataframe is a time intensive process. I appended the results to a list and assigned it to the column all in one go. (I haven't used the `splits` advice from my previous comment.) – callmeanythingyouwant Sep 29 '20 at 03:12

Scott Boston · Answer 2 · 2020-09-28T19:59:55.317

2

Try this, using the .str, string accessor, with the contains method. We can create a regex using join for the list of strings:

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)), 'ORGINATOR_PREDICTION'] = 'Org'

Output:

   ORIGINATOR ORGINATOR_PREDICTION
0    JOHN DOE               Person
1   APPLE INC                  Org
2  MIKE LOWRY               Person

Full code:

df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)),'ORGINATOR_PREDICTION'] = 'Org'

print(df)

edited Sep 28 '20 at 19:59

answered Sep 28 '20 at 19:51

Scott Boston

147,308
15
139
187

getting the following error: Series' objects are mutable, thus they cannot be hashed – mikelowry Sep 28 '20 at 19:55
Uh.. is COMMON_ORN_UNIGRAMS_LIST a series? – Scott Boston Sep 28 '20 at 19:57
COMMON_ORG_UNIGRAMS_LIST is a list - df['ORIGINATOR'] is an object – mikelowry Sep 28 '20 at 20:00
yes, well i run out of memory before the operation can finish. – mikelowry Sep 28 '20 at 20:04

score 0 · Answer 3 · answered Sep 28 '20 at 20:11

0

Alternative solution:

df  = pd.DataFrame({
    'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
    'ORIGINATOR_PREDICTION': ['Person', 'Person','Person']
})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

df.loc[df['ORIGINATOR'].apply(lambda x: len(set(x.split(' ')) & set(COMMON_ORG_UNIGRAMS_LIST)) > 0), 'ORIGINATOR_PREDICTION'] = 'Org'

answered Sep 28 '20 at 20:11

Alexandra Dudkina

4,302
3
15
27

getting an IndexingError:Name: ORIGINATOR, Length: 7601, dtype: bool, 'ORIGINATOR_PREDICTION') – mikelowry Sep 28 '20 at 20:13

Updating dataframe value based on list

3 Answers3