I have a list of 60 complicated items, i have a dataframe with a text column that I would like to extract all the items from the list

Question

I tried to ask this here and I oversimplified it too greatly

I have a list of 60 unique text items, varying in length and what they contain, I have a text column in a dataframe that has that information plus extra information, I would like to add a new column preserving the original and adding rows when duplicates from the list are encountered in the dataframe

pd.set_option('display.max_colwidth', None)
my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend', 'royal blue lugnuts']
# in actuality this list contains 60 different items, anywhere from the color red, to a sentence with 80 characters

# this is an example of how each row can sometimes contain multiple items from the list, but not always
# it is important to capture the multiples, and that all of the original rows are maintained, but to also isolate the items from the original list
# all of the data are strings
df = pd.DataFrame({'col1':['left side alabama 500 on the right side carthouse near the royal blue lugnuts', '1st entry is at beta 15', 'this one takes a mix of d320 blend and beta 15']})

In [1]: df

Out [1]: 

    col1
0   left side alabama 500 on the right side carthouse near the royal blue lugnuts
1   1st entry is at beta 15
2   this one takes a mix of d320 blend and beta 15

goal:

    col1                                                                            col2
0   left side alabama 500 on the right side carthouse near the royal blue lugnuts   alabama 500
0   left side alabama 500 on the right side carthouse near the royal blue lugnuts   carthouse
0   left side alabama 500 on the right side carthouse near the royal blue lugnuts   royal blue lugnuts
1   1st entry takes beta 15                                                         beta 15
2   this one takes a mix of d320 blend and south takes beta 15                      d320 blend 
3   this one takes a mix of d320 blend and south takes beta 15                      beta 15

I tried to write a function here, a few different ways, it does not appear to be as simple as a string extract

write a python function that takes a list and one column of a dataframe and adds a new column based on that list

there is an answer here that is very close,

Extract string from a dataframe comparing to a list

but so far I don't see how it takes care of duplicates, I tried changing expand to True, also looked at extractall but it doesn't seem to have the same behavior

tried this:

df['col2'] = df['col1'].str.extract("(" + "|".join(my_list) +")")
#changed expand to true and false with no change in behavior

    col1                                                                            col2
0   left side alabama 500 on the right side carthouse near the royal blue lugnuts   alabama 500
1   1st entry is at beta 15                                                         beta 15
2   this one takes a mix of d320 blend and beta 15                                  d320 blend

trying extractall

df['col2'] = df['col1'].str.extractall("(" + "|".join(my_list) +")")
#gives this error
TypeError: incompatible index of inserted column with frame index

score 2 · Accepted Answer · answered Dec 07 '20 at 21:36

extractall returns a new dataframe with multiindex, first level index shows the row number in the original dataframe.

So, you cannot directly append the result of extractall to the original df. You need to merge or something to combine the result to the original df.

df_ext = df['col1'].str.extractall("(" + "|".join(my_list) +")")
>>> df_ext = df['col1'].str.extractall("(" + "|".join(my_list) +")")
>>> df_ext
                          0
  match
0 0             alabama 500
  1               carthouse
  2      royal blue lugnuts
1 0                 beta 15
2 0              d320 blend
  1                 beta 15

>>> df_ext.merge(df, left_on=df_ext.index.get_level_values(0), right_on=df.index)
   key_0                   0                                               col1
0      0         alabama 500  left side alabama 500 on the right side cartho...
1      0           carthouse  left side alabama 500 on the right side cartho...
2      0  royal blue lugnuts  left side alabama 500 on the right side cartho...
3      1             beta 15                            1st entry is at beta 15
4      2          d320 blend     this one takes a mix of d320 blend and beta 15
5      2             beta 15     this one takes a mix of d320 blend and beta 15

This is not that great and as you can see, you still need to clean extra columns.

=====================================================

Alternatively with pandas >= 0.25.0, you can use findall and explode. This is probably simpler to read.

findall will return Series with matching strings as in a list. Then explode can explode the list into rows.

>>> df['ext'] = df.col1.str.findall(f'({"|".join(my_list)})')
>>> df
                                                col1                                           ext
0  left side alabama 500 on the right side cartho...  [alabama 500, carthouse, royal blue lugnuts]
1                            1st entry is at beta 15                                     [beta 15]
2     this one takes a mix of d320 blend and beta 15                         [d320 blend, beta 15]

>>> df = df.explode('ext')
>>> df
                                                col1                 ext
0  left side alabama 500 on the right side cartho...         alabama 500
0  left side alabama 500 on the right side cartho...           carthouse
0  left side alabama 500 on the right side cartho...  royal blue lugnuts
1                            1st entry is at beta 15             beta 15
2     this one takes a mix of d320 blend and beta 15          d320 blend
2     this one takes a mix of d320 blend and beta 15             beta 15

I believe there are more ways to achieve this but does either of these satisfy your question?

thank you so much, this had some weird behavior on the large dataset that I was able to fix with df['ext'] = df['ext'].str[0], it had extracted the data as a tuple. I was trying to use a simpler example, but the findall and explode was exactly what I needed — robert_553z8, Dec 08 '20 at 14:57

I have a list of 60 complicated items, i have a dataframe with a text column that I would like to extract all the items from the list

1 Answers1