I tried to ask this here, but I oversimplified it too much.
I have a list of 60 unique text items that vary in length and content. A text column in my dataframe contains those items along with extra text. I would like to add a new column holding the matched list item while preserving the original column, and add a row each time more than one item from the list is found in the same row.
import pandas as pd
pd.set_option('display.max_colwidth', None)
my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend', 'royal blue lugnuts']
# in actuality this list contains 60 different items, anywhere from the color red, to a sentence with 80 characters
# this is an example of how each row can sometimes contain multiple items from the list, but not always
# it is important to capture the multiples, and that all of the original rows are maintained, but to also isolate the items from the original list
# all of the data are strings
df = pd.DataFrame({'col1':['left side alabama 500 on the right side carthouse near the royal blue lugnuts', '1st entry is at beta 15', 'this one takes a mix of d320 blend and beta 15']})
In [1]: df
Out [1]:
col1
0 left side alabama 500 on the right side carthouse near the royal blue lugnuts
1 1st entry is at beta 15
2 this one takes a mix of d320 blend and beta 15
goal:
col1 col2
0 left side alabama 500 on the right side carthouse near the royal blue lugnuts alabama 500
0 left side alabama 500 on the right side carthouse near the royal blue lugnuts carthouse
0 left side alabama 500 on the right side carthouse near the royal blue lugnuts royal blue lugnuts
1 1st entry is at beta 15 beta 15
2 this one takes a mix of d320 blend and beta 15 d320 blend
2 this one takes a mix of d320 blend and beta 15 beta 15
I tried writing a function for this a few different ways, but it does not appear to be as simple as a string extract.
There is an answer here that is very close:
Extract string from a dataframe comparing to a list
So far, though, I don't see how it handles multiple matches in one row. I tried changing expand to True and also looked at extractall, but it doesn't seem to behave the same way.
tried this:
df['col2'] = df['col1'].str.extract("(" + "|".join(my_list) +")")
#changed expand to true and false with no change in behavior
col1 col2
0 left side alabama 500 on the right side carthouse near the royal blue lugnuts alabama 500
1 1st entry is at beta 15 beta 15
2 this one takes a mix of d320 blend and beta 15 d320 blend
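As far as I can tell, the output above is expected, since str.extract only returns the first match in each row. A minimal sketch of one way to capture every match, assuming str.findall followed by DataFrame.explode is acceptable here. Escaping the items with re.escape and sorting them longest-first are my own assumptions, added in case an item contains regex metacharacters or is a prefix of another item; the name `exploded` is only for illustration.

```python
import re
import pandas as pd

my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend', 'royal blue lugnuts']
df = pd.DataFrame({'col1': [
    'left side alabama 500 on the right side carthouse near the royal blue lugnuts',
    '1st entry is at beta 15',
    'this one takes a mix of d320 blend and beta 15',
]})

# Assumption: escape each item and try longer items first, so that regex
# metacharacters or overlapping prefixes in the real 60-item list do not
# change what gets matched.
pattern = "|".join(re.escape(item) for item in sorted(my_list, key=len, reverse=True))

# findall returns a list of every match per row; explode then gives each
# list element its own row and repeats the original index and col1 value.
exploded = df.assign(col2=df['col1'].str.findall(pattern)).explode('col2')
print(exploded)
```

Rows with no match at all would end up with NaN in col2 rather than being dropped, which seems to fit the requirement that all original rows are maintained.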
trying extractall
df['col2'] = df['col1'].str.extractall("(" + "|".join(my_list) +")")
#gives this error
TypeError: incompatible index of inserted column with frame index
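My understanding of the error is that extractall returns a DataFrame indexed by a MultiIndex of (original row, match number), so it cannot be assigned directly as a column of df. A hedged sketch of a workaround, assuming it is fine to drop the match level and join the result back on the original index. The names `matches`, `col2`, and `result` are hypothetical, and the escaping and longest-first sorting are my own additions, not part of the original attempt.

```python
import re
import pandas as pd

my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend', 'royal blue lugnuts']
df = pd.DataFrame({'col1': [
    'left side alabama 500 on the right side carthouse near the royal blue lugnuts',
    '1st entry is at beta 15',
    'this one takes a mix of d320 blend and beta 15',
]})

# Assumption: escape items and order them longest-first before joining with "|".
pattern = "|".join(re.escape(item) for item in sorted(my_list, key=len, reverse=True))

# extractall returns one row per match, indexed by (original row, match number).
matches = df['col1'].str.extractall("(" + pattern + ")")

# Drop the match-number level so only the original row label remains,
# then name the Series so it can be joined as a column.
col2 = matches[0].droplevel('match').rename('col2')

# A left join on the index repeats each original row once per match.
result = df.join(col2)
print(result)
```

Because the join is a left join, a row of df with no match would still appear once, with NaN in col2.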