I have a dataframe that contains a user id and tags related to that user. What is the best way to filter out only those rows whose tags contain any one of the items in this list?

data_science = ['python', 'r', 'matlab', 'sas', 'excel', 'sql']

I have tried the code below in pandas. It does filter to a certain extent, but it also returns rows whose tags merely resemble an entry in the list: for example, for sql it also matches sql-server. Could you suggest a better approach?

df_ds = df_combo[df_combo["Tag"].astype(str).str.contains('(python|excel|sql|matlab)', regex=True)]
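The reason sql also matches sql-server is that str.contains looks for the pattern anywhere inside the string. Below is a minimal sketch (the df_combo sample data is made up, and it assumes the Tag column holds comma-separated tag strings) showing how anchoring each alternative to the start/end of the string or to the comma separators restricts the match to whole tags:

```python
import pandas as pd

# hypothetical sample data: each row's Tag is a comma-separated string
df_combo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "Tag": ["python, pandas", "sql-server, c#", "sql, excel"],
})

# a plain alternation matches substrings, so 'sql' also hits 'sql-server'
loose = df_combo["Tag"].astype(str).str.contains(
    r"(?:python|excel|sql|matlab)", regex=True
)

# anchoring each alternative on the string boundaries or the comma
# separators only matches whole tags, so 'sql-server' is excluded
strict = df_combo["Tag"].astype(str).str.contains(
    r"(?:^|,)\s*(?:python|excel|sql|matlab)\s*(?=,|$)", regex=True
)

df_ds = df_combo[strict]
print(df_ds)  # keeps the 'python, pandas' and 'sql, excel' rows only
```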
Deb
  • Please don't add links to pictures of your data: provide a sample (a minimal reproducible example). For example, `data = df_combo.head().to_dict(); df_combo = pd.DataFrame(data)`. You would need to replace `df_combo.head().to_dict()` with the actual values it returns. – Alexander Nov 13 '20 at 03:20
  • Thank you very much! Will share a sample next time. – Deb Nov 13 '20 at 03:52

1 Answer

I think a possibly simpler, though somewhat more verbose, approach is this:

# the set of tags we want to query
tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}
# auxiliary column: split each row's Tag string on commas and strip
# whitespace, turning it into a set of individual tag names
df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(lambda x: {e.strip() for e in x})
# keep only the rows whose tag set intersects the queried tags
df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]

The idea is to clean all the strings using split and strip, and then keep only those rows where the intersection with the queried tags has at least one element.
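For completeness, here is a minimal, self-contained usage sketch of this approach; the df_combo sample data is made up and assumes Tag holds comma-separated strings:

```python
import pandas as pd

# hypothetical sample data
df_combo = pd.DataFrame({
    "user_id": [101, 102, 103],
    "Tag": ["python, pandas", "sql-server, c#", "r, excel"],
})

tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}

# split each row's tags into a set of stripped tag names
df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(
    lambda x: {e.strip() for e in x}
)

# keep rows whose tag set shares at least one element with `tags`
df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]
print(df_fd)
# the 'sql-server, c#' row is dropped because only exact tag names are compared
```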

Mateo Torres