I have a dataframe that contains a user id and tags related to that user. What is the best way to filter out only those rows whose tags contain any one of the items in this list?

data_science = ['python', 'r', 'matlab', 'sas', 'excel', 'sql']

I have tried the code below in pandas. It does filter to a certain extent, but it also returns rows whose tags merely resemble an entry in the list: for example, for sql it also matches sql-server. Could you suggest a better approach?

df_ds = df_combo[df_combo["Tag"].astype(str).str.contains('(python|excel|sql|matlab)', regex=True)]
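The reason sql also matches sql-server is that str.contains looks for the pattern anywhere inside the string. Below is a minimal sketch (the df_combo sample data is made up, and it assumes the Tag column holds comma-separated tag strings) showing how anchoring each alternative to the start/end of the string or to the comma separators restricts the match to whole tags:

```python
import pandas as pd

# hypothetical sample data: each row's Tag is a comma-separated string
df_combo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "Tag": ["python, pandas", "sql-server, c#", "sql, excel"],
})

# a plain alternation matches substrings, so 'sql' also hits 'sql-server'
loose = df_combo["Tag"].astype(str).str.contains(
    r"(?:python|excel|sql|matlab)", regex=True
)

# anchoring each alternative on the string boundaries or the comma
# separators only matches whole tags, so 'sql-server' is excluded
strict = df_combo["Tag"].astype(str).str.contains(
    r"(?:^|,)\s*(?:python|excel|sql|matlab)\s*(?=,|$)", regex=True
)

df_ds = df_combo[strict]
print(df_ds)  # keeps the 'python, pandas' and 'sql, excel' rows only
```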
Deb
  • Please don't add links to pictures of your data: provide a sample (a minimal reproducible example). For example, `data = df_combo.head().to_dict(); df_combo = pd.DataFrame(data)`. You would need to replace `df_combo.head().to_dict()` with the actual values it returns. – Alexander Nov 13 '20 at 03:20
  • Thank you very much! Will share a sample next time. – Deb Nov 13 '20 at 03:52

1 Answer

I think a possibly simpler, though somewhat more verbose, approach is this:

# the set of tags we want to query
tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}
# auxiliary column: split each row's Tag string on commas and strip
# whitespace, turning it into a set of individual tag names
df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(lambda x: {e.strip() for e in x})
# keep only the rows whose tag set intersects the queried tags
df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]

The idea is to clean all the strings using split and strip, and then keep only those rows where the intersection with the queried tags has at least one element.
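For completeness, here is a minimal, self-contained usage sketch of this approach; the df_combo sample data is made up and assumes Tag holds comma-separated strings:

```python
import pandas as pd

# hypothetical sample data
df_combo = pd.DataFrame({
    "user_id": [101, 102, 103],
    "Tag": ["python, pandas", "sql-server, c#", "r, excel"],
})

tags = {'python', 'r', 'matlab', 'sas', 'excel', 'sql'}

# split each row's tags into a set of stripped tag names
df_combo['Tag-set'] = df_combo['Tag'].str.split(',').apply(
    lambda x: {e.strip() for e in x}
)

# keep rows whose tag set shares at least one element with `tags`
df_fd = df_combo[df_combo['Tag-set'].apply(lambda x: len(x & tags) > 0)]
print(df_fd)
# the 'sql-server, c#' row is dropped because only exact tag names are compared
```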

Mateo Torres