
I have a 73 million row dataset, and I need to filter out rows that match any of a few conditions. I have been doing this with Boolean indexing, but it's taking a really long time (~30mins) and I want to know if I can make it faster (e.g. fancy indexing, np.where, np.compress?)

My code:

clean_df = df[~(df.project_name.isin(p_to_drop) | 
                df.workspace_name.isin(ws_to_drop) | 
                df.campaign_name.str.contains(regex_string,regex=True) | 
                df.campaign_name.isin(small_launches))]

The regex string is

regex_string = '(?i)^.*ARCHIVE.*$|^.*birthday.*$|^.*bundle.*$|^.*Competition followups.*$|^.*consent.*$|^.*DOI.*$|\
                    ^.*experiment.*$|^.*hello.*$|^.*new subscribers.*$|^.*not purchased.*$|^.*parent.*$|\
                    ^.*re engagement.*$|^.*reengagement.*$|^.*re-engagement.*$|^.*resend.*$|^.*Resend of.*$|\
                    ^.*reward.*$|^.*survey.*$|^.*test.*$|^.*thank.*$|^.*welcome.*$'

and the other three conditions are lists of strings with fewer than 50 items.

travelsandbooks
  • Please mention what the value of `regex_string` is in your question, thank you. – RavinderSingh13 May 18 '21 at 11:13
  • I have added it - it's quite extensive – travelsandbooks May 18 '21 at 11:18
  • let me fix it and get back here. – RavinderSingh13 May 18 '21 at 11:19
  • Don't apply the regex filter to the big dataframe; use the other 3 conditions first to make a temporary df, then run the regex condition on that small temp_df, as regex operations are costly. – Nk03 May 18 '21 at 11:25
  • @travelsandbooks, the `.*` here is not required and you can simplify your regex pattern, which could improve speed/performance. Try making a list like: `words = ['ARCHIVE', 'birthday', 'bundle', 'Competition followups', 'consent', 'DOI', 'experiment', 'hello', 'new subscribers', 'not purchased', 'parent', 're engagement', 'reengagement', 're-engagement', 'resend', 'Resend of', 'reward', 'survey', 'test', 'thank', 'welcome']` then make `regex_string = r'(?i)\b(' + '|'.join(words) + r')\b'` and try running the code once; I don't have data that big, so I couldn't test it myself. – RavinderSingh13 May 18 '21 at 11:29
  • @RavinderSingh13 when I try with that amendment to the regex, I get the warning 'This pattern has match groups. To actually get the groups, use str.extract.' – travelsandbooks May 18 '21 at 12:04
  • @travelsandbooks, ok change it to `regex_string = r'(?i)\b(?:' + '|'.join(words) +r')\b'` and let me know how it goes then? – RavinderSingh13 May 18 '21 at 12:06
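
For reference, a minimal runnable sketch of the regex simplification suggested in the comments above; the word list is the one from the comment, and the three-row frame is made up purely for illustration:

import pandas as pd

words = ['ARCHIVE', 'birthday', 'bundle', 'Competition followups', 'consent', 'DOI',
         'experiment', 'hello', 'new subscribers', 'not purchased', 'parent',
         're engagement', 'reengagement', 're-engagement', 'resend', 'Resend of',
         'reward', 'survey', 'test', 'thank', 'welcome']

# Non-capturing group (?:...) avoids the "pattern has match groups" warning
# that str.contains emits for a capturing group
regex_string = r'(?i)\b(?:' + '|'.join(words) + r')\b'

# Made-up three-row frame, only to show the filter in action
sample_df = pd.DataFrame({'campaign_name': ['Spring survey 2021', 'Big launch', 'welcome series']})
clean_df = sample_df[~sample_df.campaign_name.str.contains(regex_string, regex=True)]
print(clean_df)  # keeps only 'Big launch'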

2 Answers


If you have this many rows, I think it will be faster to remove the records one step at a time. Regex matching is typically slow, so use it as the last step, once the data frame is much smaller.

For example:

clean_df = df.loc[~df.project_name.isin(p_to_drop)]
clean_df = clean_df.loc[~clean_df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df.loc[~clean_df.campaign_name.isin(small_launches)]
# regex last, once the frame is already much smaller
clean_df = clean_df.loc[~clean_df.campaign_name.str.contains(regex_string, regex=True)]
Rutger

I had thought that chaining my conditions was a good idea, but the answer about running them consecutively helped me rethink: each Boolean indexing operation makes the dataset smaller, and therefore cheaper to filter in the next step.

I've separated the conditions out, as suggested, and put the operation that gets rid of the most rows at the top, so the next ones run on fewer rows. The regex comes last: because it's expensive, it makes sense to run it on the smallest df possible.
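
A sketch of that ordering, assuming df and the question's variables (p_to_drop, ws_to_drop, small_launches, regex_string) are already defined; the quick counts are just a way to see which condition removes the most rows, and the filter order shown below is hypothetical:

# Quick check of how many rows each cheap condition would remove
for name, mask in [('project', df.project_name.isin(p_to_drop)),
                   ('workspace', df.workspace_name.isin(ws_to_drop)),
                   ('launch', df.campaign_name.isin(small_launches))]:
    print(name, 'would remove', mask.sum(), 'rows')

# Suppose the project filter removes the most rows, then workspace, then launch:
clean_df = df[~df.project_name.isin(p_to_drop)]
clean_df = clean_df[~clean_df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df[~clean_df.campaign_name.isin(small_launches)]
# The regex is the most expensive test, so it runs last, on the smallest frame
clean_df = clean_df[~clean_df.campaign_name.str.contains(regex_string, regex=True)]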

Hope this helps someone! TIL chaining your operations looks good but it's not efficient :)

travelsandbooks
  • Thanks for sharing, could you please do let me know if my regex(given in comments) worked for you? – RavinderSingh13 May 18 '21 at 15:53
  • Hi, I used it but I'm not sure it made a huge difference. It also didn't catch everything - e.g. when the word was at the very start or very end of a string. – travelsandbooks May 18 '21 at 16:45