How to filter out a certain pattern in filter pyspark RDD

Question

I have a PySpark SparkContext (sc) initialized and the following RDD pipeline:

instance = (data
          .filter(lambda x: len(x) != 0)
          # intended SQL-like pattern; a plain substring test does not treat % as a wildcard
          .filter(lambda x: '%auth%login%' not in x)
          .map(lambda x: function(x))
          .reduceByKey(lambda x, y: x + y))

My goal is to filter out any URL that contains both the auth and login keywords, wherever they appear in the string.

In SQL I could use %auth%login%, where % matches a string of any length.

How can I do this easily in PySpark?

Forgot to mention: there are 'auth' pages I do not want to filter out. I only want to filter out a URL containing auth when login is also in the string.

I am not sure why this is flagged as a duplicate; this is about an RDD, not a DataFrame.

score 2 · Accepted Answer · answered Jan 28 '20 at 03:08

With the PySpark RDD filter method, keep a URL when at least one of login or auth is NOT in the string. That excludes exactly the URLs that contain both. In Python:

data.filter(lambda x: any(e not in x for e in ['login', 'auth'])).collect()
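To see which URLs that predicate keeps, here is a minimal sketch that runs without a Spark cluster. The sample URLs and the names `sample_urls`, `KEYWORDS`, and `keep_url` are made up for illustration; a list comprehension stands in for the RDD filter.

```python
# Hypothetical sample URLs, invented for illustration only.
sample_urls = [
    "https://example.com/auth/login",
    "https://example.com/login?next=/auth",
    "https://example.com/auth/profile",
    "https://example.com/login",
    "https://example.com/home",
]

KEYWORDS = ["login", "auth"]


def keep_url(url):
    """Return True unless the URL contains every keyword."""
    return any(keyword not in url for keyword in KEYWORDS)


# Plain-Python stand-in for data.filter(keep_url).collect()
kept = [url for url in sample_urls if keep_url(url)]
for url in kept:
    print(url)
```

Only the first two URLs are dropped, because they contain both keywords. On a real RDD of URL strings the same function should be usable as `data.filter(keep_url)`.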

jxc


This method won't work, since I do want to retain some 'auth' pages; I only want to exclude URLs where both keywords (auth and login) exist. I will explore all methods. – inuyasha yolo Jan 28 '20 at 04:41
@inuyashayolo, this will keep the URLs containing `auth` if `login` does NOT exist at the same time, as I mentioned in my post. Requiring that at least one of the two does not exist excludes only the cases where both exist. Please test on some sample URLs to verify. – jxc Jan 28 '20 at 05:21
Or in other words, the condition is true in any of the following three cases: (1) `auth` does not exist, (2) `login` does not exist, (3) neither `auth` nor `login` exists. – jxc Jan 28 '20 at 05:29
Just tested this part in Python: any(e not in x for e in ['login', 'auth']). It works for the mutual case. I am not sure why this question is marked as a duplicate, since it is about an RDD, not a DataFrame. Thank you @jxc – inuyasha yolo Jan 28 '20 at 17:50
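The disagreement in these comments comes down to De Morgan's law: "at least one keyword is missing" is the same condition as "not both keywords are present". A small sketch that checks all four combinations (the function names `any_missing` and `not_both` are illustrative, not from the thread):

```python
from itertools import product


def any_missing(has_auth, has_login):
    # Shape of the accepted answer: at least one keyword is absent.
    return any(not present for present in (has_auth, has_login))


def not_both(has_auth, has_login):
    # The requirement from the question: drop only when both are present.
    return not (has_auth and has_login)


rows = []
for has_auth, has_login in product([False, True], repeat=2):
    # The two formulations must agree on every combination.
    assert any_missing(has_auth, has_login) == not_both(has_auth, has_login)
    rows.append((has_auth, has_login, any_missing(has_auth, has_login)))

for has_auth, has_login, kept in rows:
    print(f"auth={has_auth!s:5} login={has_login!s:5} keep={kept}")
```

The only combination that is not kept is the one where both keywords are present, so pages with `auth` alone survive the filter.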

cronoik · Answer 2 · 2020-01-28T07:25:56.690

If you are using a DataFrame, you are looking for contains. Negate the combined condition so that only rows containing both keywords are dropped:

# url is the column name
df = df.filter(~(df.url.contains('auth') & df.url.contains('login')))

When you are working with an RDD, please have a look at jxc's answer.
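One detail for the RDD case: the SQL pattern `%auth%login%` from the question is order-sensitive (it matches only when auth comes before login), while the question asks for the keywords in any position. The sketch below contrasts the two using Python's `re` module as a stand-in for the SQL pattern. The function names and the sample URL are hypothetical.

```python
import re

# Mirrors SQL LIKE '%auth%login%': "auth" somewhere, then "login" later.
AUTH_THEN_LOGIN = re.compile(r"auth.*login")


def keep_url_like(url):
    """Order-sensitive: drop only when auth precedes login."""
    return AUTH_THEN_LOGIN.search(url) is None


def keep_url_any_order(url):
    """Order-insensitive: drop whenever both keywords are present."""
    return not ("auth" in url and "login" in url)


reversed_url = "https://example.com/login?next=/auth"  # hypothetical URL
print(keep_url_like(reversed_url))
print(keep_url_any_order(reversed_url))
```

A URL with login before auth slips past the order-sensitive check but is dropped by the order-insensitive one, which is the behavior the question describes.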

edited Jan 28 '20 at 07:25

answered Jan 28 '20 at 01:43

