I have a PySpark SparkContext (sc) initialized.

instance = (data
          .filter(lambda x: len(x) != 0)
          # attempted SQL-style wildcard; Python's `in` treats % as a literal character
          .filter(lambda x: '%auth%login%' not in x)
          .map(lambda x: function(x))
          .reduceByKey(lambda x, y: x + y))

My goal is to filter out any URL that contains both the auth and login keywords, wherever they appear in the string.

In SQL I could use LIKE '%auth%login%', where % matches a string of any length.

How can I do this easily in PySpark?

Forgot to mention: there are 'auth' pages I do not want to filter out. I only want to filter out a URL containing auth when login is also in the string.

I am not sure why this is flagged as a duplicate; this is about an RDD, not a DataFrame.

2 Answers

2

With the PySpark RDD filter method, you just need to keep the strings where at least one of login or auth is NOT present. In Python:

data.filter(lambda x: any(e not in x for e in ['login', 'auth']) ).collect()
jxc
  • this method won't work since I do want to retain some 'auth' pages; I only want to exclude a url when both keywords (auth and login) exist in it. I will explore all methods – inuyasha yolo Jan 28 '20 at 04:41
  • @inuyashayolo, this will keep the urls containing `auth` if `login` does NOT exist at the same time, as I mentioned in my post. Requiring that either of the two is missing is the same as excluding only the cases where both exist. Please do test on some sample URLs to verify. – jxc Jan 28 '20 at 05:21
  • Or in other words, this condition covers any of the following three cases: (1) `auth` does not exist, (2) `login` does not exist, (3) neither `auth` nor `login` exists. – jxc Jan 28 '20 at 05:29
  • Just tested this part in Python: any(e not in x for e in ['login', 'auth']). It works for the mutual case. I am not sure why this question is marked as a duplicate, since it is about an RDD, not a DataFrame. Thank you @jxc – inuyasha yolo Jan 28 '20 at 17:50
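
To make the test suggested in the comments concrete, here is a minimal plain-Python sketch of the same predicate. The sample URLs and the names `urls`, `KEYWORDS`, and `keep` are hypothetical and only for illustration; the predicate itself is the `any(... not in ...)` expression from the answer.

```python
# Hypothetical sample URLs, for illustration only.
urls = [
    "example.com/auth/login",       # both keywords -> should be dropped
    "example.com/login?next=auth",  # both keywords, other order -> dropped
    "example.com/auth/profile",     # only auth -> kept
    "example.com/login",            # only login -> kept
    "example.com/home",             # neither keyword -> kept
]

KEYWORDS = ["login", "auth"]


def keep(url):
    """Return True unless every keyword occurs somewhere in the URL."""
    return any(k not in url for k in KEYWORDS)


# Plain-Python stand-in for the RDD filter step.
kept = [u for u in urls if keep(u)]
```

Assuming `data` is an RDD of URL strings, the same function could be passed directly as `data.filter(keep)`.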
1

If you are using a DataFrame, you are looking for contains:

# url is the column name
# keep every row unless its url contains both 'auth' and 'login'
df = df.filter(~(df.url.contains('auth') & df.url.contains('login')))

If you are working with an RDD, please have a look at jxc's answer.
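
The boolean logic behind this kind of filter can be checked without Spark. The sketch below is a plain-Python truth table with hypothetical function names: excluding a row only when both keywords are present is `not (a and b)`, which equals `(not a) or (not b)`, whereas `(not a) and (not b)` also drops rows that contain just one keyword.

```python
from itertools import product


def keep_unless_both(has_auth, has_login):
    """Drop only when both keywords are present."""
    return not (has_auth and has_login)


def de_morgan_form(has_auth, has_login):
    """Equivalent form: at least one keyword is missing."""
    return (not has_auth) or (not has_login)


def keep_only_if_neither(has_auth, has_login):
    """Stricter condition: drops a row as soon as either keyword is present."""
    return (not has_auth) and (not has_login)


# All four combinations of (has_auth, has_login).
cases = list(product([False, True], repeat=2))

# Cases where the stricter condition disagrees with the intended one.
mismatches = [c for c in cases if keep_unless_both(*c) != keep_only_if_neither(*c)]
```

Here `mismatches` lists the combinations where exactly one keyword is present, which are the 'auth'-only pages the question wants to keep.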

cronoik