I have a PySpark SparkContext (sc) initialized.
instance = (data
    .filter(lambda x: len(x) != 0)
    # attempted SQL-style wildcard; Python's `in` treats % as a literal character
    .filter(lambda x: '%auth%login%' not in x)
    .map(lambda x: function(x))
    .reduceByKey(lambda x, y: x + y))
My goal is to filter out any URL that contains both the auth and login keywords, wherever they appear in the string.
In SQL I could use LIKE '%auth%login%', where % matches any sequence of characters (including none).
How can I do this easily in PySpark?
I forgot to mention: there are 'auth' pages I do not want to filter out. I only want to filter out a URL containing auth when login is also in the string.
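To make the requirement concrete, here is a minimal sketch of the two possible readings as plain Python predicates. The function names and the sample URLs are hypothetical and only for illustration. Note that LIKE '%auth%login%' requires auth to come before login, while "both keywords in any position" does not.

```python
import re

# SQL's LIKE '%auth%login%' means: "auth", then anything, then "login".
# A close regex equivalent is auth.*login (DOTALL lets . match newlines too).
AUTH_THEN_LOGIN = re.compile(r"auth.*login", re.DOTALL)

def has_auth_then_login(url):
    """True if 'auth' appears somewhere before 'login' (like '%auth%login%')."""
    return AUTH_THEN_LOGIN.search(url) is not None

def has_auth_and_login(url):
    """True if both keywords appear, in either order."""
    return "auth" in url and "login" in url

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://example.com/auth/login",
    "http://example.com/login?next=/auth",
    "http://example.com/auth/profile",
    "http://example.com/home",
]

# Keep everything except URLs that contain both keywords.
kept = [u for u in urls if not has_auth_and_login(u)]
```

In the RDD pipeline, the same predicate could be passed to filter, for example `.filter(lambda x: not has_auth_and_login(x))`. This assumes each element x is the URL string itself; if x is a larger record, the URL field would need to be extracted first.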
I am not sure why this was flagged as a duplicate: this is about an RDD, not a DataFrame.