
I'm trying to remove the elements of an RDD one by one, but that doesn't work: elements reappear.

Here is part of my code:

rdd = spark.sparkContext.parallelize([0, 1, 2, 3, 4])
for i in range(5):
    rdd = rdd.filter(lambda x: x != i)
print(rdd.collect())
[0, 1, 2, 3]

So it seems that only the last filter is "remembered". I was expecting the RDD to be empty after this loop.

However, I do not understand why: each time, I save the new RDD returned by filter back into rdd, so shouldn't it keep all the transformations? If not, how should I do this?

Thank you for pointing out where I am wrong!

  • Because the `rdd` variable is getting replaced by new values on each loop, like `rdd = 1st filter`, then `rdd = 2nd filter`, and so on. Since you print outside of the loop you will only see the latest value. – Equinox Jan 20 '21 at 11:00
  • @venky__ yes, but shouldn't rdd have its elements removed one by one? After the first iteration there should be only [1,2,3]; after the second, shouldn't it be only [2,3]? Why/when did the 0 come back? – Ezriel_S Jan 20 '21 at 11:06
  • You should include code in your question. Use images only to show the outputs you get, or anything you cannot write in the question; otherwise it is impossible to copy-paste the code to reproduce the error. You should also include a (copy-pasteable) input example and the desired output. You will have a higher chance of receiving a relevant answer this way. – gionni Jan 20 '21 at 11:27
  • Related: https://stackoverflow.com/questions/57154430/how-to-apply-multiple-filters-in-a-for-loop-for-pyspark – mck Jan 20 '21 at 12:53
  • See also https://stackoverflow.com/questions/41666977/python-2-vs-python-3-difference-in-behavior-of-filter , which explains how filter works – mck Jan 20 '21 at 17:25

1 Answer


The result is actually correct; it is not a bug in Spark. Note that the lambda is written as x != i, and i is not substituted with its current value when the lambda is created: Python closures look up i only when the function is called (late binding). So at each iteration of the for loop, the RDD looks like

rdd
rdd.filter(lambda x: x != i)
rdd.filter(lambda x: x != i).filter(lambda x: x != i)
rdd.filter(lambda x: x != i).filter(lambda x: x != i).filter(lambda x: x != i)

etc.

Since all the filters close over the same variable i, they all see its latest value (4) by the time the RDD is evaluated, so every filter removes the same single item.
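
To see that this is ordinary Python closure behaviour rather than anything Spark-specific, here is a minimal sketch with no Spark involved (a hypothetical example, not from the original post):

# Each lambda closes over the variable i, not the value i had when it was defined.
funcs = [lambda x: x != i for i in range(5)]

# By the time the lambdas are called, i holds its final value 4,
# so every function actually tests x != 4.
print([f(4) for f in funcs])  # [False, False, False, False, False]
print([f(0) for f in funcs])  # [True, True, True, True, True]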

To avoid this, you can use functools.partial to make sure the current value of i is bound into the function:

from functools import partial
 
rdd = spark.sparkContext.parallelize([0,1,2,3,4])
for i in range(5):
    rdd = rdd.filter(partial(lambda x, i: x != i, i))

print(rdd.collect())
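
Another common idiom for the same fix (an alternative sketch, not from the original answer) is to freeze the current value of i with a default argument, which is evaluated when the lambda is defined:

rdd = spark.sparkContext.parallelize([0, 1, 2, 3, 4])
for i in range(5):
    # i=i binds the current loop value as the lambda's default at definition time
    rdd = rdd.filter(lambda x, i=i: x != i)

print(rdd.collect())  # []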

Or you can use reduce:

from functools import reduce

rdd = spark.sparkContext.parallelize([0,1,2])
rdd = reduce(lambda r, i: r.filter(lambda x: x != i), range(3), rdd)
print(rdd.collect())
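
The reduce version works because the outer lambda r, i is invoked immediately for each element of range(3): each call creates a fresh scope with its own i, so each inner lambda x: x != i captures a distinct value instead of a shared loop variable.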