I'm trying to remove the elements of an RDD one by one, but it doesn't work: the elements reappear.
Here is part of my code:
rdd = spark.sparkContext.parallelize([0, 1, 2, 3, 4])
for i in range(5):
    rdd = rdd.filter(lambda x: x != i)
print(rdd.collect())
[0, 1, 2, 3]
So it seems that only the last filter is "remembered". I expected the RDD to be empty after this loop.
However, I don't understand why: on every iteration I assign the new RDD returned by filter back to "rdd", so shouldn't it keep all the transformations? If not, how should I do this?
Thank you for pointing out where I am wrong!
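
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the transformations are probably all kept, but each lambda looks up `i` only when it finally runs (at `collect()`), by which time `i` is 4 for every filter. The example below reproduces the effect without Spark, assuming Python's lazy built-in `filter()` is a fair stand-in for a chain of RDD filters; the names `late`, `bound`, `late_result` and `bound_result` are illustrative only.

```python
# Plain-Python analogue: the built-in filter() is lazy, like an RDD transformation.
data = [0, 1, 2, 3, 4]

# Late binding: every lambda reads `i` when it is finally called, after the loop.
late = iter(data)
for i in range(5):
    late = filter(lambda x: x != i, late)
late_result = list(late)  # all five predicates see the final value of i

# Early binding: a default argument captures the value of `i` at definition time.
bound = iter(data)
for i in range(5):
    bound = filter(lambda x, i=i: x != i, bound)
bound_result = list(bound)  # each predicate keeps its own value of i

print(late_result)   # [0, 1, 2, 3]
print(bound_result)  # []
```

If that is indeed the cause, the same change in the loop above would read `rdd = rdd.filter(lambda x, i=i: x != i)`.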