I want to subtract one RDD from another. I looked through the documentation and found that subtract can do that. However, when I tested subtract, the final RDD stayed the same and the values were not removed!
Is there another function that does this, or am I using subtract incorrectly?
Here is the code that I used:
val vertexRDD: org.apache.spark.rdd.RDD[(VertexId, Array[Int])]
val clusters = vertexRDD.takeSample(false, 3)
val clustersRDD: RDD[(VertexId, Array[Int])] = sc.parallelize(clusters)
// `final` is a reserved word in Scala, so the result needs another name
val result = vertexRDD.subtract(clustersRDD)
result.collect().foreach(println(_))
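
One possible cause, offered as an assumption rather than a confirmed diagnosis: subtract matches whole elements by equals/hashCode, and a Scala Array compares by reference, not by contents. The sampled pairs come back from takeSample as new array objects, so they would never equal the pairs still in vertexRDD. The sketch below shows only that equality behaviour in plain Scala, without Spark; ArrayEqualityDemo and pairsEqual are hypothetical names made up for illustration.

```scala
object ArrayEqualityDemo {
  // Compares two (id, array) pairs the way tuple equality does:
  // the Long ids by value, the arrays by reference.
  def pairsEqual(x: (Long, Array[Int]), y: (Long, Array[Int])): Boolean =
    x == y

  // Compares the same pairs after converting the arrays to Seq,
  // which has structural (element-by-element) equality.
  def pairsEqualAsSeq(x: (Long, Array[Int]), y: (Long, Array[Int])): Boolean =
    (x._1, x._2.toSeq) == (y._1, y._2.toSeq)

  def main(args: Array[String]): Unit = {
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3) // same contents, different object

    println(pairsEqual((1L, a), (1L, b)))      // false
    println(pairsEqualAsSeq((1L, a), (1L, b))) // true
  }
}
```

If that is indeed the problem here, two things that might be worth trying are vertexRDD.subtractByKey(clustersRDD), which compares only the VertexId keys, or mapping the Array[Int] values to a Seq before calling subtract.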