
I've got an RDD in Spark which I've cached. Before I cache it, I repartition it. This works, and I can see in the Storage tab of the Spark UI that it has the expected number of partitions.
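
For context, the caching setup is roughly this (a minimal sketch; the partition count is illustrative):

// Repartition first, then cache; subsequent actions should reuse the cached partitions.
JavaRDD<String> cachedRdd = rdd
        .repartition(100)   // illustrative partition count, not the real one
        .cache();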

This is what the stages look like on subsequent runs:

[Screenshot: Repartition example]

It's skipping a bunch of the work that produced my cached RDD, which is great. What I'm wondering, though, is why Stage 18 starts with a repartition, since you can see one is already done at the end of Stage 17.

The steps I do in the code are:

List<Tuple2<String, Integer>> rawCounts = rdd
        .flatMap(...)
        .mapToPair(...)
        .reduceByKey(...)
        .collect();
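
A concrete version for illustration only (the lambda bodies are hypothetical stand-ins for the elided operations, assuming a word-count-style pipeline):

import java.util.Arrays;
import java.util.List;
import scala.Tuple2;

// Hypothetical word-count pipeline; the real flatMap/mapToPair/reduceByKey bodies are elided above.
List<Tuple2<String, Integer>> rawCounts = rdd
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum)
        .collect();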

To get the RDD, I grab it out of the session context. I also have to wrap it since I'm using Java:

// wrapRDD is an instance method, so an empty JavaRDD is created just to call it
JavaRDD<...> javaRdd = sc.emptyRDD();
return javaRdd.wrapRDD((RDD<...>) rdd);
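
For reference, I believe the static JavaRDD.fromRDD helper does the same wrapping (a String element type is just an example here; the ClassTag has to match it):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Assuming a String element type (illustrative); supply a matching ClassTag.
ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
JavaRDD<String> javaRdd = JavaRDD.fromRDD((RDD<String>) rdd, tag);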

Edit

I don't think this is specific to repartitioning. I've removed the repartitioning, and now some of the other operations I do prior to caching are appearing after the skipped stages. For example:

[Screenshot: Non-repartition example]

The green dot and everything before it should have already been worked out and cached.
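
As a sanity check, the lineage can be printed with toDebugString(); cached RDDs are annotated in its output (a quick sketch):

// Cached data shows up in the lineage as "CachedPartitions: n; ...".
System.out.println(javaRdd.toDebugString());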
