
I'm getting messages along the lines of the following in my Spark JobServer logs:

Stage 14 contains a task of very large size (9523 KB). The maximum recommended task size is 100 KB.

I'm creating my RDD with this code:

List<String> data = new ArrayList<>();
for (int i = 0; i < 2000000; i++) {
    data.add(UUID.randomUUID().toString());
}

JavaRDD<String> randomData = sc.parallelize(data).cache();

I understand that the tasks could be large the first time I run this, because the data in the RDD doesn't exist on the executor nodes yet.

I would have expected subsequent runs to be quick, though, since I'm using Spark JobServer to keep the session context around and reuse the RDD, so the data should already be present on the executors.
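As a sanity check, this is roughly how I'd confirm that the RDD is at least marked for in-memory storage (a minimal standalone sketch against the Spark Java API; the count() to force materialization and the getStorageLevel() call are diagnostics only, not part of my real job):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "cache-check");

        List<String> data = new ArrayList<>();
        for (int i = 0; i < 2000000; i++) {
            data.add(UUID.randomUUID().toString());
        }
        JavaRDD<String> randomData = sc.parallelize(data).cache();

        // cache() is lazy: the first action is what actually populates the cache
        randomData.count();

        // Should print something like "Memory Deserialized 1x Replicated"
        System.out.println(randomData.getStorageLevel().description());

        sc.stop();
    }
}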

The code is very simple:

// Function here is org.apache.spark.api.java.function.Function
private static Function<String, Boolean> func = new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) {
        return s.contains("a");
    }
};

// rdd is the cached RDD being reused across jobs
rdd.filter(func).count();
  • Are you looking for NamedObjects? (https://github.com/spark-jobserver/spark-jobserver#named-objects) – noorul Mar 08 '17 at 04:05
  • I don't think so. I can find the existing RDD fine. I'm wondering about the warning and why there are any large tasks being generated if the data should already be cached on the executors. – yarrichar Mar 08 '17 at 04:52
  • Interesting! Did you check the Spark UI (port 4040)? It should tell you whether the data is indeed cached, and you should be able to see what is executed and why. A possible, if unlikely, explanation is that the RDD did not fit entirely into the cache. – Daniel Darabos Mar 10 '17 at 09:21
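Following up on the last comment: besides the Storage tab in the UI, I believe the cached fraction can also be checked programmatically via the developer API getRDDStorageInfo(). A rough sketch, with the exact RDDInfo accessors being an assumption on my part:

// sc is the JavaSparkContext kept alive by the JobServer context.
// RDDInfo is a @DeveloperApi class, so treat these accessors as an assumption.
for (org.apache.spark.storage.RDDInfo info : sc.sc().getRDDStorageInfo()) {
    System.out.println(info.name() + ": "
            + info.numCachedPartitions() + "/" + info.numPartitions()
            + " partitions cached, " + info.memSize() + " bytes in memory");
}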
