I'm getting messages along the lines of the following in my Spark JobServer logs:
Stage 14 contains a task of very large size (9523 KB). The maximum recommended task size is 100 KB.
I'm creating my RDD with this code:
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.spark.api.java.JavaRDD;

List<String> data = new ArrayList<>();
for (int i = 0; i < 2000000; i++) {
    data.add(UUID.randomUUID().toString());
}
JavaRDD<String> randomData = sc.parallelize(data).cache();
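My understanding is that sc.parallelize serializes this whole driver-side list into the job, which is presumably where the size comes from. For comparison, here is a sketch of what I imagine an executor-side alternative would look like (untested; the partition counts are placeholder numbers I made up):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.UUID;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

// Ship only a tiny list of placeholders and generate the UUIDs inside each
// partition, so the serialized task stays small. 8 partitions is a guess;
// declaring this in a static context avoids dragging in the enclosing class.
final int numPartitions = 8;
final int perPartition = 2000000 / numPartitions;
JavaRDD<String> randomData = sc
    .parallelize(Collections.nCopies(numPartitions, 0), numPartitions)
    .mapPartitionsWithIndex(
        new Function2<Integer, Iterator<Integer>, Iterator<String>>() {
            public Iterator<String> call(Integer partition, Iterator<Integer> ignored) {
                List<String> part = new ArrayList<>(perPartition);
                for (int i = 0; i < perPartition; i++) {
                    part.add(UUID.randomUUID().toString());
                }
                return part.iterator();
            }
        }, false)
    .cache();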
I understand that the first time I run the original version it could be big, because the data in the RDD doesn't exist on the executor nodes yet.
I would have thought that subsequent runs would be quick, though, since I'm using Spark JobServer to keep the session context around and reuse the RDD, so the data should already exist on the nodes.
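As far as I can tell, that assumption can be checked by printing the RDD's storage level and lineage (both are existing JavaRDD methods):

// A cached RDD should report a memory/disk storage level and show it
// in the lineage printed by toDebugString().
System.out.println(randomData.getStorageLevel().description());
System.out.println(randomData.toDebugString());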
The code is very simple:
import org.apache.spark.api.java.function.Function;

// Declared static so the serialized closure contains only this function,
// not the enclosing class.
private static Function<String, Boolean> func = new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("a");
    }
};
----
randomData.filter(func).count();
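In case the closure itself were somehow the problem, I believe its size can be sanity-checked with plain Java serialization (a rough sketch; Spark uses its own closure serializer, so this is only an approximation):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Rough check of how big the shipped function is; Spark's Function
// interface extends Serializable, so plain Java serialization works.
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
    out.writeObject(func);
} catch (IOException e) {
    throw new RuntimeException(e);
}
System.out.println("Serialized size of func: " + bytes.size() + " bytes");

For a static function like this I'd expect that to be tiny, which is why the 9523 KB figure confuses me.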