
I'm trying to improve my understanding of "collect" in my Spark app, and I'm dealing with this code:

  val triple = logData.map(x => x.split('@'))
                  .map(x => (x(1),x(0),x(2)))
                  .collect()
                  .sortBy(x => (x._1,x._2))
  val idx = sc.parallelize(triple)

Basically I'm creating an RDD[(String, String, String)] with an unnecessary (imho) collect/parallelize step (200k elements in the original RDD).

The Spark guide says: "Collect: Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data."

BTW: is 200k sufficiently small?

I feel that this code should be "lighter" (without the collect/parallelize):

  val triple = logData.map(x => x.split('@'))
                  .map(x => (x(1),x(0),x(2)))
                  .sortBy(x => (x._1,x._2))
  val idx = triple

But after having run the same app many times (locally, not distributed), I always get faster times with the first version, which in my opinion is doing extra work (first collect, then parallelize).

The entire app (not just this code snippet) takes 48 seconds on average in the first case, and at least 52 seconds in the second case.
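To isolate these two snippets from the rest of the app, here is how I could time each variant alone, forcing it with an action (the `time` helper below is a hypothetical sketch, not part of my app; `logData` and `sc` are the same as above):

  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
    result
  }

  time("collect + driver-side sort + parallelize") {
    val triple = logData.map(_.split('@'))
                        .map(x => (x(1), x(0), x(2)))
                        .collect()                  // pull all rows to the driver
                        .sortBy(x => (x._1, x._2))  // plain Scala in-memory sort
    sc.parallelize(triple).count()                  // action to materialize the RDD
  }

  time("distributed sortBy") {
    logData.map(_.split('@'))
           .map(x => (x(1), x(0), x(2)))
           .sortBy(x => (x._1, x._2))               // shuffle-based sort
           .count()                                 // action to trigger the job
  }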

How is this possible?

Thanks in advance

Fabio Fantoni
  • Are you running this local? Or distributed? – Justin Pihony May 12 '15 at 16:11
  • Forgot to say: I'm running local – Fabio Fantoni May 12 '15 at 16:29
  • I don't see the same thing on a simple case using 1m records. 4s on the top one, 1s for the bottom. Also, keep in mind that you should test this against a warmed-up engine. When performance testing, you should almost always go against a warmed-up instance. – Justin Pihony May 12 '15 at 17:45
  • My times refer to my whole app, not just to those 2 code snippets... I'll try to run them "alone" and see... So this is interesting: it seems that the rest of my app is affected by this collect/parallelize, but in an unexpected way... – Fabio Fantoni May 12 '15 at 18:49

1 Answer


I think it is because the dataset is too small. In the second version you pay for scheduling a shuffle to perform the distributed sort, while in the first version the sort is a plain in-memory operation on the driver, which can be faster when everything runs locally. As your dataset grows, the shuffle becomes worthwhile, and at some point it may not even be possible to collect everything into the driver.
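For example (a sketch reusing `logData` and the '@'-delimited format from the question), you can see the shuffle that sortBy introduces by inspecting the RDD lineage:

  val sorted = logData.map(_.split('@'))
                      .map(x => (x(1), x(0), x(2)))
                      .sortBy(x => (x._1, x._2))   // range-partitions, i.e. shuffles

  // The lineage shows a ShuffledRDD: sortBy costs an extra stage (plus a
  // sampling pass for the RangePartitioner) and the task scheduling around it.
  println(sorted.toDebugString)

The collect() variant sorts a plain array in the driver with no Spark stages at all, and on a single machine with 200k rows that scheduling overhead can easily outweigh the cost of collecting and sorting in memory.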

yjshen
  • I know that collect should be avoided with large datasets (btw, how large is "large"?)... But even though I'm running it locally, this behaviour is quite strange to me: doing less work turns out to be slower than doing more... – Fabio Fantoni May 12 '15 at 16:54
  • Could you please suggest a reasonable number of elements above which my dataset should be considered a large one? I'm curious to investigate... Hoping this answer could help someone else... – Fabio Fantoni May 12 '15 at 17:28
  • The size of your JVM's RAM is the limiting factor. If you're not seeing OutOfMemoryError, it's a small enough dataset. 200k is tiny. – Dick Chesterwood Feb 28 '18 at 10:14
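To make "small enough" concrete, a back-of-the-envelope sketch (the per-row byte count is an assumed figure, not a measurement):

  // Compare the driver's max heap with a rough estimate of the collected data.
  val heapMB = Runtime.getRuntime.maxMemory / (1 << 20)  // driver JVM max heap
  val rows   = 200000L
  val assumedBytesPerRow = 150L          // assumption: three short strings per row
  val estMB  = rows * assumedBytesPerRow / (1 << 20)     // ~28 MB for this dataset
  println(s"driver heap: $heapMB MB, collect estimate: ~$estMB MB")

A few tens of megabytes against a default driver heap of hundreds of megabytes or more is why 200k rows counts as tiny here.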