
How can I force Spark to execute a call to map, even when its lazy evaluation decides the call does not need to run?

I have tried putting cache() on the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.

MetallicPriest
  • This question suffers the same problem as [your previous one](http://stackoverflow.com/q/31383733/1093528): no code to work with. Please post example code. Also, if this is related to the same problem, do not open a new question. – fge Jul 13 '15 at 12:52
  • That is a general question. Basically, how do I stop Spark from making assumptions and make it execute whatever code I give it? – MetallicPriest Jul 13 '15 at 12:53
  • We can't tell what assumptions Spark is making without the code that you claim it's making assumptions about. Post the code, please. – The Archetypal Paul Jul 13 '15 at 14:06
  • Just out of interest, why would you want this? Spark is Spark, with a definite thing in mind. – thebluephantom Nov 02 '19 at 10:01

2 Answers


Short answer:

To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
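For example, a minimal PySpark sketch (assuming a live SparkContext named `sc`, as in a Spark shell; the data is made up):

```python
rdd = sc.parallelize([1, 2, 3, 4])

# map() is lazy: nothing runs when this line executes.
doubled = rdd.map(lambda x: x * 2)

# count() is an action: it forces the map to actually run.
print(doubled.count())  # 4
```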

Longer answer:

OK, let's review the RDD operations.

RDDs support two types of operations:

  • transformations - which create a new dataset from an existing one.
  • actions - which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
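In PySpark terms, that contrast looks roughly like this (a sketch, again assuming an existing `sc`):

```python
rdd = sc.parallelize([1, 2, 3, 4])

# Transformation: describes a new RDD, computes nothing yet.
squares = rdd.map(lambda x: x * x)

# Action: runs the computation and returns a single value to the driver.
total = squares.reduce(lambda a, b: a + b)
print(total)  # 1 + 4 + 9 + 16 = 30
```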

All transformations in Spark are lazy, in that they do not compute their results right away.

Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
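One way to observe the laziness is a transformation that would fail, but doesn't, because nothing runs yet (a contrived sketch; the zero is deliberate):

```python
rdd = sc.parallelize([1, 2, 0])

# No error here, even though 1/0 is coming: map only records the plan.
inverted = rdd.map(lambda x: 1 / x)

# Only an action would surface the ZeroDivisionError:
# inverted.collect()
```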

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
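Sketched in PySpark (RDD `cache` defaults to MEMORY_ONLY, as the comments below also note):

```python
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
processed = rdd.map(lambda x: x * 2)

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
# To choose a different level, call persist() instead, e.g.:
# processed.persist(StorageLevel.MEMORY_AND_DISK)
processed.cache()

processed.count()  # first action computes and caches the elements
processed.count()  # subsequent actions read from the cache
```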

Conclusion

To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.

Reference: [RDD Operations in the Spark Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations)

eliasah
  • What is the difference between persist and cache? And aren't RDDs persisted in memory anyway if they are only used once? – MetallicPriest Jul 13 '15 at 13:03
  • With `cache`, you use only the default storage level MEMORY_ONLY. With `persist`, you can specify whichever storage level you want; use `persist` to assign a storage level other than MEMORY_ONLY to the RDD. – eliasah Jul 13 '15 at 13:22
  • How about using `take` to trigger the persist? I ran an experiment with version 1.6.1: `count` needed one more stage (a shuffle and aggregate) than `take`, so I think the `take` action is more efficient. – JiaMing Lin Jul 22 '16 at 14:22
  • Is there a list of all actions somewhere? – rsmith54 Feb 07 '18 at 18:23
  • @rsmith54 https://spark.apache.org/docs/2.1.1/programming-guide.html#actions gives the most common ones, and there should be a link to the docs for an exhaustive list for whichever language you use – bendl Mar 01 '18 at 21:28
  • That is useful for the most common, but it would be great to have an exhaustive list of actions. – rsmith54 Mar 01 '18 at 22:42
  • @eliasah What do you mean by sometimes? I would like to know more about the conditions under which a count action wouldn't trigger intermediate map steps. – Jane Wayne Nov 02 '18 at 00:47

Spark transformations only describe what has to be done; to trigger execution you need an action.

In your case there is a deeper problem. If the goal is to produce some kind of side effect, like storing data on HDFS, the right method to use is foreach: it is an action, and it has clean semantics. Just as important, unlike map, it doesn't imply referential transparency.
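A minimal sketch of that pattern in PySpark (assuming a live SparkContext `sc`; the `upload` function is a hypothetical stand-in for the OP's HDFS upload logic):

```python
def upload(record):
    # Hypothetical placeholder for the real HDFS upload; here we just print.
    print("uploading", record)

rdd = sc.parallelize([1, 2, 3, 4])

# foreach is an action: it executes immediately, on the workers,
# once per element, purely for its side effects.
rdd.foreach(upload)
```

If each upload needs shared setup, such as one client connection per partition, `foreachPartition` (mentioned in the comments below) is the usual variant: it receives an iterator over a whole partition instead of one element.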

zero323
  • Would foreach also execute in parallel? Can you give some example? – MetallicPriest Jul 13 '15 at 13:05
  • Yes, it is executed in parallel on the worker nodes. The simplest thing is to log or print things. Using PySpark: `from __future__ import print_function; rdd.foreach(print)`. Another option is to `foreachPartition`. – zero323 Jul 13 '15 at 13:12