I am using the Spark Scala API. In my code, I have an RDD with the following element type:

Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]

I need to process (validate and modify values in) the second element of each tuple in the RDD. I am using the map function to do that:

myRDD.map(line => mappingFunction(line))

Unfortunately, the mappingFunction is not invoked. This is the code of the mapping function:

def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
  println("Inside mappingFunction")
  line
}

When my program ends, nothing has been printed to stdout.

To investigate the problem, I wrote a small snippet that did work:

val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))

And the following mapping function was invoked:

def callInt(i: Int): Unit = {
  println("Inside callInt")
}

How can I get the RDD mapping function mappingFunction invoked? Thank you.

stacker
  • Have you invoked an action like collect after the map? Spark executes transformations lazily. Your sample snippet works because it uses a local collection, not an RDD. – Steffen Schmitz Feb 03 '18 at 08:08
  • @SteffenSchmitz No, I did not invoke collect. I did so after your suggestion and the `mappingFunction` got invoked. Thank you for your help – stacker Feb 03 '18 at 08:19
  • Transformations should always be followed by an action. count() is an action that materializes your transformations. – Balaji Reddy Feb 03 '18 at 08:52

1 Answer

x is a List, which is evaluated eagerly, so your function is invoked even though you never call an action.

myRDD is an RDD, and RDDs are lazy: your transformations (map, flatMap, filter) are not actually executed until a result is needed.

That means your map function does not run until you perform an action. An action is an operation that triggers execution of the preceding operations (called transformations).

Some examples of actions are collect and count.
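The same strict-versus-lazy contrast can be reproduced in plain Scala without Spark. The sketch below is only an analogy: a Scala `Iterator` is not an RDD, and the names `LazyVsStrict`, `run` and the call counter are made up for illustration. It counts how often the function is called after a strict `List.map`, after a lazy `Iterator.map`, and after the iterator is finally consumed.

```scala
object LazyVsStrict {
  // Returns the number of callInt invocations observed at three points in time.
  def run(): (Int, Int, Int) = {
    var calls = 0
    def callInt(i: Int): Int = { calls += 1; i }

    val x = List.range(1, 10) // the nine elements 1 to 9

    // Strict: List.map applies callInt to every element right away.
    val strict = x.map(callInt)
    val afterStrictMap = calls

    // Lazy: Iterator.map only wraps the iterator; callInt is not called yet.
    val lazyMapped = x.iterator.map(callInt)
    val afterLazyMap = calls

    // Consuming the iterator plays the role of a Spark action.
    val total = lazyMapped.sum
    val afterConsume = calls

    (afterStrictMap, afterLazyMap, afterConsume)
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

The counter does not change between the second and third measurement until `sum` consumes the iterator, which mirrors how an RDD's map only runs once an action is called.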

If you do this:

myRDD.map(line => mappingFunction(line)).count()

You'll see your prints. There is no problem with your code at all; you just need to take the lazy nature of RDDs into account.
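If you want to confirm this without relying on println, a minimal sketch follows. It assumes Spark 2.x or later on the classpath and a local master; the object name `LazyMapCheck`, the sample data and the accumulator name are hypothetical and not part of your code. It uses a `LongAccumulator` to count invocations before and after the action. Note also that on a real cluster, println inside map writes to the executors' stdout rather than the driver console, so the messages may end up in the executor logs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyMapCheck {
  // Returns (calls before the action, calls after the action, count result).
  def run(): (Long, Long, Long) = {
    // Assumption: local mode, used here only for illustration.
    val conf = new SparkConf().setAppName("lazy-map-check").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("mappingFunction calls")
      val myRDD = sc.parallelize(Seq("a", "b", "c")) // stand-in for the real RDD

      // Transformation only: the function body has not run at this point.
      val mapped = myRDD.map { line =>
        calls.add(1)
        line
      }
      val before = calls.sum

      // Action: triggers the job, so the map function is now applied.
      val n = mapped.count()
      val after = calls.sum

      (before, after, n)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(run())
}
```

Accumulators updated inside transformations can be incremented more than once if a task is retried or the RDD is recomputed, so treat the counter as a debugging aid rather than an exact measurement.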

There is a good answer about this topic here. Also, you can find more info and a full list of transformations and actions here.

SCouto
  • Thank you for your answer @SCouto. This works. I am not sure, though, whether I should accept your answer, since @SteffenSchmitz was the first to answer with his hint about using the `collect` method – stacker Feb 03 '18 at 08:25
  • @stacker Sure, just go ahead. There is no use in posting the same answer twice. – Steffen Schmitz Feb 03 '18 at 08:26
  • Thank you @SteffenSchmitz. I accepted SCouto's answer. – stacker Feb 03 '18 at 08:28