
The input file contains 20 lines. I am trying to count the total number of records using the reduce function. Can anyone please explain why there is a difference in the results? Because here the value of y is nothing but 1.

Default number of partitions: 4

scala> val rdd = sc.textFile("D:\\LearningPythonTomaszDenny\\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8

scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20

  • The best way to achieve it is to invoke the count() function. But I wanted to understand the internals of the reduce function. – Pratik Garg Apr 21 '19 at 11:13

1 Answer


Because here the value of y is nothing but 1.

That is simply not true. reduce consists of three stages (not in the strict Spark meaning of the word):

  • Distributed reduce on each partition.
  • Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
  • Local driver reduction.

In your case the results of the first and second stages will be the same for both functions, but the first approach will simply ignore those partial results in the final driver-side reduction. In other words, no matter what the result for a partition was, it always adds only 1: with an even split of four partitions of five records each, that gives 5 + 1 + 1 + 1 = 8 instead of 5 + 5 + 5 + 5 = 20.
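To make that concrete, here is a minimal sketch in plain Scala (no Spark required) that simulates the stages; the even split into four partitions of five ones is an assumption matching the setup above:

// Stage 1 input: 4 partitions, each holding 5 records already mapped to 1
// (assumes an even split, as in the question).
val partitions = Seq.fill(4)(Seq.fill(5)(1))

// Stage 1: reduce each partition. Both functions agree here, because
// every element is 1: each partial result is 5.
val partials = partitions.map(_.reduce((acc, y) => acc + y))  // Seq(5, 5, 5, 5)

// Stage 3: local driver reduction over the partial results.
// (acc, y) => acc + 1 throws each partial away and adds 1 instead:
partials.reduce((acc, y) => acc + 1)  // 5 + 1 + 1 + 1 = 8
// (acc, y) => acc + y actually sums the partials:
partials.reduce((acc, y) => acc + y)  // 5 + 5 + 5 + 5 = 20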

Such an approach would work only with a non-parallel, single-pass (strictly sequential) reduce implementation, as the sketch below shows.
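For contrast, a plain sequential left fold (ordinary Scala collections, not Spark) does count correctly with the acc + 1 variant, because every record is visited exactly once and there are no partial results to discard:

// One sequential pass over all 20 mapped records.
val records = Seq.fill(20)(1)

// reduceLeft seeds the accumulator with the first element (1) and then
// adds 1 for each of the remaining 19 elements: 1 + 19 = 20.
records.reduceLeft((acc, y) => acc + 1)  // 20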

– user10938362
  • I am not able to fully get it. Why does the first approach ignore the partial results? The example below results in 5. In [32]: rdd1 = sc.textFile("D:\LearningPythonTomaszDenny\Codebase\\wholeTextFiles\\names1.txt",1) In [33]: rdd1.map(lambda x: 1).reduce(lambda acc, y: acc + 1) Out[33]: 5 – Pratik Garg Apr 21 '19 at 12:02