I'm new to Apache Spark and don't know whether I'm misunderstanding reduceByKey or encountering a bug. I'm using the spark-1.4.1-bin-hadoop1 build, due to issues with the Python Cassandra interface in spark-1.4.1-bin-hadoop2.

reduceByKey(lambda x,y: y[0]) returns the first value of the last tuple, but reduceByKey(lambda x,y: x[0]) throws an exception.

I'm trying to get to reduceByKey(lambda x,y: x[0]+y[0]) in order to sum values by key, but that statement throws the same exception as the x[0] case.

Code Fragments:

import sys

from pyspark import SparkContext, SparkConf
from pyspark import StorageLevel
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *

import h5py
import numpy
import os
import datetime

if __name__ == "__main__":

  sc_conf = SparkConf().setAppName("VIIRS_QC").set("spark.default.parallelism", "49").set("spark.storage.memoryFraction", "0.75")
  sc = SparkContext(conf=sc_conf)

  sqlContext=SQLContext(sc)

  f=h5py.File("/mnt/NAS/pmacharr/sample_20130918/GMTCO_npp_d20130919_t0544413_e0546054_b09816_c20130919063740340635_noaa_ops.h5", 'r')
  result = f["/All_Data/VIIRS-MOD-GEO-TC_All/Latitude"]
  myLats = numpy.ravel(result).tolist()
  ...
  t1 = numpy.dstack((myLats, myLons, myArray, myM2_radiance, myDNP))

  t1 = t1.tolist()

  x=sc.parallelize(t1[0][123401:123410])

  print t1[0][123401:123410]
  print "input list=", t1[0][123401:123410]

  y=x.map(
      lambda (lat, lon, m6_rad, m2_rad, dn):
          ((round(lat, 0), round(lon, 0), dn), (m2_rad, m6_rad))
  )

  print "map"
  print y.collect()

  print "reduceByKey(lambda x,y: x)=", y.reduceByKey(lambda x,y: x ).collect()
  print "reduceByKey(lambda x,y: y)=", y.reduceByKey(lambda x,y: y ).collect()
  print "reduceByKey(lambda x,y: y[0])=", y.reduceByKey(lambda x,y: y[0]).collect()
  print "reduceByKey(lambda x,y: x[0])=", y.reduceByKey(lambda x,y: x[0]).collect()

  sc.stop()
  exit()

Output:

./bin/spark-submit --driver-class-path ./lib/spark-examples-1.4.1-hadoop1.0.4.jar ./agg_v.py

input list= [
  [12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],
  [12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0],
  [12.093526840209961, 111.83319091796875, 40778.0, 7446.0, 16.0],
  [12.092370986938477, 111.82584381103516, 39389.0, 7352.0, 16.0],
  [12.091206550598145, 111.81849670410156, 42592.0, 7602.0, 16.0],
  [12.09003734588623, 111.8111343383789, 38572.0, 7328.0, 16.0],
  [12.088878631591797, 111.80377960205078, 46203.0, 7939.0, 16.0],
  [12.087711334228516, 111.7964096069336, 42690.0, 7608.0, 16.0],
  [12.08655071258545, 111.78905487060547, 40942.0, 7478.0, 16.0]
]

map=[
  ((12.0, 112.0, 16.0), (7469.0, 41252.0)),
  ((12.0, 112.0, 16.0), (7444.0, 40811.0)),
  ((12.0, 112.0, 16.0), (7446.0, 40778.0)),
  ((12.0, 112.0, 16.0), (7352.0, 39389.0)),
  ((12.0, 112.0, 16.0), (7602.0, 42592.0)),
  ((12.0, 112.0, 16.0), (7328.0, 38572.0)),
  ((12.0, 112.0, 16.0), (7939.0, 46203.0)),
  ((12.0, 112.0, 16.0), (7608.0, 42690.0)),
  ((12.0, 112.0, 16.0), (7478.0, 40942.0))
]

reduceByKey(lambda x,y: x)= [((12.0, 112.0, 16.0), (7469.0, 41252.0))]
reduceByKey(lambda x,y: y)= [((12.0, 112.0, 16.0), (7478.0, 40942.0))]
reduceByKey(lambda x,y: y[0])= [((12.0, 112.0, 16.0), 7478.0)]
reduceByKey(lambda x,y: x[0])=
15/09/24 12:02:39 ERROR Executor: Exception in task 14.0 in stage 8.0 (TID 406)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/apps/ots/spark-1.4.1-bin-hadoop1/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
...
 print "reduceByKey(lambda x,y: x[0])=", y.reduceByKey(lambda x,y: x[0]).collect()
TypeError: 'float' object has no attribute '__getitem__'
  • Please start with fixing indentation and formatting. – zero323 Sep 24 '15 at 16:09
  • It's probably an issue with initialization. When you do reduceByKey(lambda a, b: a+b) for example, the a is the accumulated value and b is the next element from the list (or RDD). So, when you reduce the first KV-pair, what is the accumulator? It's probably not a list, it's probably either None, or 0. But then None[0] or 0[0] doesn't make sense so Python/Spark complains. Think of the a as the "running total" and b as the next item to process. – TravisJ Sep 24 '15 at 18:46
  • Hi TravisJ, thanks for your help. I understand it better now. – Peter Sep 24 '15 at 19:33
  • >>> y2.reduceByKey(lambda (x), y: x[0]+y[0]).collect() [((12.0, 112.0, 16.0), 82063.0)] Seems to work. – Peter Sep 24 '15 at 19:38
  • >>> y2.reduceByKey(lambda x, y: (x[1]+y[1], 0)).collect() [((12.0, 112.0, 16.0), (14913.0, 0))], also seems to work. – Peter Sep 24 '15 at 19:52
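
TravisJ's running-total picture can be checked outside Spark: reduceByKey repeatedly merges two values for the same key, and whatever the lambda returns becomes the left-hand argument of the next merge. A minimal sketch with Python's built-in reduce (no Spark involved; the three sample values are taken from the mapped output above) reproduces the same TypeError:

from functools import reduce  # built in on Python 2, explicit import on Python 3

# Three of the values that share the key (12.0, 112.0, 16.0), as produced by the map step
vals = [(7469.0, 41252.0), (7444.0, 40811.0), (7446.0, 40778.0)]

# lambda x, y: x keeps the running value a tuple, so every merge succeeds
print reduce(lambda x, y: x, vals)      # (7469.0, 41252.0)

# lambda x, y: x[0] returns a float after the first merge; the next merge then
# tries to index that float and raises
# TypeError: 'float' object has no attribute '__getitem__'
print reduce(lambda x, y: x[0], vals)

The same applies to lambda x,y: x[0]+y[0]: it happens to work when exactly two values are merged for a key, but as soon as a third value comes in, the left-hand argument is already a float and indexing it fails.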

1 Answer

Using pyspark:

>>> t1=[
...   [12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],
...   [12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0],
... ]
>>> t1  
[[12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],[12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0]]  
>>> x=sc.parallelize(t1)  
>>> y2=x.map(lambda (lat, lon, m6_rad, m2_rad, dn):((round(lat,0),round(lon,0),dn), (m6_rad, m2_rad)))  
>>> y2.collect()  
[((12.0, 112.0, 16.0), (41252.0, 7469.0)), ((12.0, 112.0, 16.0), (40811.0, 7444.0))]  
>>> y2.reduceByKey(lambda (x), y: x[0]+y[0]).collect()
[((12.0, 112.0, 16.0), 82063.0)]
>>>

Or you can do:

>>> y2.reduceByKey(lambda x, y: (x[0]+y[0], 0)).collect()
[((12.0, 112.0, 16.0), (82063.0, 0))]
>>> y2.reduceByKey(lambda x, y: (x[1]+y[1], 0)).collect()
[((12.0, 112.0, 16.0), (14913.0, 0))]
>>>

Not sure which is the "best" way, but it's producing what I'm after.

Would it be "better" to implement the map differently?
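
One possibility (a sketch against the y2 and x defined above, not an actual run): have the reducer return the same (m6_rad, m2_rad) shape as its inputs, so it stays correct no matter how many values get merged for a key, and both radiances are summed in one pass:

# Reducer output keeps the same two-element shape as its inputs
y2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
# expected: [((12.0, 112.0, 16.0), (82063.0, 14913.0))]

Or, if only one radiance is needed, map the value down to that single number first (y3 here is a hypothetical name) and the reducer becomes plain addition:

# Hypothetical y3: value is just m2_rad
y3 = x.map(lambda (lat, lon, m6_rad, m2_rad, dn): ((round(lat, 0), round(lon, 0), dn), m2_rad))
y3.reduceByKey(lambda a, b: a + b).collect()
# expected: [((12.0, 112.0, 16.0), 14913.0)]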
