I'm new to Apache Spark and don't know whether I'm misunderstanding reduceByKey or encountering a bug. I'm using the spark-1.4.1-bin-hadoop1 build, due to issues with the Python Cassandra interface in spark-1.4.1-bin-hadoop2.

reduceByKey(lambda x,y: y[0]) returns the first value of the last tuple, but reduceByKey(lambda x,y: x[0]) throws an exception.

I'm trying to get to reduceByKey(lambda x,y: x[0]+y[0]) in order to sum values by key, but that statement throws the same exception as the x[0] case.

Code Fragments:

import sys

from pyspark import SparkContext, SparkConf
from pyspark import StorageLevel
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *

import h5py
import numpy
import os
import datetime

if __name__ == "__main__":

  sc_conf = SparkConf().setAppName("VIIRS_QC").set("spark.default.parallelism", "49").set("spark.storage.memoryFraction", "0.75")
  sc = SparkContext(conf=sc_conf)

  sqlContext=SQLContext(sc)

  f=h5py.File("/mnt/NAS/pmacharr/sample_20130918/GMTCO_npp_d20130919_t0544413_e0546054_b09816_c20130919063740340635_noaa_ops.h5", 'r')
  result = f["/All_Data/VIIRS-MOD-GEO-TC_All/Latitude"]
  myLats = numpy.ravel(result).tolist()
  ...
  t1 = numpy.dstack((myLats, myLons, myArray, myM2_radiance, myDNP))

  t1 = t1.tolist()

  x=sc.parallelize(t1[0][123401:123410])

  print t1[0][123401:123410]
  print "input list=", t1[0][123401:123410]

  y=x.map(
      lambda (lat, lon, m6_rad, m2_rad, dn):
          ((round(lat, 0), round(lon, 0), dn), (m2_rad, m6_rad))
  )

  print "map"
  print y.collect()

  print "reduceByKey(lambda x,y: x)=", y.reduceByKey(lambda x,y: x ).collect()
  print "reduceByKey(lambda x,y: y)=", y.reduceByKey(lambda x,y: y ).collect()
  print "reduceByKey(lambda x,y: y[0])=", y.reduceByKey(lambda x,y: y[0]).collect()
  print "reduceByKey(lambda x,y: x[0])=", y.reduceByKey(lambda x,y: x[0]).collect()

  sc.stop()
  exit()

Output:

./bin/spark-submit --driver-class-path ./lib/spark-examples-1.4.1-hadoop1.0.4.jar ./agg_v.py

input list= [
  [12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],
  [12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0],
  [12.093526840209961, 111.83319091796875, 40778.0, 7446.0, 16.0],
  [12.092370986938477, 111.82584381103516, 39389.0, 7352.0, 16.0],
  [12.091206550598145, 111.81849670410156, 42592.0, 7602.0, 16.0],
  [12.09003734588623, 111.8111343383789, 38572.0, 7328.0, 16.0],
  [12.088878631591797, 111.80377960205078, 46203.0, 7939.0, 16.0],
  [12.087711334228516, 111.7964096069336, 42690.0, 7608.0, 16.0],
  [12.08655071258545, 111.78905487060547, 40942.0, 7478.0, 16.0]
]

map=[
  ((12.0, 112.0, 16.0), (7469.0, 41252.0)),
  ((12.0, 112.0, 16.0), (7444.0, 40811.0)),
  ((12.0, 112.0, 16.0), (7446.0, 40778.0)),
  ((12.0, 112.0, 16.0), (7352.0, 39389.0)),
  ((12.0, 112.0, 16.0), (7602.0, 42592.0)),
  ((12.0, 112.0, 16.0), (7328.0, 38572.0)),
  ((12.0, 112.0, 16.0), (7939.0, 46203.0)),
  ((12.0, 112.0, 16.0), (7608.0, 42690.0)),
  ((12.0, 112.0, 16.0), (7478.0, 40942.0))
]

reduceByKey(lambda x,y: x)= [((12.0, 112.0, 16.0), (7469.0, 41252.0))]
reduceByKey(lambda x,y: y)= [((12.0, 112.0, 16.0), (7478.0, 40942.0))]
reduceByKey(lambda x,y: y[0])= [((12.0, 112.0, 16.0), 7478.0)]
reduceByKey(lambda x,y: x[0])=
15/09/24 12:02:39 ERROR Executor: Exception in task 14.0 in stage 8.0 (TID 406)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/apps/ots/spark-1.4.1-bin-hadoop1/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
...
 print "reduceByKey(lambda x,y: x[0])=", y.reduceByKey(lambda x,y: x[0]).collect()
TypeError: 'float' object has no attribute '__getitem__'
  • Please start with fixing indentation and formatting. – zero323 Sep 24 '15 at 16:09
  • It's probably an issue with initialization. When you do reduceByKey(lambda a, b: a+b) for example, the a is the accumulated value and b is the next element from the list (or RDD). So, when you reduce the first KV-pair, what is the accumulator? It's probably not a list, it's probably either None, or 0. But then None[0] or 0[0] doesn't make sense so Python/Spark complains. Think of the a as the "running total" and b as the next item to process. – TravisJ Sep 24 '15 at 18:46
  • Hi TravisJ, thanks for your help. I understand it better now. – Peter Sep 24 '15 at 19:33
  • >>> y2.reduceByKey(lambda (x), y: x[0]+y[0]).collect() [((12.0, 112.0, 16.0), 82063.0)] Seems to work. – Peter Sep 24 '15 at 19:38
  • >>> y2.reduceByKey(lambda x, y: (x[1]+y[1], 0)).collect() [((12.0, 112.0, 16.0), (14913.0, 0))], also seems to work. – Peter Sep 24 '15 at 19:52
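
TravisJ's running-total picture can be checked outside Spark: reduceByKey repeatedly merges two values for the same key, and whatever the lambda returns becomes the left-hand argument of the next merge. A minimal sketch with Python's built-in reduce (no Spark involved; the three sample values are taken from the mapped output above) reproduces the same TypeError:

from functools import reduce  # built in on Python 2, explicit import on Python 3

# Three of the values that share the key (12.0, 112.0, 16.0), as produced by the map step
vals = [(7469.0, 41252.0), (7444.0, 40811.0), (7446.0, 40778.0)]

# lambda x, y: x keeps the running value a tuple, so every merge succeeds
print reduce(lambda x, y: x, vals)      # (7469.0, 41252.0)

# lambda x, y: x[0] returns a float after the first merge; the next merge then
# tries to index that float and raises
# TypeError: 'float' object has no attribute '__getitem__'
print reduce(lambda x, y: x[0], vals)

The same applies to lambda x,y: x[0]+y[0]: it happens to work when exactly two values are merged for a key, but as soon as a third value comes in, the left-hand argument is already a float and indexing it fails.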

1 Answer

Using pyspark:

>>> t1=[
...   [12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],
...   [12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0],
... ]
>>> t1  
[[12.095850944519043, 111.84786987304688, 41252.0, 7469.0, 16.0],[12.094693183898926, 111.84053802490234, 40811.0, 7444.0, 16.0]]  
>>> x=sc.parallelize(t1)  
>>> y2=x.map(lambda (lat, lon, m6_rad, m2_rad, dn):((round(lat,0),round(lon,0),dn), (m6_rad, m2_rad)))  
>>> y2.collect()  
[((12.0, 112.0, 16.0), (41252.0, 7469.0)), ((12.0, 112.0, 16.0), (40811.0, 7444.0))]  
>>> y2.reduceByKey(lambda (x), y: x[0]+y[0]).collect()
[((12.0, 112.0, 16.0), 82063.0)]
>>>

Or you can do:

>>> y2.reduceByKey(lambda x, y: (x[0]+y[0], 0)).collect()
[((12.0, 112.0, 16.0), (82063.0, 0))]
>>> y2.reduceByKey(lambda x, y: (x[1]+y[1], 0)).collect()
[((12.0, 112.0, 16.0), (14913.0, 0))]
>>>

Not sure which is the "best" way, but it's producing what I'm after.

Would it be "better" to implement the map differently?
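
One possibility (a sketch against the y2 and x defined above, not an actual run): have the reducer return the same (m6_rad, m2_rad) shape as its inputs, so it stays correct no matter how many values get merged for a key, and both radiances are summed in one pass:

# Reducer output keeps the same two-element shape as its inputs
y2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
# expected: [((12.0, 112.0, 16.0), (82063.0, 14913.0))]

Or, if only one radiance is needed, map the value down to that single number first (y3 here is a hypothetical name) and the reducer becomes plain addition:

# Hypothetical y3: value is just m2_rad
y3 = x.map(lambda (lat, lon, m6_rad, m2_rad, dn): ((round(lat, 0), round(lon, 0), dn), m2_rad))
y3.reduceByKey(lambda a, b: a + b).collect()
# expected: [((12.0, 112.0, 16.0), 14913.0)]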
