
I'm working on some Spark code, and I keep getting this error:

TypeError: 'float' object is not iterable

on the line with the reduceByKey() call. Can someone help me? This is the stack trace of the error:

d[k] = comb(d[k], v) if k in d else creator(v)
  File "/home/hw/SC/SC_spark.py", line 535, in <lambda>
TypeError: 'float' object is not iterable

Here is the code:

def field_valid(m):
    dis=m[1]
    TxP=m[2]
    ef=m[3]
    pl=m[4]
    if TxP != 'NaN' and dis != 'NaN' and ef != 'NaN' and pl != 'NaN':
        return True
    else:
        return False

def parse_input(d):
    #d=data.split(',')

    s_name='S'+d[6] # serving cell name

    if d[2] =='NaN' or d[2] == '':
        ef='NaN'
    else:
        ef=float(d[2].strip().rstrip())

    if d[7] =='NaN' or d[7] == '' or d[7] == '0':
        TxP='NaN'
    else:
        TxP=float(d[7].strip().rstrip())

    if d[9] =='NaN' or d[9] == '':
        dis='NaN'
    else:
        dis=float(d[9].strip().rstrip())

    if d[10] =='NaN' or d[10] == '':
        pl='NaN'
    else:
        pl=float(d[10].strip().rstrip())

    return s_name, dis, TxP, ef, pl


sc=SparkContext(appName="SC_spark")
lines=sc.textFile(ip_file)
lines=lines.map(lambda m: (m.split(",")))
lines=lines.filter(lambda m: (m[6] != 'cell_name'))
my_rdd=lines.map(parse_input).filter(lambda m: (field_valid(m)==True))
my_rdd=my_rdd.map(lambda x: (x[0],(x[1],x[2])))
my_rdd=my_rdd.reduceByKey(lambda x,y:(max(x[0],y[0]),sum(x[1],y[1])))  # this line raises the error

Here is some sample data:


Class,PB,EF,RP,RQ,ID,cell_name,TxP,BW,DIS,PL,geom
NaN,10,5110,-78.0,-7.0,134381669,S417|134381669|5110,62.78151250383644,10,2578.5795095469166,113.0,NaN
NaN,10,5110,-71.0,-6.599999904632568,134381669,S417|134381669|5110,62.78151250383644,10,2689.630258510342,106.0,NaN
NaN,10,5110,-77.0,-7.300000190734863,134381669,S417|134381669|5110,62.78151250383644,10,2907.8184899249713,112.0,19.299999999999983
NaN,10,5110,-91.0,-11.0,134381669,S417|134381669|5110,62.78151250383644,10,2779.96762695867,126.0,5.799999999999997
NaN,10,5110,-90.0,-12.69999980926514,134381669,S417|134381669|5110,62.78151250383644,10,2749.8351648579583,125.0,9.599999999999994
NaN,10,5110,-95.0,-13.80000019073486,134381669,S417|134381669|5110,62.78151250383644,10,2942.7938902934643,130.0,-2.4000000000000057
NaN,10,5110,-70.0,-7.099999904632568,134381669,S417|134381669|5110,62.78151250383644,10,3151.930706017461,105.0,22.69999999999999
Helen Z
    I am not familiar with `pyspark`, but in the line where the error occurs you call `sum` with two arguments. Unless the first one is an iterable and the second an int, your error is probably there. Try calling `sum(1.0, 2)` on a python console. It gives me a very similar error. – bla Apr 22 '18 at 06:07
  • Hi @bla, I just tested it out and made sure all fields are converted to float. As you noticed, I filtered out the lines with NaN in those values, so the numbers are floats only. I also checked the syntax of the lambda function; I separate it into (k, v). I didn't find anything wrong. Did you find anything wrong? – Helen Z Apr 22 '18 at 06:14
  • What exactly is `m.split(",")` doing? You have no commas in the data – OneCricketeer Apr 22 '18 at 06:15
  • @HelenZ you cannot pass a float as the first argument of `sum`. It expects an iterable. Check it out: https://docs.python.org/3.5/library/functions.html#sum. I cannot confirm that this is the case, since I am not sure `x[1]` is a float. But the stack traces are very similar. – bla Apr 22 '18 at 06:19
  • Hi, @cricket_007, .split(",") splits the lines by comma. I copied the data from csv file. Let me edit it to notepad format. – Helen Z Apr 22 '18 at 06:19
  • Your error is that `sum()` is a built-in function; you're not using a Spark function there. That would also explain why `sum(1.0, 2)` fails by itself, since the `sum` function requires an iterable, and your `x[1]` is a single value. Try `x[1] + y[1]` if you are trying to sum your RDD column... Alternatively, I suggest using SparkSQL sum functions – OneCricketeer Apr 22 '18 at 06:19
  • @cricket_007, what do you mean x[1] is a single value? – Helen Z Apr 22 '18 at 06:22
  • @HelenZ `x[1]` is a float and therefore not an iterable (like lists or sets). Since `sum` expects an iterable it raises an error when a non iterable value is passed. – bla Apr 22 '18 at 06:25
  • As part of `lambda x,y`, the `x` value is not a list, tuple, or other collection... As the error says, it is a single floating point number... Is there a specific reason you're using RDDs or Spark1 functions instead of Spark2 with its built-in CSV reader? Also, what is expected output here? – OneCricketeer Apr 22 '18 at 06:25
  • 1
  • Hi @cricket_007. BTW, I just changed it to x[1]+y[1], and it works!! I'm new to Spark and can't distinguish Spark 1 and Spark 2 yet. Can you tell me how to do it in Spark 2? The expected result is the sum and max of the value 'dis' for each key, and the key is the column 'cell_name'. – Helen Z Apr 22 '18 at 06:32
  • Hi, @bla thank you. – Helen Z Apr 22 '18 at 06:34
  • You are welcome. :) – bla Apr 22 '18 at 06:36
  • Hi @cricket_007, just realized our server uses spark-1.4. But thank you for great help. – Helen Z Apr 22 '18 at 06:46
  • You can run Spark2 code against the same YARN or Mesos cluster as Spark1. Not sure about a standalone scheduler – OneCricketeer Apr 22 '18 at 06:48

1 Answer


the expected result is sum and max of value

In that case, you are looking for x[1] + y[1], not the built-in sum() function, which expects an iterable as its first argument.
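You can reproduce the same error outside Spark, as noted in the comments, because the built-in sum(iterable, start) needs an iterable as its first argument:

>>> sum(1.0, 2.0)
TypeError: 'float' object is not iterable

The corrected reducer then becomes: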

my_rdd.reduceByKey( lambda x,y: ( max(x[0],y[0]), x[1] + y[1] ) )
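Since the comments ask about Spark 2: here is a minimal sketch of the same aggregation with the DataFrame API and its built-in CSV reader. This assumes Spark 2.x, the column names from the sample data above, and it simplifies the NaN filtering compared to the original parse_input/field_valid logic:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("SC_spark").getOrCreate()

# Read the CSV using its header row; column names come from the sample data
df = spark.read.csv(ip_file, header=True, inferSchema=True)

# Drop rows where DIS is NaN, then aggregate per cell
result = (df.filter(~F.isnan("DIS"))
            .groupBy("cell_name")
            .agg(F.max("DIS").alias("max_dis"),
                 F.sum("DIS").alias("sum_dis")))

result.show()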
OneCricketeer
  • Hi @cricket_007, can I ask another question? Now I want to save the result into a .txt file, but I want to add a header to the .txt file. How should I do it? I used this statement: my_rdd.repartition(1).saveAsTextFile("sc_result/result.txt") – Helen Z Apr 22 '18 at 06:58
  • You need to union your RDD with a header RDD. https://stackoverflow.com/questions/26157456/add-a-header-before-text-file-on-save-in-spark – OneCricketeer Apr 22 '18 at 07:01
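Based on the approach in that link, a minimal sketch, assuming the reduced RDD still holds (cell_name, (max_dis, sum_dis)) pairs; the header text here is a placeholder, and note that saveAsTextFile writes a directory of part files rather than a single .txt file:

# Hypothetical header line matching the (cell_name, (max_dis, sum_dis)) structure
header = sc.parallelize(["cell_name,max_dis,sum_dis"])

# Render each record as a single CSV line so it can be unioned with the header
rows = my_rdd.map(lambda kv: "{0},{1},{2}".format(kv[0], kv[1][0], kv[1][1]))

# The header partition comes first in the union; coalesce(1) merges partitions in order
header.union(rows).coalesce(1).saveAsTextFile("sc_result/result.txt")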