
I'm getting the error "int object is unsubscriptable" when executing the following script:

element.reduceByKey( lambda x , y : x[1]+y[1])

where element is a key-value RDD whose values are tuples. Example input:

(A, (toto , 10))
(A, (titi , 30))
(5, (tata, 10))
(A, (toto, 10))

I understand that the reduceByKey function takes (K,V) tuples and applies a function to all the values of each key to get the final result of the reduce, like the example given in ReduceByKey Apache.

Any help please?

Shaido
Eliane PDC
    What output do you want? The problem is that `x[1]+y[1]` is an int, not a tuple (which is what `reduceByKey` expects in the next iteration). – Shaido Jan 16 '18 at 06:17
    The expected output is `(A, 50) (5, 10)`, but why should `reduceByKey` expect a tuple in the next iteration? Should it keep the same type as the values being reduced? – Eliane PDC Jan 16 '18 at 16:34

2 Answers


Here is an example that will illustrate what's going on.

Let's consider what happens when you call reduce on a list with some function f:

reduce(f, [a,b,c]) = f(f(a,b),c)

If we take your example, f = lambda u, v: u[1] + v[1], then the above expression breaks down into:

reduce(f, [a,b,c]) = f(f(a,b),c) = f(a[1]+b[1],c)

But a[1] + b[1] is an integer, and integers have no __getitem__ method, so the outer call fails when it evaluates u[1] on that integer. Hence your error.
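The same failure can be reproduced without Spark. The following is a minimal sketch using Python's built-in functools.reduce as a stand-in for what happens to the values of a single key; the `values` list and the `error_message` variable are illustrative names, not part of the original question.

```python
from functools import reduce

# The reduce function from the question: it takes two tuples
# but returns a bare int.
f = lambda u, v: u[1] + v[1]

# Values for key 'A' from the example input.
values = [('toto', 10), ('titi', 30), ('toto', 10)]

try:
    reduce(f, values)
except TypeError as exc:
    # First call:  f(('toto', 10), ('titi', 30)) returns 40.
    # Second call: f(40, ('toto', 10)) evaluates 40[1], which raises.
    error_message = str(exc)
    print(error_message)
```

The first call succeeds; it is only the second call, which receives the int returned by the first, that raises the TypeError.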

In general, the better approach (as shown below) is to use map() to first extract the data in the format that you want, and then apply reduceByKey().


A MCVE with your data

element = sc.parallelize(
    [
        ('A', ('toto' , 10)),
        ('A', ('titi' , 30)),
        ('5', ('tata', 10)),
        ('A', ('toto', 10))
    ]
)

You can almost get your desired output with a more sophisticated reduce function:

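def add_tuple_values(a, b):
    # Each argument is either an original (name, count) tuple or an int
    # partial sum from an earlier call; indexing an int raises TypeError.
    try:
        u = a[1]
    except TypeError:
        u = a
    try:
        v = b[1]
    except TypeError:
        v = b
    return u + v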

print(element.reduceByKey(add_tuple_values).collect())

Except that this results in:

[('A', 50), ('5', ('tata', 10))]

Why? Because there's only one value for the key '5', so there is nothing to reduce.
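This behaviour can also be seen locally. The sketch below again uses functools.reduce as a stand-in for the per-key reduction and records every call to the reduce function; the `calls` list and the `single` variable are hypothetical names used only for this illustration.

```python
from functools import reduce

calls = []  # records every pair of arguments passed to f

def f(u, v):
    calls.append((u, v))
    return u[1] + v[1]

# Only one value, as for key '5' in the example input.
single = reduce(f, [('tata', 10)])

print(single)  # the original tuple, returned unchanged
print(calls)   # f was never called
```

With a single value the function is never invoked, so no amount of logic inside it can change the shape of the result.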

For these reasons, it's best to first call map. To get your desired output, you could do:

>>> print(element.map(lambda x: (x[0], x[1][1])).reduceByKey(lambda u, v: u+v).collect())
[('A', 50), ('5', 10)]

Update 1

Here's one more approach:

You could create tuples in your reduce function, and then call map to extract the value you want. (Essentially reverse the order of map and reduce.)

print(
    element.reduceByKey(lambda u, v: (0,u[1]+v[1]))
        .map(lambda x: (x[0], x[1][1]))
        .collect()
)
[('A', 50), ('5', 10)]
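The reason the tuple-returning reduce function works is that its output has the same shape as its inputs, so it can be fed back into itself any number of times, and a value that is never reduced still has the same shape. Here is a rough local sketch of that idea using plain Python; the `groups` dictionary and the `keep_shape` function are assumptions made for illustration and only mimic what Spark does per key.

```python
from functools import reduce

def keep_shape(u, v):
    # Return a tuple so the result can be passed back in as u or v.
    # The first slot is a placeholder; only index 1 carries the sum.
    return (0, u[1] + v[1])

# Values grouped by key, mirroring the example input.
groups = {
    'A': [('toto', 10), ('titi', 30), ('toto', 10)],
    '5': [('tata', 10)],
}

# Reduce each group, then extract index 1 (the "map" step).
totals = {key: reduce(keep_shape, vals)[1] for key, vals in groups.items()}
print(totals)
```

Extracting index 1 afterwards works for both keys, including '5', whose single value was never passed through the reduce function.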

Notes

  • Had there been at least 2 records for each key, using add_tuple_values() would have given you the correct output.
pault

Another approach is to use a DataFrame:

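from pyspark.sql import functions as F  # F.sum is the column aggregate, not Python's built-in sum
rdd = sc.parallelize([('A', ('toto', 10)),('A', ('titi', 30)),('5', ('tata', 10)),('A', ('toto', 10))])
# Flatten to (a, b, c) by indexing; tuple-unpacking lambdas only work in Python 2.
rdd.map(lambda x: (x[0], x[1][0], x[1][1])).toDF(['a','b','c']).groupBy('a').agg(F.sum('c')).rdd.map(tuple).collect()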

>>>[(u'5', 10), (u'A', 50)]
Bala