
I'm super new to pyspark and RDDs. Apologies if this question is very rudimentary.

I have mapped and cleaned my data using the following code:

delay = datasplit.map(lambda x: (x[33], x[8], x[9])) \
                 .filter(lambda x: x[0] != u'0.00') \
                 .filter(lambda x: x[0] != '')

but now I need to somehow convert it into the following output:

(124, u'"OO""N908SW"')
(432, u'"DL""N810NW"')

where the first element is the sum of x[33] above, grouped by the combination of x[8] and x[9].

I've completed the mapping and get the output below (which is close):

lines = delay.map(lambda x: (float(x[0]), [x[1], x[2]]))

Output:

[(-10.0, [u'OO', u'N908SW']), (62.0, [u'DL', u'N810NW']), (-6.0, [u'WN', u'N7811F'])]

but I can't figure out how to reduce or combine x[1] and x[2] to create the output shown above.

Thanks in advance.

2 Answers


You can create a key as below, then apply reduceByKey, and finally map to get a unified key:

from operator import add
result = delay.map(lambda x: ((x[1], x[2]), float(x[0]))) \
              .reduceByKey(add) \
              .map(lambda x: (x[1], x[0][0] + x[0][1]))
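
To see it end to end, here is a minimal sketch on a tiny hand-made RDD (the sample rows and the SparkContext setup are illustrative, not from the original data):

from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical rows shaped like the cleaned `delay` RDD: (delay, carrier, tail number)
delay = sc.parallelize([
    ('-10.00', 'OO', 'N908SW'),
    ('62.00',  'DL', 'N810NW'),
    ('72.00',  'DL', 'N810NW'),
])

result = delay.map(lambda x: ((x[1], x[2]), float(x[0]))) \
              .reduceByKey(add) \
              .map(lambda x: (x[1], x[0][0] + x[0][1]))

print(result.collect())
# e.g. [(-10.0, 'OON908SW'), (134.0, 'DLN810NW')]  (order may vary)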
OmG

As a general rule of thumb, you want as few Python operations as possible.

I reduced your code to one map and one reduce.

import operator

# Key by carrier + tail number; guard against empty delay strings
# (any('') is False, so empty values fall back to 0.0), then sum per key.
delay_sum = datasplit\
    .map(lambda x: (x[8] + x[9], float(x[33]) if any(x[33]) else 0.0))\
    .reduceByKey(operator.add)

And it goes without saying that these kinds of operations usually run faster when using Spark DataFrames.
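
For example, a rough DataFrame equivalent might look like the sketch below (the column names delay, carrier, and tail_num are assumed, since the question doesn't show the schema):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed schema and sample rows; replace with however datasplit is actually loaded
df = spark.createDataFrame(
    [('-10.00', 'OO', 'N908SW'), ('62.00', 'DL', 'N810NW'), ('72.00', 'DL', 'N810NW')],
    ['delay', 'carrier', 'tail_num'],
)

delay_sum = (
    df.where(F.col('delay') != '')
      .groupBy('carrier', 'tail_num')
      .agg(F.sum(F.col('delay').cast('double')).alias('total_delay'))
)
delay_sum.show()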

Uri Goren
  • It can be erroneous, as you assume `x[8] + x[9]` maps one-to-one to the pair `(x[8], x[9])`, but that's not true in general. – OmG Dec 02 '18 at 11:25
  • This is what the OP requested. This is a solution for his specific case and assumptions, not for any general unspecified case. – Uri Goren Dec 02 '18 at 12:35
  • The OP just provides a sample! Hence, this code only works for that sample, nothing more. – OmG Dec 03 '18 at 09:20