
I'm super new to pyspark and RDDs. Apologies if this question is very rudimentary.

I have mapped and cleaned my data using the following code:

delay = datasplit.map(lambda x: (x[33], x[8], x[9])) \
                 .filter(lambda x: x[0] != u'0.00') \
                 .filter(lambda x: x[0] != '')

but now I need to somehow convert it into the following output:

(124, u'"OO""N908SW"')
(432, u'"DL""N810NW"')

where the first element is the sum of x[33] above, grouped by the combination of x[8] and x[9].

I've completed the mapping and get the output below (which is close):

lines = delay.map(lambda x: (float(x[0]), [x[1], x[2]]))

Output:

[(-10.0, [u'OO', u'N908SW']), (62.0, [u'DL', u'N810NW']), (-6.0, [u'WN', u'N7811F'])]

but I can't figure out how to reduce or combine x[1] and x[2] to create the output shown above.

Thanks in advance.

2 Answers


You can create a key as below, then apply reduceByKey, and finally map to get a unified key:

from operator import add
result = delay.map(lambda x: ((x[1], x[2]), float(x[0]))) \
              .reduceByKey(add) \
              .map(lambda x: (x[1], x[0][0] + x[0][1]))
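
To see it end to end, here is a minimal sketch on a tiny hand-made RDD (the sample rows and the SparkContext setup are illustrative, not from the original data):

from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical rows shaped like the cleaned `delay` RDD: (delay, carrier, tail number)
delay = sc.parallelize([
    ('-10.00', 'OO', 'N908SW'),
    ('62.00',  'DL', 'N810NW'),
    ('72.00',  'DL', 'N810NW'),
])

result = delay.map(lambda x: ((x[1], x[2]), float(x[0]))) \
              .reduceByKey(add) \
              .map(lambda x: (x[1], x[0][0] + x[0][1]))

print(result.collect())
# e.g. [(-10.0, 'OON908SW'), (134.0, 'DLN810NW')]  (order may vary)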
OmG

As a general rule of thumb, you want as few Python operations as possible.

I reduced your code to one map and one reduce.

import operator

# Key by carrier + tail number; guard against empty delay strings
# (any('') is False, so empty values fall back to 0.0), then sum per key.
delay_sum = datasplit\
    .map(lambda x: (x[8] + x[9], float(x[33]) if any(x[33]) else 0.0))\
    .reduceByKey(operator.add)

And it goes without saying that these kinds of operations usually run faster when using Spark DataFrames.
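
For example, a rough DataFrame equivalent might look like the sketch below (the column names delay, carrier, and tail_num are assumed, since the question doesn't show the schema):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed schema and sample rows; replace with however datasplit is actually loaded
df = spark.createDataFrame(
    [('-10.00', 'OO', 'N908SW'), ('62.00', 'DL', 'N810NW'), ('72.00', 'DL', 'N810NW')],
    ['delay', 'carrier', 'tail_num'],
)

delay_sum = (
    df.where(F.col('delay') != '')
      .groupBy('carrier', 'tail_num')
      .agg(F.sum(F.col('delay').cast('double')).alias('total_delay'))
)
delay_sum.show()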

Uri Goren
  • It can be erroneous, as you assume `x[8] + x[9]` maps one-to-one to the pair `(x[8], x[9])`, but that's not true in general. – OmG Dec 02 '18 at 11:25
  • This is what the OP requested. This is a solution for his specific case and assumptions, not for any general unspecified case. – Uri Goren Dec 02 '18 at 12:35
  • The OP just provides a sample! Hence, this code only works for that sample, nothing more. – OmG Dec 03 '18 at 09:20