I have an RDD of the following form:

rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
                      (4, [49.98, 299.95, 150.0, 199.92]),
                      (8, [179.97, 299.95, 199.92, 50.0]),
                      (10, [199.99, 99.96, 129.99, 21.99, 199.99]),
                      (12, [299.98, 100.0, 149.94, 499.95, 250.0])])

I need it to be flattened to this form:

2,199.99
2,250.0
2,129.99
4,49.98
4,299.95
...

It also has to be ordered by either the first or the second field.

How can I achieve that?

Thanks.

user3689574
juamd

1 Answer

You can use flatMap to emit one (key, value) pair per list element, then sortBy to order the pairs by key and then by value:

rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
                      (4, [49.98, 299.95, 150.0, 199.92]),
                      (8, [179.97, 299.95, 199.92, 50.0]),
                      (10, [199.99, 99.96, 129.99, 21.99, 199.99]),
                      (12, [299.98, 100.0, 149.94, 499.95, 250.0])])

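# Pair each value with its key, then sort by (key, value).
# print() is called as a function so this also runs on Python 3.
print(rdd.flatMap(lambda x: [(x[0], y) for y in x[1]])
         .sortBy(lambda x: (x[0], x[1]))
         .collect())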

[(2, 129.99), (2, 199.99), (2, 250.0), (4, 49.98), (4, 150.0), (4, 199.92), (4, 299.95), (8, 50.0), (8, 179.97), (8, 199.92), (8, 299.95), (10, 21.99), (10, 99.96), (10, 129.99), (10, 199.99), (10, 199.99), (12, 100.0), (12, 149.94), (12, 250.0), (12, 299.98), (12, 499.95)]
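
The question also asks for ordering by the second field and for comma-separated lines, which the snippet above does not show. Here is a rough plain-Python sketch of the same flatten-then-sort logic that runs without a Spark context. The helper names `flatten_pairs` and `format_lines` are made up for illustration and are not part of any Spark API.

```python
# Plain-Python sketch of the flatMap + sortBy logic (no Spark needed).
# flatten_pairs and format_lines are hypothetical helper names.

data = [
    (2, [199.99, 250.0, 129.99]),
    (4, [49.98, 299.95, 150.0, 199.92]),
    (8, [179.97, 299.95, 199.92, 50.0]),
    (10, [199.99, 99.96, 129.99, 21.99, 199.99]),
    (12, [299.98, 100.0, 149.94, 499.95, 250.0]),
]


def flatten_pairs(records):
    """Emit one (key, value) tuple per list element, like flatMap does."""
    return [(key, value) for key, values in records for value in values]


def format_lines(pairs):
    """Render each pair as a 'key,value' string."""
    return ["{},{}".format(key, value) for key, value in pairs]


pairs = flatten_pairs(data)

# Tuples compare element by element, so this orders by key, then value.
by_key = sorted(pairs)

# Order by the second field (the value) instead.
by_value = sorted(pairs, key=lambda kv: kv[1])

for line in format_lines(by_key)[:5]:
    print(line)
```

On an actual RDD, the equivalent steps would probably be `rdd.flatMapValues(lambda v: v)` for the flattening, `.sortBy(lambda kv: kv[1])` to order by the second field, and a final `.map(lambda kv: "{},{}".format(kv[0], kv[1]))` to produce the comma-separated lines.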

user3689574