
Given an RDD, what's the best way to sort it and then consume it in discrete, fixed-size chunks? For example:

  JavaRDD<Integer> baseRdd = sc.parallelize(Arrays.asList(1,2,5,3,4));

  JavaRDD<Integer> sorted = baseRdd.sortBy(x -> x, true, 5);   

  // returns 1, 2   
  List<Integer> first = sorted.take(2);

  // returns 1, 2.  How to skip 2 and then take?
  List<Integer> second = sorted.take(2);

What I would really like is to consume 1, 2 on the first call to take(2), and then have some sort of "skip" parameter that gets passed into the second take(2) so that it returns 3, 4.

Since that "skip" function doesn't seem to exist in the current RDD functionality, what would be the most efficient way to split up the sorted RDD into chunks of known size that can be independently acted on?

  • Do you only want elements at indexes (0, 1) and (2, 3) or would it be for all (n, n+1)? – Xavier Guihot Mar 19 '18 at 17:32
  • Not just (n, n+1). If I have an RDD with 75,000 entries, I would want the first 25,000 on the first call to take(), then entries 25001 to 50000 on the second call to take(), the remaining entries on the third, etc. The number 2 in my original question, as well as the number 25,000 here, are just examples. – Kyle Fransham Mar 19 '18 at 17:43

2 Answers


To make this efficient, don't forget you can cache your RDD at any point. Caching avoids recomputing the sorted RDD from its source every time we call take. Since we will be using the sorted RDD multiple times, we cache it:

JavaRDD<Integer> sorted = baseRdd.sortBy(x -> x, true, 5).cache();

Then, to take elements from a given index up to another index, we can combine zipWithIndex and filter. zipWithIndex transforms the RDD into an RDD of tuples, where the first part of each tuple is an element of the sorted RDD and the second part is its index. Once we have these indexed records, we can filter them by index (let's say offset = 2 and window = 2):

List<Integer> nth =
  sorted.zipWithIndex()
  .filter(x -> x._2() >= offset && x._2() < offset + window)
  .map(x -> x._1())
  .collect();

which returns:

[3, 4]

The final result would be:

JavaPairRDD<Integer, Long> sorted = baseRdd.sortBy(x -> x, true, 5).zipWithIndex().cache();

Integer offset = 2;
Integer window = 2;

List<Integer> nth =
  sorted
  .filter(x -> x._2() >= offset && x._2() < offset + window)
  .map(x -> x._1())
  .collect();

Here I've cached the RDD only after zipping it with its index, so that the zipping step isn't re-executed each time we run this action on a different window.

You can then wrap this nth-window snippet in a loop (or a map over offsets), depending on how you want to build the different window lists.
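To illustrate the shape of that loop, here is a minimal sketch of the same windowing logic on a plain List rather than an RDD (so it runs without a Spark cluster); the class and method names are mine, not Spark API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowChunks {

    // Split a sorted list into consecutive windows of a fixed size,
    // mirroring the zipWithIndex + filter approach above: each pass
    // keeps only the elements whose index falls in [offset, offset + window).
    static List<List<Integer>> chunks(List<Integer> sorted, int window) {
        List<List<Integer>> result = new ArrayList<>();
        for (int offset = 0; offset < sorted.size(); offset += window) {
            int end = Math.min(offset + window, sorted.size());
            result.add(new ArrayList<>(sorted.subList(offset, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> sorted = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(chunks(sorted, 2)); // [[1, 2], [3, 4], [5]]
    }
}
```

In the Spark version, the body of the loop would be the filter/map/collect snippet above, with offset advanced by window on each iteration; note that each window still costs a full pass over the cached RDD.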

– Xavier Guihot
rdd1=sc.parallelize((1,2,3,4,5,6,7,8))
rdd2=rdd1.take(2)

Now you filter your initial RDD to exclude the elements already taken in rdd2 (which take returned as a plain Python list):

rdd1.filter(lambda line:line not in rdd2).take(2)

This gives [3, 4]

Using PySpark

  • This takes a sort function with O(nlogn) complexity and turns it into an O(n^2) problem. With millions of entries in my RDD I can't do this. – Kyle Fransham Mar 19 '18 at 17:52
  • 1
    My other option would be to use `.zipWithUniqueId().filter(lambda x : x[1]>50)` Guess it wont work for your needs though, good luck! – André Caetano Mar 19 '18 at 18:17
  • 1
    Thanks Andre, your second option is roughly like @Xavier's answer below. Seems like a good approach! – Kyle Fransham Mar 19 '18 at 18:29