Can't Zip RDDs with unequal number of partitions. What can I use as an alternative to zip?

Question

I have three RDDs of the same size: rdd1 contains a String identifier, rdd2 contains a vector, and rdd3 contains an integer value.

Essentially I want to zip those three together to get an RDD[(String, Vector, Int)], but I keep getting "Can't zip RDDs with unequal number of partitions". How can I bypass zip entirely and still combine them this way?

What is supposed to happen for the records in rdd1 that can't be matched with a record in rdd2 or rdd3? Do you want to drop the extra records, use some sort of null value, or something else? Also, how is the data in these RDDs related to each other? – puhlen, Nov 03 '16 at 15:53
They are all values related to a specific entry. All RDDs have the exact same size and there will always be a match. They are generated by transforming the entries of one RDD repeatedly. – Mnemosyne, Nov 03 '16 at 17:20
Rereading your question, I noticed that you are using zipPartitions instead of zip. zip requires only that the number of rows is the same. How do you know which row from rdd1 goes with which row of rdd2 and rdd3 though? – puhlen, Nov 03 '16 at 17:27
All RDDs are generated through transformation of a unique initial RDD. So I know that each RDD has the same number of values as the original and that Row 1 for each of them corresponds to Object 1 and so forth (actually rdd1 contains the sha256 identifier of each object). Now I want a final RDD that "zips"/"glues" together all those RDDs so each row represents data for one specific object that is specified by the String part. – Mnemosyne, Nov 03 '16 at 17:31
While zip would work for you, I don't think it's guaranteed that the RDDs will maintain the same ordering, depending on what kind of transformations you do, especially considering you are changing the partitioning, so they might get zipped back together wrong. To be sure, you could assign a unique id to each row of the original RDD before splitting, then use that id as a key to join the RDDs back together. – puhlen, Nov 03 '16 at 17:49

score 7 · Accepted Answer · edited Apr 13 '17 at 14:43


Try:

rdd1.zipWithIndex.map(_.swap).join(rdd2.zipWithIndex.map(_.swap)).values
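For the three-RDD case in the question, the same idea might be sketched as below. This is only an illustration under stated assumptions: the sample data, the `byIndex` helper, and the use of `Array[Double]` as a stand-in for the vector type are all hypothetical, and the approach relies on each RDD still having its elements in the original order (see the comments on the question).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Local context for illustration only; in spark-shell this returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("zip-by-index").setMaster("local[2]"))

// Hypothetical stand-ins for the question's RDDs, deliberately given
// different partition counts so that a plain zip would fail.
val rdd1: RDD[String] = sc.parallelize(Seq("id-a", "id-b", "id-c", "id-d"), 2)
val rdd2: RDD[Array[Double]] =
  sc.parallelize(Seq(Array(1.0), Array(2.0), Array(3.0), Array(4.0)), 3)
val rdd3: RDD[Int] = sc.parallelize(Seq(10, 20, 30, 40), 4)

// Hypothetical helper: key every element by its position,
// turning (value, index) into (index, value).
def byIndex[T](rdd: RDD[T]): RDD[(Long, T)] = rdd.zipWithIndex().map(_.swap)

// Join on the index, restore the original order, then flatten to a triple.
val zipped: RDD[(String, Array[Double], Int)] =
  byIndex(rdd1)
    .join(byIndex(rdd2))
    .join(byIndex(rdd3))
    .sortByKey()
    .map { case (_, ((id, vec), n)) => (id, vec, n) }
```

Note that join shuffles the data, so the sortByKey step is only needed if the original row order matters in the result.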

edited Apr 13 '17 at 14:43

Federico


answered Nov 03 '16 at 15:54

Tim · Answer 2 · 2016-11-03T18:09:51.413

1

Do they all have the same number of elements? zip is meant for the special case in which the RDDs have exactly the same number of partitions and exactly the same number of elements in each partition. (zipPartitions relaxes only the second condition; it still requires the same number of partitions.)

Your case has no such guarantees. What do you want to do in the case that rdd3 is actually empty? Should you get a resulting RDD with no elements?

Edit: If you know that the lengths are exactly the same, LostInOverflow's answer will work.
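To make the partition requirement concrete, here is a minimal sketch (the data and names are made up for illustration). Two RDDs derived from the same parent by map keep the parent's partitioning and zip cleanly; once one side is repartitioned to a different partition count, zip fails.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Local context for illustration only; in spark-shell this returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("zip-constraints").setMaster("local[2]"))

// Hypothetical parent RDD with 3 partitions.
val base = sc.parallelize(1 to 6, 3)

// map is applied partition by partition, so both children keep
// 3 partitions with the same number of elements in each.
val doubled = base.map(_ * 2)
val labels = base.map(n => s"row-$n")
val ok = labels.zip(doubled).collect()

// Repartitioning one side changes its partition count (3 vs 2) ...
val repartitioned = doubled.repartition(2)
val sameCounts = labels.getNumPartitions == repartitioned.getNumPartitions

// ... so zip's precondition no longer holds and the job fails.
val failed = Try(labels.zip(repartitioned).collect())
```

Comparing getNumPartitions on both sides, as in `sameCounts`, is a cheap check to run before attempting a zip.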

edited Nov 03 '16 at 18:09

answered Nov 03 '16 at 15:55

Tim


They definitely have the same number of elements, but I cannot make a statement about the number of partitions. And no RDD is empty. They all have one value per row. – Mnemosyne Nov 03 '16 at 17:18

score 1 · Answer 3 · answered Nov 03 '16 at 18:01


Before splitting up your original RDD, assign each row a unique id with RDD.zipWithUniqueId. Then make sure to include the id field in each of the RDDs you split from the original, and use it as the key for those rows (use keyBy if the id is not already the key). Finally, use RDD.join to recombine the rows.

An example might look like:

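// Key each row by a unique id: (row, id) becomes (id, row)
val rddWithKey = originalRdd.zipWithUniqueId().map(_.swap)
val rdd1 = rddWithKey.map { case (key, value) => key -> value.stringField }
val rdd2 = rddWithKey.map { case (key, value) => key -> value.intField }

/* transformations on rdd1 and rdd2 that keep the key */

// Rows are matched by id, regardless of ordering or partitioning
val recombined = rdd1.join(rdd2)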
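A self-contained variant for three derived RDDs, as in the question, could look like the sketch below. The tuple layout, the sample values, and the particular transformations are assumptions made for illustration; the point is that the join still lines the rows up after the pieces have been repartitioned differently.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell this returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("join-by-id").setMaster("local[2]"))

// Hypothetical original rows: (name, features, count).
val original = sc.parallelize(Seq(
  ("a", Array(1.0), 1),
  ("b", Array(2.0), 2),
  ("c", Array(3.0), 3)), 2)

// Attach the id once, before splitting, so every derived RDD carries it.
val keyed = original.zipWithUniqueId().map(_.swap)

// Derived RDDs keep the id as key; two of them change their partitioning.
val ids = keyed.mapValues { case (name, _, _) => name.toUpperCase }
val vectors = keyed.mapValues { case (_, features, _) => features.map(_ * 2) }.repartition(5)
val counts = keyed.mapValues { case (_, _, count) => count + 100 }.repartition(3)

// Join on the id, then drop it and flatten to (String, Array[Double], Int).
val recombined = ids.join(vectors).join(counts)
  .values
  .map { case ((id, vec), n) => (id, vec, n) }
```

The output order of a join is not defined, so sort the result explicitly if a particular order is needed.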

answered Nov 03 '16 at 18:01

puhlen


Would .zipWithIndex instead of .zipWithUniqueId work the same? – Mnemosyne Nov 03 '16 at 18:34
Yes, but unless you need the IDs to have no gaps between them you should prefer zipWithUniqueId. – puhlen Nov 03 '16 at 18:38
What do you mean by gaps? – Mnemosyne Nov 03 '16 at 19:05
A sequence without gaps might look like this: 1,2,3,4,5,6... All numbers included. A sequence with gaps might give you 1,3,4,5,8,12... so you're still guaranteed uniqueness but might be skipping some numbers. – puhlen Nov 03 '16 at 19:22
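The difference between the two methods can be seen in a small sketch (sample data is hypothetical). According to the Spark API documentation, zipWithUniqueId gives the items in the k-th of n partitions the ids k, n+k, 2n+k, and so on, and does not need to run a Spark job, whereas zipWithIndex has to trigger a job when the RDD has more than one partition in order to assign consecutive indices.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; in spark-shell this returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("id-gaps").setMaster("local[2]"))

// Five elements over two partitions, so the partitions are uneven.
val letters = sc.parallelize(Seq("a", "b", "c", "d", "e"), 2)

// zipWithIndex: consecutive indices starting at 0, in RDD order.
val indexIds = letters.zipWithIndex().collect().map(_._2).toSeq

// zipWithUniqueId: ids follow the k, n+k, 2n+k, ... pattern per partition,
// so they are unique but can skip values when partitions differ in size.
val uniqueIds = letters.zipWithUniqueId().collect().map(_._2).toSeq

println(s"zipWithIndex ids:    $indexIds")
println(s"zipWithUniqueId ids: $uniqueIds")
```

Either kind of id works as a join key; the gaps only matter if the ids are later used as positions.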
