
I am trying to follow the example given here for combining two dataframes without a shared join key (combining by "index", as in a database table or pandas dataframe, except that PySpark does not have that concept).

My Code

left_df = left_df.repartition(right_df.rdd.getNumPartitions()) # FWIW, num of partitions = 303
joined_schema = StructType(left_df.schema.fields + right_df.schema.fields)
interim_rdd = left_df.rdd.zip(right_df.rdd).map(lambda x: x[0] + x[1])
full_data = spark.createDataFrame(interim_rdd, joined_schema)

This all seems to work fine. I am testing it out using Databricks, and I can run the "cell" above with no problem. But when I go to save it, I am unable to, because it complains that the partitions do not match (???). I have confirmed that the numbers of partitions match, and you can also see above that I am explicitly making sure they match. My save command:

full_data.write.parquet(my_data_path, mode="overwrite")

Error

I receive the following error:

Caused by: org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition

My Guess

I suspect the problem is that, even though I have matched the number of partitions, I do not have the same number of rows in each partition. But I do not know how to ensure that. I only know how to specify the number of partitions, not the way the data are partitioned.

Or, more specifically, I do not know how to specify the partitioning when there is no column I can use. Remember, they have no shared column.
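A quick way to check that suspicion, using the dataframes from the code above (zip only succeeds when the two lists below match element for element, not just in total length):

# sketch: count the rows in every partition of each dataframe
left_counts = left_df.rdd.glom().map(len).collect()
right_counts = right_df.rdd.glom().map(len).collect()
print(left_counts)
print(right_counts)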


How do I know that I can combine them this way, with no shared join key? In this case, it is because I am trying to join model predictions with input data, but I actually have this case more generally, in situations beyond just model data + predictions.

My Questions

  1. Specifically in the case above, how can I properly set up the partitioning so that it works?
  2. How should I join two dataframes by row index?
    • (I know the standard response is "you shouldn't... partitioning makes indices nonsensical", but until Spark provides ML libraries that do not force data loss, as I described in the link above, this will always be an issue.)
Mike Williamson
  • How big are your datasets? If they are not too big, a low-tech approach would be to write both of them into csv files and then use [paste](https://unix.stackexchange.com/questions/16443/combine-text-files-column-wise) to combine them. – werner Sep 04 '20 at 14:31
  • They are quite big, but not too big to put into CSV... as long as I am reading them in line-by-line, which it sounds like paste does. But it just seems so contrary to the purpose of going with Spark. I will definitely consider it, if there is nothing more graceful. – Mike Williamson Sep 07 '20 at 07:05

2 Answers


RDDs are old hat, but let me answer the error from that perspective.

From La Trobe University (http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#zip), the following:

Joins two RDDs by combining the i-th of either partition with each other. The resulting RDD will consist of two-component tuples which are interpreted as key-value pairs by the methods provided by the PairRDDFunctions extension.

Note pair.

This means both RDDs must have the same partitioning: the same number of partitions and the same number of key-value pairs per partition, or the definition above does not hold.

This is best applied when reading in from files, as repartition(n) may not give the same distribution on both sides.

A little trick to get around that is to use zipWithIndex to generate the k of a (k, v) pair, like so (in Scala, as this is not a PySpark-specific aspect):

val rddA = sc.parallelize(Seq(
  ("ICCH 1", 10.0), ("ICCH 2", 10.0), ("ICCH 4", 100.0), ("ICCH 5", 100.0)
))
// move the generated index into key position, then repartition
val rddAA = rddA.zipWithIndex().map(x => (x._2, x._1)).repartition(5)

// same number of elements as rddA; zip still needs equal counts
val rddB = sc.parallelize(Seq(
  (10.0, "A"), (64.0, "B"), (39.0, "A"), (9.0, "C")
))
val rddBB = rddB.zipWithIndex().map(x => (x._2, x._1)).repartition(5)

// zip by position and flatten, keeping the left index and dropping the duplicate right index
val zippedRDD = (rddAA zip rddBB).map { case ((id, x), (_, (y, c))) => (id, x, y, c) }
zippedRDD.collect

The repartition(n) then seems to work, as the k is of the same type on both sides.

But you must have the same number of elements per partition. It is what it is, but it makes sense.
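Since the question is in PySpark, here is a rough, untested sketch of the same trick translated to Python, reusing the question's left_df, right_df, joined_schema and spark. The same requirement still applies: after repartitioning, both sides must end up with equal element counts per partition for zip to succeed.

# sketch only: move the zipWithIndex index into key position, then repartition both sides the same way
left_kv = left_df.rdd.zipWithIndex().map(lambda x: (x[1], x[0])).repartition(303)
right_kv = right_df.rdd.zipWithIndex().map(lambda x: (x[1], x[0])).repartition(303)

# drop the index keys again and concatenate the two Rows into one tuple
interim_rdd = left_kv.zip(right_kv).map(lambda p: tuple(p[0][1]) + tuple(p[1][1]))
full_data = spark.createDataFrame(interim_rdd, joined_schema)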

thebluephantom
  • I could not get this to work. There was always a complaint that the partitions were mismatched. Most recent error after several trials: `File "/databricks/spark/python/pyspark/sql/types.py", line 1387, in verify_struct "length of fields (%d)" % (len(obj), len(verifiers)))) ValueError: Length of object (2) does not match with length of fields (31)` – Mike Williamson Sep 07 '20 at 07:06

You can temporarily switch to RDDs and add an index with zipWithIndex. This index can then be used as a join criterion:

# create RDDs with an additional index;
# as zipWithIndex adds the index as the second element, we have to swap
# the first and second elements so the index becomes the key
left = left_df.rdd.zipWithIndex().map(lambda a: (a[1], a[0]))
right = right_df.rdd.zipWithIndex().map(lambda a: (a[1], a[0]))

# join both RDDs on the generated index
joined = left.fullOuterJoin(right)

#restore the original columns
result = spark.createDataFrame(joined).select("_2._1.*", "_2._2.*")

The Javadoc of zipWithIndex states that

Some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition.

Depending on the nature of the original datasets, this code might not produce deterministic results.
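If each dataframe happens to have some column of its own that defines a stable row order (an assumption; the column names below are placeholders, not columns from the question), one hedge against this is to sort both sides before zipWithIndex, so the index assignment is reproducible across runs:

# placeholder column names; assumes each side has its own ordering column
left = left_df.orderBy("left_order_col").rdd.zipWithIndex().map(lambda a: (a[1], a[0]))
right = right_df.orderBy("right_order_col").rdd.zipWithIndex().map(lambda a: (a[1], a[0]))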

werner
  • Thank you! This makes sense, but I am worried about the non-determinism. :( – Mike Williamson Sep 04 '20 at 08:44
  • Go with my answer and it will be deterministic. – thebluephantom Sep 04 '20 at 08:57
  • I actually tried both, but focused upon @thebluephantom's because of the determinism. But I could not get it to work. See error I posted in comment to his. Thanks for all of the help! – Mike Williamson Sep 07 '20 at 07:08
  • @MikeWilliamson determinism is still based on row position. I have observed that zipWithIndex distributes to the same partition when repartitioning with the int as the key of the key-value pair, and it leaves the incoming data in the same place, as it is simply a narrow transformation. That said, you must have the same number of elements in each partition, else you have no use case for a zipped RDD (not DF). – thebluephantom Sep 07 '20 at 07:33