I am trying to follow the example given here for combining two dataframes that have no shared join key (combining by "index" as in a database table or pandas dataframe, except that PySpark does not have that concept):
My Code
from pyspark.sql.types import StructType

left_df = left_df.repartition(right_df.rdd.getNumPartitions())  # FWIW, num of partitions = 303
joined_schema = StructType(left_df.schema.fields + right_df.schema.fields)
interim_rdd = left_df.rdd.zip(right_df.rdd).map(lambda x: x[0] + x[1])  # Row + Row -> tuple of all fields
full_data = spark.createDataFrame(interim_rdd, joined_schema)
This all seems to work fine. I am testing it out in Databricks, and I can run the "cell" above with no problem. But when I go to save the result, I cannot, because it complains that the partitions do not match (???). I have confirmed that the number of partitions matches, and you can also see above that I am explicitly making sure they match. My save command:
full_data.write.parquet(my_data_path, mode="overwrite")
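I assume the error only shows up at save time because everything above is lazy and the write is the first action that actually evaluates the zip; if that assumption is right, forcing any other action should hit the same error without writing anything:

full_data.count()  # should raise the same SparkException if the per-partition counts differ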
Error
I receive the following error:
Caused by: org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
My Guess
I suspect the problem is that, even though I have matched the number of partitions, I do not have the same number of rows in each partition. But I do not know how to fix that: I only know how to specify the number of partitions, not how the rows are distributed across them.
Or, more specifically, I do not know how to specify the partitioning when there is no column I can partition on. Remember, the two dataframes have no shared column.
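If it helps, I believe my guess could be confirmed with something like the following (a rough sketch: glom() collects each partition into a list, so the list lengths are the per-partition row counts):

left_counts = left_df.rdd.glom().map(len).collect()    # rows per partition, left side
right_counts = right_df.rdd.glom().map(len).collect()  # rows per partition, right side
print(left_counts == right_counts)                     # False would confirm the mismatch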
How do I know I can combine them this way, with no shared join key? In this case, it is because I am trying to join model predictions back onto the input data, but I run into this situation more generally, beyond just model data + predictions.
My Questions
- Specifically in the case above, how can I properly set up the partitioning so that the zip works?
- How should I join two dataframes by row index? (The only alternative I am aware of is sketched below.)
- (I know the standard response is "you shouldn't... partitioning makes indices nonsensical", but until Spark provides ML libraries that do not force the kind of data loss I described in the link above, this will always be an issue.)
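For context, here is a rough sketch of the zipWithIndex-based alternative I have in mind (the with_row_idx helper and the row_idx column name are just illustrative); it only makes sense if both dataframes enumerate their rows in the same, stable order, which I understand Spark does not guarantee in general:

from pyspark.sql.types import LongType, StructField, StructType

def with_row_idx(df, idx_col="row_idx"):
    # Append an explicit row index, numbering rows in partition order.
    indexed_schema = StructType(df.schema.fields + [StructField(idx_col, LongType(), False)])
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    return spark.createDataFrame(indexed_rdd, indexed_schema)

full_data = (
    with_row_idx(left_df)
    .join(with_row_idx(right_df), on="row_idx")
    .drop("row_idx")
)

Unlike zip, the join does not care how either side is partitioned, at the cost of a shuffle and of trusting the row order.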