
So I have a users df with unique user_ids and a second df with a set of questions. I would then like to merge the dfs so that each user_id is attached to the full set of questions:

User Df:

+--------------------------+
|user_id                   |
+--------------------------+
|GDDVWWIOOKDY4WWBCICM4VOQHQ|
|77VC23NYEWLGHVVS4UMHJEVESU|
|VCOX7HUHTMPFCUOGYWGL4DMIRI|
|XPJBJMABYXLTZCKSONJVBCOXQM|
|QHTPQSFNOA5YEWH6N7FREBMMDM|
|JLQNBYCSC4DGCOHNLRBK5UANWI|
|RWYUOLBKIQMZVYHZJYCQ7SGTKA|
|CR33NGPK2GKK6G35SLZB7TGIJE|
|N6K7URSGH65T5UT6PZHMN62E2U|
|SZMPG3FQQOHGDV23UVXODTQETE|
+--------------------------+

Questions Df:

+--------------------+-------------------+-----------------+--------------------+
|       category_type|   category_subject|      question_id|            question|
+--------------------+-------------------+-----------------+--------------------+
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|Consumer & Lifestyle|     Dietary Habits|pdl_diet_identity|Eating habits des...|
|        Demographics|Social Demographics|pdl_ethnicity_new|           Ethnicity|
|        Demographics|Social Demographics|pdl_ethnicity_new|           Ethnicity|
|        Demographics|Social Demographics|pdl_ethnicity_new|           Ethnicity|
+--------------------+-------------------+-----------------+--------------------+

At the moment I turn the user_ids into a list and loop through them, adding a new user_id column to the questions df to create a temporary df for each iteration. I then union that onto a final df to save the results for that user_id iteration, as per below:

create user_id list:

from pyspark.sql import functions as f

unique_users_list = users_df \
  .select("user_id") \
  .agg(f.collect_list("user_id")).collect()[0][0]

create empty final df to append to:

from pyspark.sql.types import StructType, StructField, StringType

finaldf_schema = StructType([
    StructField("category_type", StringType(), False),
    StructField("category_subject", StringType(), False),
    StructField("question_id", StringType(), False),
    StructField("question", StringType(), False),
    StructField("user_id", StringType(), False)
])

final_df = spark.createDataFrame([], finaldf_schema)

Then loop through the user_ids, merging each one onto the questions df:

for user_id in unique_users_list:
    # Tag every question with this user's id, then append the result to final_df.
    temp_df = questions_df.withColumn("user_id", f.lit(user_id))
    final_df = final_df.union(temp_df)

However, I find the performance very slow. Is there a more efficient and faster way to do this, please?

Thanks

RonD
  • I think you just want to do a `crossJoin`. Try `users_df.crossJoin(questions_df)` – pault Sep 08 '18 at 19:21
  • Thanks pault that worked a charm. – RonD Sep 08 '18 at 19:46
  • Related/potential dupe: [Memory efficient cartesian join in PySpark](https://stackoverflow.com/questions/42069991/memory-efficient-cartesian-join-in-pyspark) – pault Sep 08 '18 at 20:19

1 Answer


What you are looking for is called a Cartesian product. You can achieve this using `pyspark.sql.DataFrame.crossJoin()`:

Try:

final_df = users_df.crossJoin(questions_df)
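
For completeness, here is a minimal self-contained sketch of the cross join. The column names come from the question; the sample rows are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data mirroring the shapes shown in the question.
users_df = spark.createDataFrame(
    [("GDDVWWIOOKDY4WWBCICM4VOQHQ",), ("77VC23NYEWLGHVVS4UMHJEVESU",)],
    ["user_id"],
)
questions_df = spark.createDataFrame(
    [
        ("Consumer & Lifestyle", "Dietary Habits", "pdl_diet_identity", "Eating habits"),
        ("Demographics", "Social Demographics", "pdl_ethnicity_new", "Ethnicity"),
    ],
    ["category_type", "category_subject", "question_id", "question"],
)

# Pair every user_id with every question in one distributed operation,
# with no driver-side loop and no growing chain of unions.
final_df = users_df.crossJoin(questions_df)
final_df.show(truncate=False)
# Expect users_df.count() * questions_df.count() rows:
# each user_id attached to the full set of questions.

The result is built as a single join rather than a driver loop of repeated union calls, which keeps the query plan small and lets Spark parallelise the whole operation.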
pault