
I have multiple dataframes that I need to concatenate together, column-wise (side by side, with rows matched by position). In pandas, we would typically write: pd.concat([df1, df2], axis=1).

This thread, *How to concatenate/append multiple Spark dataframes column wise in Pyspark?*, appears close, but its answer:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
    df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
    df1 = spark.createDataFrame(df1, schema=df1_schema)

    df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
    df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
    df2 = spark.createDataFrame(df2, schema=df2_schema)

    # combine the two schemas and zip the RDDs row by row
    schema = StructType(df1.schema.fields + df2.schema.fields)
    df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
    spark.createDataFrame(df1df2, schema).show()

This yields the following error when run on my data at scale: `Can only zip RDDs with same number of elements in each partition`

How can I join two or more dataframes that have identical row counts but otherwise share no data (their rows correspond by position, with no common key columns)?

Example expected data looks like:

    +---+-----+        +-----+----+       +---+-----+-----+----+
    | id| name|        |secNo|city|       | id| name|secNo|city|
    +---+-----+        +-----+----+       +---+-----+-----+----+
    |  1|sammy|    +   |  101|  LA|   =>  |  1|sammy|  101|  LA|
    |  2| jill|        |  102|  CA|       |  2| jill|  102|  CA|
    |  3| john|        |  103|  DC|       |  3| john|  103|  DC|
    +---+-----+        +-----+----+       +---+-----+-----+----+

John Stud
  • What output are you expecting? It looks like you are trying to merge these two dataframes column-wise (id, name, age, address) into a single table; what key do you expect the dataframes to use? – H Roy Jul 27 '22 at 03:57
  • 1
    spark (or sql) would require your dataframes to have a join-key in all dataframes that gets matched while merging the data. without that a cross join is performed. see doc on [`join`](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.join.html) and [some examples](https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/). – samkart Jul 27 '22 at 05:31
  • 1
    also, spark does not retain a sort order like pandas does. so, while pandas can merge on index, spark does not have anything without an `id` column. – samkart Jul 27 '22 at 05:40
  • I've updated the expected output; I am hoping to merge the tables without using any key, à la how pandas handles it. – John Stud Jul 27 '22 at 07:32
  • The reason it doesn't work at scale is that the RDDs have multiple partitions with a varying number of records per partition, so zip fails. A detailed answer here: https://stackoverflow.com/a/36058718/19261326 – viggnah Jul 27 '22 at 08:08
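
For reference, here is a minimal sketch (my own, following the workaround described in the answer linked in the last comment) of pairing rows by a generated index instead of `rdd.zip`, assuming the `df1`/`df2` defined above and an active `spark` session; `zipWithIndex` assigns an index to every row regardless of how the partitions are balanced, so the two RDDs can be matched with an ordinary join:

    from pyspark.sql.types import StructType

    # give every row a stable 0-based index, independent of partition sizes,
    # and key each RDD by that index
    rdd1 = df1.rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
    rdd2 = df2.rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))

    # pair rows that share an index, then flatten the two Rows into one tuple
    joined = rdd1.join(rdd2).map(lambda kv: tuple(kv[1][0]) + tuple(kv[1][1]))

    schema = StructType(df1.schema.fields + df2.schema.fields)
    spark.createDataFrame(joined, schema).show()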

1 Answer


You can create position-based unique IDs on each dataframe with

    from pyspark.sql.functions import expr

    df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
    df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))

then you can left-join them on that ID and drop the helper column:

    df1.join(df2, "unique_id", "left").drop("unique_id")

The final output looks like:

    +---+-----+-----+----+
    | id| name|secNo|city|
    +---+-----+-----+----+
    |  1|sammy|  101|  LA|
    |  2| jill|  102|  CA|
    |  3| john|  103|  DC|
    +---+-----+-----+----+
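
For reference, a self-contained PySpark sketch of the same idea (my rendering, not part of the original answer), assuming the `df1`/`df2` from the question and an active `spark` session. It uses `monotonically_increasing_id()` to supply the window ordering instead of the SQL `(select null)` trick; as the comments note, Spark does not guarantee a row order, so the generated IDs simply follow whatever order Spark produces:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # number the rows of each dataframe; the ordering expression only exists
    # because row_number() needs one, so the IDs follow Spark's own row order
    # (note: an un-partitioned window pulls all rows into a single partition)
    w = Window.orderBy(F.monotonically_increasing_id())
    df1_id = df1.withColumn("unique_id", F.row_number().over(w))
    df2_id = df2.withColumn("unique_id", F.row_number().over(w))

    # stitch the two dataframes together on the generated ID
    df1_id.join(df2_id, "unique_id", "left").drop("unique_id").show()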
vilalabinot