9

How do I do the equivalent of pandas pd.concat([df1,df2],axis='columns') with PySpark DataFrames? I googled and couldn't find a good solution.

DF1
var1        
     3      
     4      
     5      

DF2
var2    var3     
  23      31
  44      45
  52      53

Expected output dataframe
var1        var2    var3
     3        23      31
     4        44      45
     5        52      53
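
For reference, here is a minimal pandas sketch of the operation being asked about, using the sample data above:

import pandas as pd

df1 = pd.DataFrame({"var1": [3, 4, 5]})
df2 = pd.DataFrame({"var2": [23, 44, 52], "var3": [31, 45, 53]})

# Column-wise concatenation: rows are aligned on the index (0, 1, 2 here)
print(pd.concat([df1, df2], axis='columns'))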

Edited to include expected output

ZygD
GeorgeOfTheRF

4 Answers

11

The equivalent of the accepted answer using PySpark would be:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.sparkContext.parallelize([(1, "a"), (2, "b"), (3, "c")]).toDF(["id", "name"])
df2 = spark.sparkContext.parallelize([(7, "x"), (8, "y"), (9, "z")]).toDF(["age", "address"])

# Combined schema: columns of df1 followed by columns of df2
schema = StructType(df1.schema.fields + df2.schema.fields)

# Pair rows up positionally and concatenate them (Row is a tuple subclass, so + works)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
spark.createDataFrame(df1df2, schema).show()
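
A caveat worth knowing (my own note, not part of the original answer): rdd.zip requires both RDDs to have the same number of partitions and the same number of rows in each partition, which is the cause of the "Can't zip RDDs with unequal partitions" error mentioned in a comment further down. A quick check before zipping:

# zip pairs rows positionally; partition counts (and rows per partition) must match
assert df1.rdd.getNumPartitions() == df2.rdd.getNumPartitions()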
Devi
    I've come across the same issue and this solution by @Devi works very well. However, I find it baffling why such a basic operation would be so difficult in PySpark and not implemented directly. – martin_wun Oct 04 '21 at 08:41
10

I spent hours trying to do this with PySpark, and a working solution of mine is as follows (essentially the Python equivalent of @Shankar Koirala's answer):

from pyspark.sql.functions import monotonically_increasing_id

# df2 and df3 are the two DataFrames you want to concatenate column-wise
DF1 = df2.withColumn("row_id", monotonically_increasing_id())
DF2 = df3.withColumn("row_id", monotonically_increasing_id())
result_df = DF1.join(DF2, "row_id").drop("row_id")

You simply define a common column for both DataFrames and drop it right after the merge. I hope this solution helps in cases where the DataFrames do not share any common columns.

However, keep in mind that this method does not guarantee that rows are paired in their original order, because monotonically_increasing_id assigns non-consecutive IDs across partitions; a zipWithIndex-based alternative is sketched below.
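
Here is a minimal sketch of the zipWithIndex alternative asked about in a comment below. zipWithIndex assigns consecutive 0-based indices in the current row order, so the two DataFrames can be joined on that index. The helper with_row_index and the column name row_idx are illustrative names of my own, and spark is assumed to be your active SparkSession:

from pyspark.sql.types import LongType, StructField, StructType

def with_row_index(df, col_name="row_idx"):
    # Append a consecutive 0-based index column based on the current row order
    indexed_schema = StructType(df.schema.fields + [StructField(col_name, LongType())])
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    return spark.createDataFrame(indexed_rdd, indexed_schema)

# Join on the index, restore the original row order, then drop the helper column
result_df = (with_row_index(df2).join(with_row_index(df3), "row_idx")
             .orderBy("row_idx")
             .drop("row_idx"))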

Kubra Altun
4

Below is an example of what you want to do, but in Scala; I hope you can convert it to PySpark.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession
    .builder()
    .master("local")
    .appName("ParquetAppendMode")
    .getOrCreate()
  import spark.implicits._

  val df1 = spark.sparkContext.parallelize(Seq(
    (1, "abc"),
    (2, "def"),
    (3, "hij")
  )).toDF("id", "name")

  val df2 = spark.sparkContext.parallelize(Seq(
    (19, "x"),
    (29, "y"),
    (39, "z")
  )).toDF("age", "address")

  val schema = StructType(df1.schema.fields ++ df2.schema.fields)

  val df1df2 = df1.rdd.zip(df2.rdd).map{
    case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}

  spark.createDataFrame(df1df2, schema).show()

This is how to do it using only the DataFrame API:

import org.apache.spark.sql.functions._

val ddf1 = df1.withColumn("row_id", monotonically_increasing_id())
val ddf2 = df2.withColumn("row_id", monotonically_increasing_id())

val result = ddf1.join(ddf2, Seq("row_id")).drop("row_id")

result.show()

Add a new row_id column to each DataFrame and join the two on row_id as the key.

Hope this helps!

koiralo
    Note that joining based on `monotonically_increasing_id` is not really a good idea since it does not guarantee consecutive numbers, see for example here: https://stackoverflow.com/questions/44336593/how-to-use-monotonically-increasing-id-to-join-two-pyspark-dataframes-having-no – Shaido Sep 25 '19 at 09:42
    scala example which uses zip gives `Can't zip RDDs with unequal partitions` error. Any idea how to do it with `zipWithIndex` ? – GadaaDhaariGeek Dec 05 '19 at 13:32
0

Here is what I did to merge two DataFrames column-wise in PySpark (without joining), using @Shankar Koirala's answer:

    +---+-----+        +-----+----+       +---+-----+-----+----+
    | id| name|        |secNo|city|       | id| name|secNo|city|
    +---+-----+        +-----+----+       +---+-----+-----+----+
    |  1|sammy|    +   |  101|  LA|   =>  |  1|sammy|  101|  LA|
    |  2| jill|        |  102|  CA|       |  2| jill|  102|  CA|
    |  3| john|        |  103|  DC|       |  3| john|  103|  DC|
    +---+-----+        +-----+----+       +---+-----+-----+----+

Here's my PySpark code:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
    df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
    df1 = spark.createDataFrame(df1, schema=df1_schema)

    df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])

    df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
    df2 = spark.createDataFrame(df2, schema=df2_schema)

    df3_schema = StructType(df1.schema.fields + df2.schema.fields)

    # Flatten a (row_from_df1, row_from_df2) pair into a single list of column values
    def myFunc(x):
      dt1 = x[0]
      dt2 = x[1]

      id = dt1[0]
      name = dt1[1]
      secNo = dt2[0]
      city = dt2[1]

      return [id,name,secNo,city]


    rdd_merged = df1.rdd.zip(df2.rdd).map(lambda x: myFunc(x))

    df3 = spark.createDataFrame(rdd_merged, schema=df3_schema)

Note that the two tables should have the same number of rows; a quick check is shown below. Thank you, Shankar Koirala.
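
As a quick sanity check (my own addition, not from the original answer), you can confirm the row counts match before zipping and then display the merged result:

    # rdd.zip pairs rows positionally, so both inputs need the same number of rows
    assert df1.count() == df2.count()
    df3.show()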