Here's what I did to merge two DataFrames column-wise in PySpark (without joining), building on @Shankar Koirala's answer:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
Here's my PySpark code:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Build the first DataFrame: (id, name)
df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)

# Build the second DataFrame: (secNo, city)
df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)

# The merged schema is just the two field lists concatenated
df3_schema = StructType(df1.schema.fields + df2.schema.fields)

def myFunc(x):
    # x is a pair (row_from_df1, row_from_df2); flatten it into one row
    dt1 = x[0]
    dt2 = x[1]
    id = dt1[0]
    name = dt1[1]
    secNo = dt2[0]
    city = dt2[1]
    return [id, name, secNo, city]

# zip pairs up the rows of both RDDs positionally, then myFunc flattens each pair
rdd_merged = df1.rdd.zip(df2.rdd).map(myFunc)
df3 = spark.createDataFrame(rdd_merged, schema=df3_schema)
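Calling df3.show() prints the merged table from the diagram above:

df3.show()
# +---+-----+-----+----+
# | id| name|secNo|city|
# +---+-----+-----+----+
# |  1|sammy|  101|  LA|
# |  2| jill|  102|  CA|
# |  3| john|  103|  DC|
# +---+-----+-----+----+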
Note that the two DataFrames must have the same number of rows; under the hood, RDD.zip also requires both RDDs to have the same number of partitions and the same number of elements in each partition. Thank you, Shankar Koirala.
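If you want to reuse this for other DataFrames, the hand-written unpacking in myFunc can be generalized. Here's a minimal sketch (the helper name merge_columnwise is my own, and it assumes a SparkSession named spark is in scope, as in the code above). Since Row objects are tuples, concatenating them flattens each zipped pair without naming every column:

from pyspark.sql.types import StructType

def merge_columnwise(df_a, df_b):
    # Hypothetical helper: column-wise merge via RDD zip.
    # Precondition: both DataFrames have the same number of rows and
    # partitions, with matching elements per partition (zip's requirement).
    merged_schema = StructType(df_a.schema.fields + df_b.schema.fields)
    # Row is a subclass of tuple, so tuple concatenation flattens the pair
    merged_rdd = df_a.rdd.zip(df_b.rdd).map(lambda pair: tuple(pair[0]) + tuple(pair[1]))
    return spark.createDataFrame(merged_rdd, schema=merged_schema)

df3 = merge_columnwise(df1, df2)  # same result as the explicit version above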