When I do a full outer join in PySpark, it does not return any rows.

 from __future__ import print_function
 import sys
 import json
 import os
 from pyspark.conf import SparkConf
 from functools import reduce
 from pyspark.sql import SparkSession

 spark = SparkSession \
     .builder \
     .appName("SPARK-TEST") \
     .config("spark.sql.shuffle.partitions", u"00") \
     .config("spark.speculation.interval", "2000ms") \
     .getOrCreate()

 intpath ='s3://....' 
 awspath ='s3://.....'  

 col_join="yearmonth"


 DF1=spark.read.option("inferSchema",True).option("header",True).csv(intpath)       
 DF1.show(10,False)                    
 count=DF1.count()      
 print("int count is "+str(count)) 

 DF2 = spark.read.option("inferSchema", True).option("header",True).csv(awspath)           
 DF2.show(10,False)                
 count=DF2.count()           
 print("aws count is "+str(count))                                             
 DF1.createOrReplaceTempView("int")
 DF2.createOrReplaceTempView("aws")   
 #final_result_notmatching = DF1.join(DF2, DF1[col_join] == DF2[col_join], "FullOuter")

 final_result_notmatching = spark.sql(
     "select * from int full outer join aws "
     "ON int.{} = aws.{}".format(col_join, col_join))

The full outer join returns zero rows. What is wrong with this? Inner and left outer joins work fine.

+---------+---------+--------------------+
|yearmonth|total_rec|total_unique_count  |
+---------+---------+--------------------+
|2018-10  |1160863  |1160863             |
|2018-11  |1042284  |1042284             |
|2019-01  |172704   |172704              |
|2018-12  |952276   |952276              |
|2018-06  |1177168  |1177168             |
|2018-07  |1183703  |1183703             |
|2018-08  |1183003  |1183003             |
|2018-09  |1176182  |1176182             |
+---------+---------+--------------------+

int count is 8

+---------+---------+--------------------+
|yearmonth|total_rec|total_unique_count  |
+---------+---------+--------------------+
|2018-06  |1154341  |1154341             |
|2018-08  |1112278  |1112278             |
|2018-11  |6794     |6794                |
|2018-07  |1155195  |1155195             |
|2018-09  |1059808  |1059808             |
|2018-10  |988629   |988629              |
+---------+---------+--------------------+

aws count is 6

The full outer join output is empty:

+---------+---------+--------------------+---------+---------+--------------------+
|yearmonth|total_rec|total_unique_count  |yearmonth|total_rec|total_unique_count  |
+---------+---------+--------------------+---------+---------+--------------------+
+---------+---------+--------------------+---------+---------+--------------------+

Am I doing something wrong? Is there an issue with the query?

I have tried both the DataFrame API and SQL. Both give the same result.

Do I need to add any null conditions?
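
For reference, here is a plain-Python sketch (not Spark) of the rows a full outer join on `yearmonth` would be expected to produce for the two tables above. The helper `full_outer_join_keys` is hypothetical, compares only the key column, and assumes keys are unique on each side, as they are in the posted data. A missing side is shown as `None`, which corresponds to the null columns Spark would show for unmatched rows.

```python
# Keys copied from the two tables shown in the question.
int_keys = ["2018-10", "2018-11", "2019-01", "2018-12",
            "2018-06", "2018-07", "2018-08", "2018-09"]
aws_keys = ["2018-06", "2018-08", "2018-11",
            "2018-07", "2018-09", "2018-10"]


def full_outer_join_keys(left, right):
    """Return (left_key, right_key) pairs; None marks a missing side.

    Hypothetical helper for illustration only. Assumes keys are unique
    within each input, as they are in the posted tables.
    """
    left_set, right_set = set(left), set(right)
    rows = []
    # Every key from either side appears exactly once in the result.
    for key in sorted(left_set | right_set):
        rows.append((key if key in left_set else None,
                     key if key in right_set else None))
    return rows


rows = full_outer_join_keys(int_keys, aws_keys)
for row in rows:
    print(row)
print("row count:", len(rows))
```

Under these assumptions the join has 8 rows: 6 matched months, plus 2018-12 and 2019-01 with nulls on the aws side. An empty result is therefore not what the data alone would explain.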

Radhika k