
I'm having a problem exporting a PySpark dataframe to a CSV file. Perhaps I'm misunderstanding how Spark works.

I've tried converting the dataframe to a Pandas dataframe and then writing it to a CSV, but it didn't work:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext, SQLContext
from datetime import datetime
import time
import sys
import subprocess

start = time.time()
start_time = datetime.now()

myhost = sys.argv[1]
oracle_cnx = sys.argv[2]
thrift_host = "thrift://"+ myhost+":9083"
print thrift_host

# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris", thrift_host)

# Create Spark Session
spark = (SparkSession
            .builder
            .appName('Pyspark-read-and-write-from-hive')
            .config("hive.metastore.uris", thrift_host)
            .enableHiveSupport()
            .getOrCreate())

hive_query = """
select  cast(u.policy_key as int) , cast(u.endorsement_cde as int) from (
    select policy_key, endorsement_cde
    from v_od_genpolicyshadow_listall_perm
union
    select policy_key, endorsement_cde
    from v_od_genpolicy_listall_perm
) u
left join v_od_genpolicyshadow_listall_perm s
on s.policy_key = u.policy_key
and s.endorsement_cde = u.endorsement_cde
where s.policy_key is NULL
"""

print('Start the HIVE query')
df_hive = spark.sql(hive_query)
#print "Number of rows in HIVE final policies DF"
#print df_hive.count()   #Produces an error

from pyspark.sql.functions import *

#print df_hive.printSchema()

# Run the oracle query: a python script that loads two columns
# into a pandas dataframe, then saves the dataframe to a csv file.
# (The file is populated after the oracle_query.py script finishes.)

subprocess.call(['python', 'oracle_query.py', oracle_cnx])

# Read the oracle csv file
df_oracle = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.option('delimiter', '|')\
.csv("data/oracle_query.csv")

df_merge = df_hive.join(
    df_oracle,
    (df_hive.policy_key == df_oracle.policy_key_g)
    & (df_hive.endorsement_cde == df_oracle.endorsement_cde_g),
    how='left')

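# NB: toPandas() already returns a pandas DataFrame, which has no .collect()
# method, and pandas' to_csv() writes the file as a side effect and returns
# None, so df_final below ends up as None even when the export works.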
df_merge = df_merge.toPandas().collect()
df_final = df_merge.to_csv('data/test.csv', sep="|", encoding='utf-8')

I get the following error:

Start the HIVE query
Start the oracle query
Insert into csv
Conversion to Pandas dataframe                                                  
root
 |-- policy_key: integer (nullable = true)
 |-- endorsement_cde: integer (nullable = true)
 |-- policy_key_g: integer (nullable = true)
 |-- endorsement_cde_g: integer (nullable = true)

None
2019-01-17 17:37:55 WARN  Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

Traceback (most recent call last):
  File "main.py", line 77, in <module>
    df_merge = df_merge.toPandas().collect()
  File "/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py", line 1966, in toPandas
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
  File "/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py", line 466, in collect
    port = self._jdf.collectToPython()
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.collectToPython.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:

The log goes on much longer, but I think this is the relevant part.

I have also tried this:

df_merge.coalesce(1).write.csv('data/test.csv', sep="|", header=True)

But instead of getting a file, a folder named test.csv is created, with no file inside it.

I basically have two dataframes that I'm joining, and I want the output in a single CSV file. What am I missing?
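For context, my understanding is that there are two usual ways to get a single CSV file out of Spark, sketched below using the df_merge and data/ paths from my script; since the underlying job fails, neither currently helps, but this is what I'm aiming for:

# Option 1: collect everything to the driver and let pandas write the file.
# toPandas() returns a pandas DataFrame directly (no .collect() needed), and
# to_csv() writes the file as a side effect and returns None.
df_merge.toPandas().to_csv('data/test.csv', sep='|', encoding='utf-8', index=False)

# Option 2: let Spark write a directory, then move the single part file.
# Spark always writes a directory of part files; coalesce(1) only guarantees
# that the directory contains exactly one part.
import glob
import shutil

df_merge.coalesce(1).write.csv('data/test_dir', sep='|', header=True, mode='overwrite')
part_file = glob.glob('data/test_dir/part-*')[0]
shutil.move(part_file, 'data/test.csv')

(Option 2 only works like this when Spark writes to the local filesystem; on HDFS the part file would have to be fetched with something like hdfs dfs -getmerge. And the empty test.csv folder above suggests the write job itself failed, presumably for the same underlying reason as the collect.)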

GeoSal

  • Take a look at [this](https://stackoverflow.com/a/47932523/5858851). – pault Jan 17 '19 at 18:28
  • Your dataframe might be empty, check with `df_merge.count()`. – serge_k Jan 18 '19 at 05:45
  • @serge_k df_merge.count() gives me an error: Traceback (most recent call last): File "main.py", line 45, in print df_hive.count() File "/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py", line 455, in count return int(self._jdf.count()) File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/usr/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) – GeoSal Jan 18 '19 at 10:07
  • @serge_k File "/usr/local/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o38.count. – GeoSal Jan 18 '19 at 10:07
  • @serge_k You're probably right.. I'll post my entire script so you can take a look at the whole picture. Thanks in advance for your help! – GeoSal Jan 18 '19 at 10:08
  • @serge_k I edit my post to include the whole script, could you please take a look at it? Thanks again! – GeoSal Jan 18 '19 at 10:20
  • I would suggest you execute the script line by line in the pyspark shell to see where it fails (see the sketch after these comments). In your SQL query you are doing a left join and then filtering on `s.policy_key is NULL`; I think it's the same as an inner join. Try calling `df_hive.count()` right after `df_hive = spark.sql(hive_query)`, or maybe `df_hive.show()`. – serge_k Jan 18 '19 at 10:36
  • Plus, check that your join keys have the same types in both tables. – serge_k Jan 18 '19 at 10:37
  • @serge_k It seems as though the Hive query is not executed in the first place. I've tried replacing the query with a simple select and still got nothing. – GeoSal Jan 18 '19 at 12:54
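Following serge_k's suggestion, a minimal line-by-line check in the pyspark shell might look like this (a sketch, reusing the spark session, hive_query, and df_oracle from the script above):

# 1. Force the Hive query to execute on its own, before any join.
df_hive = spark.sql(hive_query)
df_hive.show(5)   # if this fails, the problem is the query/metastore, not the join
df_hive.count()

# 2. Compare the join key types on both sides.
print(df_hive.dtypes)    # expected: [('policy_key', 'int'), ('endorsement_cde', 'int')]
print(df_oracle.dtypes)  # inferSchema on the CSV may have produced different types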

0 Answers