
I'm saving a DataFrame in PySpark to a particular location, but I cannot see the file/files in that directory. Where are they? How do I get to them outside of PySpark? How do I delete them? And what am I missing about how Spark works?

Here's how I save them...

df.write.format('parquet').mode('overwrite').save('path/to/filename')

And subsequently the following works...

df_ntf = spark.read.format('parquet').load('path/to/filename')

But no files ever appear in path/to/filename.

This is on a Cloudera cluster; let me know if any other details are needed to diagnose the problem.

EDIT - These are the commands I use to set up my Spark contexts.

os.environ['SPARK_HOME'] = "/opt/cloudera/parcels/Anaconda/../SPARK2/lib/spark2/"
os.environ['PYSPARK_PYTHON'] = "/opt/cloudera/parcels/Anaconda/envs/python3/bin/python"                                           

conf = SparkConf()
conf.setAll([('spark.executor.memory', '3g'),
             ('spark.executor.cores', '3'),
             ('spark.num.executors', '29'),
             ('spark.cores.max', '4'),
             ('spark.driver.memory', '2g'),
             ('spark.pyspark.python', '/opt/cloudera/parcels/Anaconda/envs/python3/bin/python'),
             ('spark.dynamicAllocation.enabled', 'false'),
             ('spark.sql.execution.arrow.enabled', 'true'),
             ('spark.sql.crossJoin.enabled', 'true')
             ])

print("Creating Spark Context at {}".format(datetime.now()))

spark_ctx = SparkContext.getOrCreate(conf=conf)

spark = SparkSession(spark_ctx)
hive_ctx = HiveContext(spark_ctx)
sql_ctx = SQLContext(spark_ctx)
EddyTheB
  • What is your resource manager, and where are you trying to save the file, local or HDFS? Which mode are you running the Spark job in (local/cluster/client)? – data_addict Jul 18 '19 at 10:01
  • @user805. Honestly, no idea; it's a black box that I've been told to use with minimal training. I'll edit my question to show the way I've been taught to create my Spark context, and hopefully that will be illuminating! – EddyTheB Jul 18 '19 at 10:28

1 Answer


Ok, a colleague and I have figured it out. It's not complicated but we are but simple data scientists so it wasn't obvious to us.

Basically, the files were being saved in HDFS, the cluster's distributed filesystem, not on the local drive of the machine from which we run our queries in Jupyter notebooks. Because the save path had no scheme and wasn't absolute, it resolved against our HDFS home directory, /user/my.name/.
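The resolution rule can be illustrated without a cluster. The helper below is a simplified, illustrative mimic of how Hadoop resolves a save path (it is not Spark's API), and the `hdfs://namenode:8020` default filesystem URI and `my.name` user are assumptions for the example:

```python
from urllib.parse import urlparse

def resolve_save_path(path, default_fs="hdfs://namenode:8020", user="my.name"):
    """Illustrative sketch of Hadoop path resolution: an explicit scheme
    wins, an absolute path lands on the default filesystem, and a
    relative path resolves under the user's HDFS home directory."""
    if urlparse(path).scheme:      # e.g. file:///... or hdfs://...
        return path
    if path.startswith("/"):       # absolute path on the default filesystem
        return default_fs + path
    return "{}/user/{}/{}".format(default_fs, user, path)

print(resolve_save_path("path/to/filename"))
# hdfs://namenode:8020/user/my.name/path/to/filename
```

This is why `df.write...save('path/to/filename')` and `spark.read...load('path/to/filename')` agreed with each other (both resolved to the same HDFS location) while nothing appeared on the local disk; prefixing the path with `file://` would have forced the node-local filesystem instead.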

We found them by running:

hdfs dfs -ls -h /user/my.name/path/to
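To get the files out of HDFS or to delete them, the same `hdfs dfs` CLI works from any shell on a cluster node: `hdfs dfs -get <path> .` copies a directory to the local filesystem, and `hdfs dfs -rm -r <path>` deletes it (recursively, since a saved parquet "file" is actually a directory of part files). A small Python wrapper like the hypothetical one below can drive the same commands from a notebook; the `hdfs_cmd` helper is illustrative, while the `-get`/`-rm -r` subcommands are the real HDFS shell ones:

```python
import subprocess

def hdfs_cmd(*args):
    """Build an 'hdfs dfs' command line as a list for subprocess.run."""
    return ["hdfs", "dfs", *args]

# On a cluster node these would be run like so (requires the hdfs CLI):
# subprocess.run(hdfs_cmd("-get", "/user/my.name/path/to/filename", "."), check=True)
# subprocess.run(hdfs_cmd("-rm", "-r", "/user/my.name/path/to/filename"), check=True)

print(hdfs_cmd("-ls", "-h", "/user/my.name/path/to"))
# ['hdfs', 'dfs', '-ls', '-h', '/user/my.name/path/to']
```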
EddyTheB