I have a very big PySpark DataFrame. I want to perform preprocessing on subsets of it and store each processed subset to HDFS; later I want to read all of them back and merge them together. Thanks.
1 Answer
Writing a DataFrame to HDFS (Spark 1.6):

    # df is an existing DataFrame object
    df.write.save('/target/path/', format='parquet', mode='append')

Some of the `format` options are `csv`, `parquet`, `json`, etc.
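Applied to the question's workflow (preprocess subsets of a large DataFrame, then persist each one), a minimal sketch could look like the following; `big_df`, the `group_id` column, the `preprocess` body, and the target path are all illustrative assumptions, not part of the original answer:

    def preprocess(subset_df):
        # placeholder preprocessing step (assumption); replace with real logic
        return subset_df.dropna()

    # hypothetical: split the big DataFrame on a grouping column and append
    # each processed subset to the same HDFS location
    keys = [row.group_id for row in big_df.select('group_id').distinct().collect()]
    for key in keys:
        subset = big_df.filter(big_df.group_id == key)
        preprocess(subset).write.save('/target/path/', format='parquet', mode='append')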
Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    sqlContext.read.format('parquet').load('/path/to/file')

The `format` method takes arguments such as `parquet`, `csv`, `json`, etc.
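To finish the question's merge step: if every subset was appended to the same path, a single load already returns the combined data; if the subsets went to separate directories, they can be loaded individually and unioned. A minimal sketch, with illustrative paths:

    from functools import reduce  # built in on Python 2, imported on Python 3
    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # case 1: everything was appended under one directory
    merged = sqlContext.read.format('parquet').load('/target/path/')

    # case 2: subsets live in separate directories; load each and union them
    # (unionAll is the Spark 1.6 name; it became union in Spark 2.x)
    paths = ['/target/subset_a/', '/target/subset_b/']
    parts = [sqlContext.read.format('parquet').load(p) for p in paths]
    merged = reduce(lambda a, b: a.unionAll(b), parts)

The path passed to `load` also accepts Hadoop glob patterns (e.g. `/target/subset_*/`), which is what the comment below about reading multiple files refers to.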
- Hey, I get AttributeError: 'DataFrameWriter' object has no attribute 'csv'. Also, I need to read that DataFrame later, that is, I think, in a new Spark session. – Ajg May 31 '17 at 17:23
- What is the version of your Spark installation? – rogue-one May 31 '17 at 17:24
- Spark version 1.6.1 – Ajg May 31 '17 at 17:26
- Thanks a lot. One doubt: while reading, what if there are multiple files in that location? How do I specify which file I want to read? Thanks – Ajg May 31 '17 at 17:55
- If you want to read only one file among many, just specify the full file path. If you want to read all the files, you can use glob patterns like `*` in the path. – rogue-one May 31 '17 at 17:57
- Thanks. Will try that. – Ajg May 31 '17 at 19:10
- Sorry for one more question: can you please tell me how to delete those DataFrames from HDFS afterwards? – Ajg May 31 '17 at 20:29
- To delete the data from HDFS you can use HDFS shell commands like `hdfs dfs -rm -r -f <path>`. You can execute this from Python with subprocess, like `subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", path])` – rogue-one May 31 '17 at 21:33
- What is the target path? Where does HDFS actually live on my PC? – ERJAN Jul 22 '20 at 14:44
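Following up on the deletion comment above: a small sketch of removing the written data from HDFS by shelling out from Python. The path is the illustrative one used earlier, and note that the HDFS shell spells the recursive and force flags separately as `-rm -r -f`:

    import subprocess

    path = '/target/path/'
    # recursively and forcibly remove the directory that holds the parquet output
    return_code = subprocess.call(["hdfs", "dfs", "-rm", "-r", "-f", path])
    if return_code != 0:
        raise RuntimeError("hdfs dfs -rm failed for %s" % path)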