
I am new to Apache Spark 1.3.1. How can I convert a JSON file to Parquet?

  • You could also use Apache Drill (maybe easier to set up); it can convert JSON from a local filesystem to Parquet on HDFS in one line of SQL: "CREATE TABLE dfs.drill.`/test5/` AS (SELECT * FROM dfs.gen.`/2016/10/*/*.json` e);". See https://drill.apache.org/docs/parquet-format/ if you are interested. – Thomas Decaux Oct 05 '16 at 07:14

1 Answer


Spark 1.4 and later

You can use Spark SQL to first read the JSON file into a DataFrame and then write that DataFrame out as a Parquet file.

val df = sqlContext.read.json("path/to/json/file")
df.write.parquet("path/to/parquet/file")

or

df.save("path/to/parquet/file", "parquet")

See the Spark SQL programming guide for examples and more details.
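
For completeness, here is a minimal self-contained sketch of the Spark 1.4+ approach. The application name, the local master, and the file paths are placeholders, so adjust them for your environment:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and local master; change these for a real cluster
    val conf = new SparkConf().setAppName("JsonToParquet").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read the JSON file into a DataFrame, then write it out as Parquet
    val df = sqlContext.read.json("path/to/json/file")
    df.write.parquet("path/to/parquet/file")

    sc.stop()
  }
}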

Spark 1.3.1

val df = sqlContext.jsonFile("path/to/json/file")
df.saveAsParquetFile("path/to/parquet/file")
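
If you want to sanity-check the result, a quick sketch using the Spark 1.3.x API (the paths are the same placeholders as above):

// Read the Parquet file back and inspect its schema and a few rows
val parquetDf = sqlContext.parquetFile("path/to/parquet/file")
parquetDf.printSchema()
parquetDf.show()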

Issue related to Windows and Spark 1.3.1

Saving a DataFrame as a Parquet file on Windows can throw a java.lang.NullPointerException, a known issue with Spark 1.3.1.

In that case, consider upgrading to a more recent Spark version.
