I can read a JSON file into a DataFrame in PySpark using:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.json("path to json file")

However, when I try to read a bz2-compressed CSV file into a DataFrame, I get an error. I am using:

spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.load("path to bz2 file")

Could you please help me correct this?

philantrovert
Leonius
  • What error did you get? Try to include that error in your question. – ruseel Jun 05 '18 at 01:49
  • I believe the error contains this clue: "Caused by: java.lang.RuntimeException: file:path/to/json.bz2 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [0, 108, 106, -40]" as by default `spark.read.load` expects "parquet" format. – Sergiy Sokolenko Mar 14 '21 at 11:10

1 Answer

The method spark.read.load() has an optional format parameter. When it is omitted, Spark uses the value of spark.sql.sources.default, which is 'parquet' unless it has been changed.
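
That default explains the error quoted in the comment on the question: the Parquet reader expects the magic bytes [80, 65, 82, 49] (ASCII "PAR1") at the tail of the file, and a bz2 archive does not end with them. Here is a minimal standard-library sketch of that check. The helper names and the sample file are made up for illustration; this is not how Spark implements it internally.

```python
import bz2
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # bytes [80, 65, 82, 49], expected at the tail of a Parquet file
BZIP2_MAGIC = b"BZh"     # bzip2 streams begin with these three bytes


def tail_bytes(path, n=4):
    """Return the last n bytes of the file (fewer if the file is shorter)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(size - n, 0))
        return f.read(n)


def looks_like_parquet(path):
    """True if the file ends with the Parquet magic number."""
    return tail_bytes(path) == PARQUET_MAGIC


def looks_like_bzip2(path):
    """True if the file starts with the bzip2 header."""
    with open(path, "rb") as f:
        return f.read(3) == BZIP2_MAGIC


# Hypothetical sample file, written only for this demonstration.
sample_path = os.path.join(tempfile.mkdtemp(), "data.json.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as f:
    f.write('{"id": 1, "name": "a"}\n')

print(looks_like_parquet(sample_path))
print(looks_like_bzip2(sample_path))
```

A file that fails the first check and passes the second is compressed text, so the reader needs to be told the real format (json or csv) rather than left on the default.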

So, for your code to work, pass the format explicitly:


df = spark.read.load("data.json.bz2", format="json")

Also, spark.read.json works directly with compressed JSON files, e.g.:


df = spark.read.json("data.json.bz2")
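
Since the question mentions a compressed CSV rather than JSON, the same idea should carry over to spark.read.csv. As far as I know, Spark picks the decompression codec for text-based sources from the file extension. The sketch below assumes the file has a header row; the helper names, the sample data and the header/inferSchema choices are my assumptions, not something taken from the question.

```python
import bz2
import csv
import os
import tempfile


def write_sample_csv_bz2(path):
    """Write a tiny bzip2-compressed CSV (hypothetical sample data)."""
    with bz2.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerow([1, "alice"])
        writer.writerow([2, "bob"])


def read_csv_bz2(spark, path):
    """Read a compressed CSV with an existing SparkSession.

    The format is stated explicitly (csv), so Spark does not fall back
    to its default Parquet reader.
    """
    return spark.read.csv(path, header=True, inferSchema=True)


sample_csv = os.path.join(tempfile.mkdtemp(), "data.csv.bz2")
write_sample_csv_bz2(sample_csv)

# Usage, assuming PySpark is installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("GetDetails").getOrCreate()
#   df = read_csv_bz2(spark, sample_csv)
#   df.show()
```

The equivalent call through the generic loader would be spark.read.load(path, format="csv", header=True).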

Sergiy Sokolenko