I have a directory of folders, and each folder contains a compressed JSON file (.gz). Currently I am reading them like this:
val df = sqlContext.jsonFile("s3://testData/*/*/*")
df.show()
For example:
testData/May/01/00/File.json.gz
Each compressed file is about 11 to 17 GB.
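One detail likely relevant to the slowness: gzip is not a splittable codec, so Spark has to hand each 11-17 GB file to a single task and decompress it sequentially from the first byte. A minimal sketch with `java.util.zip` (plain Scala, no Spark, in-memory data only) illustrating that gzip data can only be consumed as one forward stream:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Compress a string to gzip bytes in memory.
def gzipBytes(s: String): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz  = new GZIPOutputStream(bos)
  gz.write(s.getBytes("UTF-8"))
  gz.close()
  bos.toByteArray
}

// Gzip offers no random access: decompression must start at byte 0
// and stream forward. This is why Spark gives a whole .gz file to a
// single task instead of splitting it across the cluster.
def gunzip(bytes: Array[Byte]): String = {
  val in  = new GZIPInputStream(new ByteArrayInputStream(bytes))
  val out = new ByteArrayOutputStream()
  val buf = new Array[Byte](4096)
  var n = in.read(buf)
  while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  out.toString("UTF-8")
}

val record = """{"id":1,"payload":"example"}"""
val packed = gzipBytes(record)
// Reading from the start works; there is no API to begin decompressing
// mid-stream, so a 17 GB .gz file is one long read for one executor core.
println(gunzip(packed) == record) // prints "true"
```

With 19 core nodes but only as many .gz files as there are hour folders, much of the cluster can sit idle during the read.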
I have:
- Master: 1 c3.4xlarge
- Core: 19 c3.4xlarge
- Spark 1.5.2
- emr-4.2.0
The compressed files contain multiple JSON objects per file. This process takes a huge amount of time just to read the data (just the two statements above). Is there a faster way to do this? The schema is a little complex as well. I plan to write some queries to analyze the data set, but I am worried about the time it takes to read the data from S3.
The maximum load can be up to 10 TB. I plan to use caching to process the queries later.
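Part of the read cost may be schema inference: without an explicit schema, Spark first scans the JSON once just to infer it. A hedged sketch, assuming the `sqlContext` available in `spark-shell` on Spark 1.5 and hypothetical field names (replace them with the real fields), of supplying the schema up front and then caching for the later queries:

```scala
import org.apache.spark.sql.types._

// Hypothetical placeholder schema; substitute the actual JSON fields.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("event", StringType),
  StructField("ts", StringType)
))

// read.json is the non-deprecated form of jsonFile in Spark 1.5;
// passing the schema avoids the separate inference scan over S3.
val df = sqlContext.read.schema(schema).json("s3://testData/*/*/*")

// cache() marks the DataFrame for in-memory reuse; it is materialized
// on the first action (the show() below), so S3 is read only once.
df.cache()
df.show()
```

This does not remove the per-file decompression bottleneck, but it cuts the number of full passes over the compressed data.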