I have a directory of folders, and each folder contains a compressed JSON file (.gz). Currently I am reading them like this:
val df = sqlContext.jsonFile("s3://testData/*/*/*")
df.show()
For example:
testData/May/01/00/File.json.gz
Each compressed file is about 11 to 17 GB.
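One detail likely relevant to the slowness: gzip is not a splittable codec, so Spark has to hand each 11-17 GB file to a single task and decompress it sequentially from the first byte. A minimal sketch with `java.util.zip` (plain Scala, no Spark, in-memory data only) illustrating that gzip data can only be consumed as one forward stream:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Compress a string to gzip bytes in memory.
def gzipBytes(s: String): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz  = new GZIPOutputStream(bos)
  gz.write(s.getBytes("UTF-8"))
  gz.close()
  bos.toByteArray
}

// Gzip offers no random access: decompression must start at byte 0
// and stream forward. This is why Spark gives a whole .gz file to a
// single task instead of splitting it across the cluster.
def gunzip(bytes: Array[Byte]): String = {
  val in  = new GZIPInputStream(new ByteArrayInputStream(bytes))
  val out = new ByteArrayOutputStream()
  val buf = new Array[Byte](4096)
  var n = in.read(buf)
  while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  out.toString("UTF-8")
}

val record = """{"id":1,"payload":"example"}"""
val packed = gzipBytes(record)
// Reading from the start works; there is no API to begin decompressing
// mid-stream, so a 17 GB .gz file is one long read for one executor core.
println(gunzip(packed) == record) // prints "true"
```

With 19 core nodes but only as many .gz files as there are hour folders, much of the cluster can sit idle during the read.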
I have:
- Master: 1 c3.4xlarge
- Core: 19 c3.4xlarge
- Spark 1.5.2
- emr-4.2.0
The compressed files contain multiple JSON objects per file. This process takes a huge amount of time just to read the data (just the two statements above). Is there a faster way to do this? The schema is a little complex as well. I plan to write some queries to analyze the data set, but I am worried about the time it takes to read the data from S3.
The maximum load can be up to 10 TB. I plan to use caching to process the queries later.
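Part of the read cost may be schema inference: without an explicit schema, Spark first scans the JSON once just to infer it. A hedged sketch, assuming the `sqlContext` available in `spark-shell` on Spark 1.5 and hypothetical field names (replace them with the real fields), of supplying the schema up front and then caching for the later queries:

```scala
import org.apache.spark.sql.types._

// Hypothetical placeholder schema; substitute the actual JSON fields.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("event", StringType),
  StructField("ts", StringType)
))

// read.json is the non-deprecated form of jsonFile in Spark 1.5;
// passing the schema avoids the separate inference scan over S3.
val df = sqlContext.read.schema(schema).json("s3://testData/*/*/*")

// cache() marks the DataFrame for in-memory reuse; it is materialized
// on the first action (the show() below), so S3 is read only once.
df.cache()
df.show()
```

This does not remove the per-file decompression bottleneck, but it cuts the number of full passes over the compressed data.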