
I am new to this concept and still learning. I have 10 TB of JSON files in AWS S3 and 4 instances (m3.xlarge) in AWS EC2 (1 master, 3 workers). I am currently using Spark with Python on Apache Zeppelin.

I am reading the files with the following command:

hcData=sqlContext.read.option("inferSchema","true").json(path)

In the Zeppelin interpreter settings:

master = yarn-client
spark.driver.memory = 10g
spark.executor.memory = 10g
spark.cores.max = 4

It takes approximately 1 minute to read 1 GB. What more can I do to read big data more efficiently?

  • Should I do more on the coding side?
  • Should I add more instances?
  • Should I use another notebook platform?

Thank you.

Beril Boga

3 Answers


For performance issues, the best approach is to find out where the bottleneck is, or at least to narrow down where it could be.

Since 1 minute to read 1 GB is pretty slow, I would try the following steps:

  • Explicitly specify the schema instead of using inferSchema.
  • Use Spark 2.0 instead of 1.6.
  • Check the connection between S3 and EC2, in case something is misconfigured.
  • Use a different file format, such as Parquet, instead of JSON.
  • Increase the executor memory and decrease the driver memory.
  • Use Scala instead of Python, although in this case it is the least likely issue.
Rockie Yang
  • Thank you so much. This was a very explanatory answer for me. So, for 10 TB of data, 3 workers and 1 master (each m3.xlarge) should be enough, right? – Beril Boga Nov 10 '16 at 05:05
  • That actually depends on what you want to do. For simple statistics it should be OK, with some help from intermediate aggregation. For intensive machine learning, it might not. – Rockie Yang Nov 10 '16 at 07:50

I gave a talk on this topic back in October: Spark and Object Stores.

Essentially: use Parquet/ORC, but tune the settings for efficient reads. Once it ships, grab Spark 2.0.x built against Hadoop 2.8 for the speedup work we've done, especially around ORC and Parquet. We have also added lots of metrics, though they are not yet all pulled back into the Spark UI.

Schema inference can be slow if it has to work through the entire dataset (CSV inference does; I don't know about JSON). I'd recommend doing it once, grabbing the schema details, and then declaring it explicitly as the schema next time around.

stevel

You can persist the data in Parquet format after the JSON read:

hcData = sqlContext.read.option("inferSchema", "true").json(path)
hcData.write.parquet("hcDataFile.parquet")
hcDataDF = spark.read.parquet("hcDataFile.parquet")

# Create a temporary view (Spark 2.0) or use registerTempTable (Spark 1.6)
# and use SQL for further logic:

hcDataDF.createOrReplaceTempView("T_hcDataDF")

# This is a manual way of doing RDD checkpointing (not supported for
# DataFrames); it shortens the RDD lineage, which improves performance.

For execution, use dynamic resource allocation with your spark-submit command.

Make sure the following are enabled in your cluster; otherwise you can pass them to the spark-submit command as --conf parameters:

  • spark.dynamicAllocation.enabled=true
  • spark.dynamicAllocation.initialExecutors=5
  • spark.dynamicAllocation.minExecutors=5
  • spark.shuffle.service.enabled=true
  • yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle
  • yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
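The two yarn.nodemanager.* entries above belong in yarn-site.xml on each NodeManager rather than in Spark's own configuration; a sketch of how they would appear there:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

After editing yarn-site.xml, the NodeManagers need a restart to pick up the external shuffle service.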

Spark-submit command:

 ./bin/spark-submit --class package.hcDataclass \
 --master yarn \
 --deploy-mode cluster \
 --driver-memory 1G \
 --executor-memory 5G \
 hcData*.jar

With dynamic resource allocation we don't need to specify the number of executors; the job automatically gets resources based on cluster capacity.

Arvind Kumar