
Setup: latest AWS EMR release (5.29), Spark, 1 master and 1 core node.

Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3. Step 2: Using PySpark, iterate over the keys in a loop and do the following:

`spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)`

It took me x number of minutes.
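For reference, step 2 amounts to roughly the sketch below; `s3fileKeysList` (the key list from step 1), the paths, and the limit value are stand-ins, not the real values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3select-loop").getOrCreate()

# Placeholders: the key list from step 1 and an arbitrarily large row limit.
s3fileKeysList = [
    "s3://my-bucket/path/to/my/file1.csv",
    "s3://my-bucket/path/to/my/file2.csv",
]
superhighvalue = 1000000

# Each pass issues one read and one show(); the loop itself runs
# sequentially on the driver, one key at a time.
for key in s3fileKeysList:
    (spark.read
          .format("s3selectCSV")
          .load(key)
          .limit(superhighvalue)
          .show(superhighvalue))
```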

When I increased the cluster to 1 master and 6 core nodes, I am not seeing any difference in time. It appears to me that I am not using the added core nodes.
Everything else, config-wise, is the out-of-the-box default; I am not setting anything.

So, my question is: does cluster size matter for reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?

Jason B

2 Answers


A few things to keep in mind:

  1. Are you sure the number of executors has actually increased with the additional nodes? You can also specify them at submit time with `spark-submit --num-executors 6`. More nodes does not automatically mean more executors get spun up (see the sketch after this list).
  2. Next, what is the size of the CSV file? Around 1 MB? Then you will not see much difference. Make sure you have at least 3-4 GB.
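For point 1, a minimal sketch of requesting executors explicitly when building the session; the resource values are placeholders, not tuned recommendations:

```python
from pyspark.sql import SparkSession

# spark.executor.instances is the config equivalent of --num-executors;
# the numbers below are placeholders for illustration only.
spark = (SparkSession.builder
         .appName("s3select-read")
         .config("spark.executor.instances", "6")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

# Sanity check: print what the running session actually picked up.
print(spark.sparkContext.getConf().get("spark.executor.instances"))
```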
chendu
  • Yes, I did play with the **--num-executors** option in my **spark-submit** command. It was 1 (when there was 1 core node) and it did go up to 4 when I created a new cluster with more than one core node.
    The CSV files range from 2 MB to 100 MB.
    – Jason B Feb 05 '20 at 00:17
  • I tried the following and I think my code is not taking advantage of Spark parallelization. Can we do this? I tried it and it finishes fast; however, when I try to load and show the result, nothing shows up in the Spark log.
    `# s3fileKeysList holds a list of S3 file keys; sc = sparkSession.sparkContext; fileKeyListParallelized = sc.parallelize(s3fileKeysList); allData = fileKeyListParallelized.map(lambda file: spark.read.format("s3selectCSV").load("s3://path/to/my/" + file))`
    – Jason B Feb 05 '20 at 03:45
  • Can you try this: `val sqlContext = new SQLContext(sc); val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load("s3a://bucket/prefix/foldername/")` https://github.com/databricks/spark-csv (a PySpark equivalent with the built-in CSV reader is sketched after this comment thread) – chendu Feb 06 '20 at 05:56
  • I am not using Databricks. Anyhow, I can conclude that, generally speaking, cluster size does matter for reading from S3. While debugging this further, the key for me was **parallelize**; I could see the time go down. – Jason B Feb 07 '20 at 23:55
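A PySpark equivalent of the suggestion in the comment above, using Spark's built-in CSV reader instead of the databricks package; bucket and prefix are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-prefix-read").getOrCreate()

# Pointing the built-in CSV reader at the whole prefix lets Spark split the
# files into partitions and spread them across all executors, instead of
# reading one key at a time on the driver.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/prefix/foldername/"))

print(df.count())
```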

Yes, size does matter. For my use case, parallelizing the key list with `sc.parallelize(s3fileKeysList)` turned out to be the key.
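A minimal sketch of one way to distribute per-key S3 Select reads with `sc.parallelize`, assuming boto3 is available on the executors; the bucket name, query, and CSV settings are assumptions, not the exact code used here:

```python
import csv
import io

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-s3select").getOrCreate()
sc = spark.sparkContext

# Assumptions: "my-bucket" and the keys below are placeholders, and every
# object is a CSV file with a header row.
s3fileKeysList = ["path/to/my/file1.csv", "path/to/my/file2.csv"]

def select_rows(keys):
    """Run S3 Select for each key on the executor and yield CSV rows."""
    s3 = boto3.client("s3")  # one client per partition, created on the executor
    for key in keys:
        resp = s3.select_object_content(
            Bucket="my-bucket",
            Key=key,
            Expression="SELECT * FROM s3object s",
            ExpressionType="SQL",
            InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
            OutputSerialization={"CSV": {}},
        )
        buf = io.StringIO()
        for event in resp["Payload"]:
            if "Records" in event:
                buf.write(event["Records"]["Payload"].decode("utf-8"))
        buf.seek(0)
        for row in csv.reader(buf):
            yield row

# Distribute the keys so each executor pulls its own share from S3, rather
# than looping over spark.read on the driver.
rows = (sc.parallelize(s3fileKeysList, len(s3fileKeysList))
          .mapPartitions(select_rows))
print(rows.count())
```

With this shape, each executor talks to S3 directly, which is what lets the extra core nodes cut the wall-clock time.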

Jason B