
I am a Spark newbie. I have a simple PySpark script: it reads a JSON file, flattens it, and writes it to an S3 location as a compressed Parquet file.

The read and transformation steps run very fast and use the 50 executors I set in the conf. But the write stage takes a long time and writes only one large file (480 MB).
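Roughly, the script does something like this (the paths, column names, and flattening step here are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatten_json").getOrCreate()

df = spark.read.json("s3://my-bucket/input/")                    # placeholder input path
flat = df.select("id", "details.*")                              # placeholder flattening step
flat.write.parquet("s3://my-bucket/output/", mode="overwrite")   # ends up as one ~480MB file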

How is the number of output files decided? Can the write operation be sped up somehow?

Thanks, Ram.


2 Answers


The number of files output is equal to the number of partitions of the RDD being saved. In the sample below, the RDD is repartitioned to control the number of output files.

Try:

repartition(numPartitions) - Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")

The number of files output is the same as the number of partitions of the RDD.

$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r--   1 cloudera cloudera          0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r--   1 cloudera cloudera    1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r--   1 cloudera cloudera    1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001

Also check this: coalesce(numPartitions)
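Since your script uses the DataFrame API and writes Parquet to S3, the same idea applies there; a rough sketch, assuming a SparkSession named spark and placeholder S3 paths:

df = spark.read.json("s3://my-bucket/input/")                 # placeholder path
df.repartition(8).write.parquet("s3://my-bucket/output/")     # writes 8 part files
# or, when only reducing the partition count, avoid a full shuffle:
df.coalesce(8).write.parquet("s3://my-bucket/output/")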

source-1 | source-2


Update:

The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.

... but this is only the minimum number of partitions, so the exact count is not guaranteed.

So if you want to partition on read, you should use this:

dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
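For comparison, passing the optional second argument to textFile looks like this (a sketch; as discussed above, the requested count is only a lower bound):

dataRDD = sc.textFile("/user/cloudera/sqoop_import/orders", 2)   # 2 is only a minimum
print(dataRDD.getNumPartitions())                                # may print 2 or more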
Ronak Patel
  • Thanks! When should the repartitioning happen? Is it possible to partition the RDD during read? Or does it have to be a separate step? – Ram Dec 28 '16 at 21:24
  • @Ram - see updated answer - if my efforts helped to solve your problem please accept my answer as accepted answer (click on correct sign next to up/down arrows above, also click on up arrow) cheers :) – Ronak Patel Dec 29 '16 at 01:08
  • It should be mentioned that in the case of reducing the number of partitions, one should prefer `coalesce` over `repartition`, as it avoids a full shuffle. This is because Spark knows it can keep data on the desired number of partitions, only moving data off extra nodes. – user4601931 Dec 29 '16 at 01:35
  • @dmdmdmdmdmd - I tried this before, but it defines the minimum number of partitions, so in your example it will return 2 OR more partitions. I tried this and got 12 output files instead of 2.... – Ronak Patel Dec 29 '16 at 01:42
  • also - [coalesce(numPartitions)](http://spark.apache.org/docs/latest/programming-guide.html#transformations) - Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. – Ronak Patel Dec 29 '16 at 01:47
  • Sorry, I deleted my comment immediately, as it was wrong. You are correct that passing a `minPartitions` argument to, for instance, `sc.textFile` doesn't guarantee the number of partitions. It's passed to Hadoop's [`InputFormat.getSplits`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf%2C%20int)), which is merely a suggestion. – user4601931 Dec 29 '16 at 01:49

There are two different things to consider:

  1. HDFS block size: The HDFS block size is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, additional blocks are allocated for the rest of the file's data. This is handled internally and is not something you see; the whole process is sequential.

  2. Partitions: When Spark comes into the picture, so does parallelism. If you do not manually set the number of partitions, it defaults to one partition per HDFS block of the input. If you want to control the number of output files, you can use the repartition(n) API, where n is the number of partitions. These partitions are visible in HDFS when you browse the output directory.

Also, to increase performance, you can pass settings such as the number of executors, executor memory, and cores per executor to spark-submit / pyspark / spark-shell. Write performance also depends heavily on the file format and compression codec used.
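For illustration only (the flag values, script name, and path below are made up, not recommendations):

$ spark-submit --num-executors 50 --executor-memory 4g --executor-cores 4 flatten_job.py

and, inside the script, you could pick the codec explicitly when writing:

df.repartition(8).write.option("compression", "snappy").parquet("s3://my-bucket/output/")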

Thanks for reading.

Ankit Bhardwaj