
I have a cluster on which I run wholeTextFiles, which should pull in about a million text files summing to roughly 10 GB in total. I have one NameNode and two DataNodes with 30 GB of RAM and 4 cores each. The data is stored in HDFS.

I don't set any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never had to optimize a job before.

EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.

EDIT 2: benchmark assessment

So I tried repartition after wholeTextFiles, but the problem is the same: the initial read still uses the predefined number of partitions, so there is no performance improvement. Once the data is loaded, the cluster performs really well. I get the following warning when dealing with the data (for 200k files), on the wholeTextFiles call:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Could that be a reason for the bad performance? How do I work around it?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s.

It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to find the limit; a rough sketch of what I'm trying follows below.
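For reference, here's a rough sketch of the two things I've tried (the path and partition counts are just illustrative, not my real ones):

```scala
// Illustrative path; the real dataset is ~1M small files in HDFS.
val path = "hdfs:///data/small-files/*"

// Attempt 1: repartition after the read. The initial read still runs with the
// default number of splits, so this does not speed up the read itself.
val afterTheFact = sc.wholeTextFiles(path).repartition(32)

// Attempt 2: pass minPartitions to wholeTextFiles directly, so the read itself
// is spread over more, smaller tasks.
val upFront = sc.wholeTextFiles(path, 32)
```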

Stephane Maarek
  • 5 hours sounds high. Have you tried with a smaller subset, say 10K or 100K files, before you go to 1 million? Second, *if* you don't need (filename, content) pairs, you could zip all the data and read it using .textFile. After you have read the data, try calling `repartition(numPartitions)` on the RDD. You can experiment with `numPartitions` values of 8, 16, 32, etc. and see if it makes a difference. You can look at the implementation here: https://github.com/apache/spark/blob/e200ac8e53a533d64a79c18561b557ea445f1cc9/core/src/main/scala/org/apache/spark/SparkContext.scala#L583 – Soumya Simanta Jan 17 '15 at 22:01
  • I have tried on 200k files and it takes about an hour, so the estimate sounds linear... I am using wholeTextFiles because I then parse each file to convert it to XML. I can't use textFile because it reads line by line, and then I can't parse anymore... unless I am wrong? – Stephane Maarek Jan 18 '15 at 16:59
  • Did you try `repartition`? The reason I asked you to try `textFile` was to see whether the read was slow because of the number of files or because of the implementation of `wholeTextFiles`. – Soumya Simanta Jan 18 '15 at 19:33
  • First, set proper execution parameters instead of the defaults. I'd recommend `--num-executors 4 --executor-memory 12g --executor-cores 4`; that would improve your level of parallelism. Second, it is really bad to store the data this way on HDFS, and the first thing you should do after sc.wholeTextFiles is save it out to a single compressed SequenceFile with block compression and a Snappy/gzip codec. The bottlenecks in your computation are the number of threads you start and the number of separate files you read (which loads the NameNode). – 0x0FFF Jan 19 '15 at 15:46
  • @0x0FFF, how do you know I should set num-executors to 4 if I only have two DataNodes? (Still starting out, so probably a newbie question.) I'll try with your parameters and let you know. It seems that increasing the number of partitions to 32 in wholeTextFiles really helps use all the available resources (8 tasks simultaneously instead of 2) and reduces each task's size. How do I save to a SequenceFile with Snappy? Thanks for your suggestions and I'll let you know how this goes. – Stephane Maarek Jan 19 '15 at 16:58
  • Here you can find an example of how to save to a compressed SequenceFile: http://0x0fff.com/spark-hdfs-integration/. About the 4: it is just an assumption; in the configuration I provided you would have 4 JVM processes with a 12 GB heap each, and each of them would utilize 4 cores (running 4 Spark tasks in parallel), giving you 16 parallel reader threads. – 0x0FFF Jan 19 '15 at 17:04
  • @0x0FFF, I have successfully followed your steps. I take the data with wholeTextFiles (in parts, 200k at a time) and then save it into 32 partitions in HDFS. Loading it back from HDFS with only 32 partitions is blazing fast. I think the main bottleneck is requesting the HDFS files: there are too many, so the system takes a long time to find them. – Stephane Maarek Jan 20 '15 at 12:45

1 Answer


To summarize my recommendations from the comments:

  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores all metadata in memory, so the number of files and blocks you can have is limited (~100M blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really significant.
  2. Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512 MB of RAM (--executor-memory), giving you only 2 threads with 512 MB of RAM each, which is really small for real-world tasks.

So my recommendation is:

  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed SequenceFile (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration. A rough sketch of this step follows below.
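A minimal sketch of step 2, assuming Spark 1.x with the SparkContext implicits imported; the paths and the partition count of 32 are illustrative, and the Snappy codec is assumed to be available on the cluster:

```scala
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.SparkContext._

// Launched with, e.g.: spark-submit --num-executors 4 --executor-memory 12g --executor-cores 4 ...

// Read the small files as (filename, content) pairs; the second argument is a
// hint for the minimum number of partitions, so the read is split into more tasks.
val files = sc.wholeTextFiles("hdfs:///data/small-files/*", 32)

// Pack everything into a block-compressed SequenceFile so that later jobs read a
// few large files instead of a million small ones.
files.saveAsSequenceFile("hdfs:///data/packed-seqfile", Some(classOf[SnappyCodec]))

// Subsequent reads go against the packed copy, which is much faster:
val packed = sc.sequenceFile[String, String]("hdfs:///data/packed-seqfile")
```

This matches what was reported in the comments: once the data is packed into 32 partitions, loading it back from HDFS is fast, since the NameNode is queried for a handful of files instead of a million.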
0x0FFF
  • Great summary. Last question: if the files are initially compressed into a gz archive containing many small files, is there a way for me to read from that gz archive, uncompress it, and wholeTextFiles it directly in memory? – Stephane Maarek Jan 20 '15 at 13:51
  • gzip does not allow compressing many files into a single archive; a single .gz archive is a single file, and gzip-compressed files are decompressed automatically. But if it is a tar.gz archive, I'm afraid you have to write your own InputFormat. – 0x0FFF Jan 20 '15 at 14:25
  • Yes, it's .tar.gz. If you have a link on how to write my own InputFormat I'll take it, otherwise I'll look around. Many thanks for your help! – Stephane Maarek Jan 20 '15 at 14:31
  • Start with this: http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop - reading the whole file in a single shot. This way you would have the full file in a memory buffer, after which you would be able to use Java libraries to gunzip the buffer and untar its contents (see the sketch after these comments). – 0x0FFF Jan 20 '15 at 14:40
  • However, @0x0FFF, I don't think it is possible for executors to share cores, so I don't think allocating `4` cores per executor with `4` executors (`16` cores total) will work, since there are only `8` cores in total. – makansij Dec 13 '15 at 19:40
  • It will; this is simply called "overcommitting". Usually the best CPU throughput is achieved when overcommitting CPU resources by ~2x. – 0x0FFF Dec 13 '15 at 19:59
  • @0x0FFF I tried to give it more cores: my cluster has 4 cores and I requested 12, but I got the error below. org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested resource type=[vcores] < 0 or greater than maximum allowed allocation. Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation= – Pirate Feb 21 '20 at 05:53
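As referenced in the tar.gz discussion in the comments above, here is a rough sketch of the in-memory gunzip/untar approach, using sc.binaryFiles rather than a custom InputFormat. The path is hypothetical and Apache commons-compress is assumed to be on the classpath:

```scala
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical location of the .tar.gz archives on HDFS.
val archives = sc.binaryFiles("hdfs:///data/archives/*.tar.gz")

// Each archive is pulled into memory whole, gunzipped, and untarred on the executor,
// yielding (entryName, content) pairs much like wholeTextFiles would.
val smallFiles = archives.flatMap { case (archivePath, stream) =>
  val tar = new TarArchiveInputStream(
    new GZIPInputStream(new ByteArrayInputStream(stream.toArray())))
  val entries = ArrayBuffer[(String, String)]()
  var entry = tar.getNextTarEntry
  while (entry != null) {
    if (!entry.isDirectory) {
      val buf = new Array[Byte](entry.getSize.toInt)
      var read = 0
      while (read < buf.length) {            // tar.read may return fewer bytes than requested
        val n = tar.read(buf, read, buf.length - read)
        if (n < 0) throw new java.io.IOException(s"Unexpected end of entry ${entry.getName}")
        read += n
      }
      entries += ((entry.getName, new String(buf, "UTF-8")))
    }
    entry = tar.getNextTarEntry
  }
  entries
}
```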