
Currently I use AWS EMR as the cluster. For the library, I use Cascading.

The input data is stored in AWS S3, in a single directory. The directory contains many files, each about 100 MB (uncompressed plain text), and they can easily reach 100 files per day. The filename of each file contains a date. At the end of the day, I process all the files produced on that date.

Currently my Hadoop application works like this (a minimal sketch follows the list):

  • Use the S3 folder as the input tap using GlobHfs.
  • The GlobHfs has a custom filter that matches on the filename and only accepts today's files.
  • Process only the filtered files and set the output tap to S3.
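
For reference, a minimal sketch of that flow could look like the following. It assumes Cascading 2.x's Hadoop API (in particular the GlobHfs constructor that takes a Hadoop PathFilter); the bucket and path names are hypothetical and the actual processing pipe is omitted.

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.GlobHfs;
    import cascading.tap.hadoop.Hfs;

    public class DailyJob {
        public static void main(String[] args) {
            final String today = new SimpleDateFormat("yyyy-MM-dd").format(new Date());

            // custom filter: only accept files whose name contains today's date
            PathFilter todayOnly = new PathFilter() {
                @Override
                public boolean accept(Path path) {
                    return path.getName().contains(today);
                }
            };

            // input tap over the whole S3 folder, narrowed down by the filter
            Tap source = new GlobHfs(new TextLine(), "s3n://my-bucket/input/*", todayOnly);

            // output tap back to S3, one output folder per day
            Tap sink = new Hfs(new TextLine(), "s3n://my-bucket/output/" + today, SinkMode.REPLACE);

            // a bare pipe simply copies the input; the real processing goes here
            Pipe pipe = new Pipe("daily");

            Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
            flow.complete();
        }
    }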

My questions:

  • Should I use compression? If so, what compression type should I use? I read that with .gz compression only one mapper can process each file; in my case, where the folder contains many files, is that still relevant? Should I use LZO for each file?
  • Should I store bigger files, or is the current format (many smaller files) good enough?

2 Answers


Compression will help in reducing the network flow of data. LZO compression is more suitable for MR jobs. But as your files are stored in S3 instead of HDFS, each file will be processed by a single mapper irrespective of the compression used. As far as I know, block size does not apply in the case of S3.
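
If you do go with LZO, one way to sketch turning it on from a Cascading job is to pass the standard Hadoop compression properties to the flow connector. This assumes the old mapred.* property names and the hadoop-lzo codec classes, which must be available on the cluster:

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;

    public class LzoFlowConnectorFactory {
        public static FlowConnector create() {
            Properties props = new Properties();
            // compress intermediate map output shuffled between mappers and reducers
            props.setProperty("mapred.compress.map.output", "true");
            props.setProperty("mapred.map.output.compression.codec",
                "com.hadoop.compression.lzo.LzoCodec");
            // compress the final output written by the sink tap
            props.setProperty("mapred.output.compress", "true");
            props.setProperty("mapred.output.compression.codec",
                "com.hadoop.compression.lzo.LzopCodec");
            return new HadoopFlowConnector(props);
        }
    }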

A suggestion here would be to create keys under your bucket where each key corresponds to a date; this will speed up input filtering, e.g. one prefix per day such as s3n://bucket/yyyy-MM-dd/.
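
With a layout like that, the job can point a plain Hfs tap straight at the day's prefix instead of globbing and filtering the whole folder, for example (hypothetical path, same Cascading types as in the question):

    // hypothetical layout: one prefix per day, so no glob or filename filter is needed
    String today = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
    Tap source = new Hfs(new TextLine(), "s3n://my-bucket/input/" + today + "/");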

The type of node used for the EMR cluster can be another deciding factor for file size. If the nodes are powerful, like r3.8xlarge, then the input files can be larger. On the other hand, if it is m1.medium, the files have to be smaller to make proper use of your cluster.


Note that listing the files in S3 for a glob may take a long time with s3n://.

You should experiment with s3distcp, which can copy, merge, compress, etc. the data, and which does the listing a lot faster.
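
For example, an S3DistCp step along these lines can merge one day's many small files into fewer, larger LZO-compressed files before the job runs; the jar location, bucket names, and date below are placeholders:

    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src s3n://my-bucket/input/ \
      --srcPattern '.*2015-06-01.*' \
      --dest hdfs:///input/2015-06-01/ \
      --groupBy '.*(2015-06-01).*' \
      --targetSize 512 \
      --outputCodec lzo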
