
Currently I use AWS EMR as the cluster. For the library, I use Cascading.

The input data is stored in AWS S3, in a single directory. The directory contains many files, each about 100 MB (uncompressed plain text), and they can easily reach 100 files per day. The filename of each file contains a date. At the end of the day, I process all the files produced on that date.

Currently my Hadoop application works like this (a minimal sketch follows the list):

  • Use the S3 folder as the input tap using GlobHfs.
  • The GlobHfs has a custom filter that matches on the filename and only accepts today's files.
  • Process only the filtered files and set the output tap to S3.
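
For reference, a minimal sketch of that flow could look like the following. It assumes Cascading 2.x's Hadoop API (in particular the GlobHfs constructor that takes a Hadoop PathFilter); the bucket and path names are hypothetical and the actual processing pipe is omitted.

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.GlobHfs;
    import cascading.tap.hadoop.Hfs;

    public class DailyJob {
        public static void main(String[] args) {
            final String today = new SimpleDateFormat("yyyy-MM-dd").format(new Date());

            // custom filter: only accept files whose name contains today's date
            PathFilter todayOnly = new PathFilter() {
                @Override
                public boolean accept(Path path) {
                    return path.getName().contains(today);
                }
            };

            // input tap over the whole S3 folder, narrowed down by the filter
            Tap source = new GlobHfs(new TextLine(), "s3n://my-bucket/input/*", todayOnly);

            // output tap back to S3, one output folder per day
            Tap sink = new Hfs(new TextLine(), "s3n://my-bucket/output/" + today, SinkMode.REPLACE);

            // a bare pipe simply copies the input; the real processing goes here
            Pipe pipe = new Pipe("daily");

            Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
            flow.complete();
        }
    }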

My questions:

  • Should I use compression? If so, what compression type should I use? I read that with .gz compression only one mapper can process each file; in my case, where the folder contains many files, is that still relevant? Should I use LZO for each file?
  • Should I store bigger files, or is the current format (many smaller files) good enough?

2 Answers


Compression will help in reducing the network flow of data. LZO compression is more suitable for MR jobs. But as your files are stored in S3 instead of HDFS, each file will be processed by a single mapper irrespective of the compression used. As far as I know, block size does not apply in the case of S3.
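
If you do go with LZO, one way to sketch turning it on from a Cascading job is to pass the standard Hadoop compression properties to the flow connector. This assumes the old mapred.* property names and the hadoop-lzo codec classes, which must be available on the cluster:

    import java.util.Properties;

    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;

    public class LzoFlowConnectorFactory {
        public static FlowConnector create() {
            Properties props = new Properties();
            // compress intermediate map output shuffled between mappers and reducers
            props.setProperty("mapred.compress.map.output", "true");
            props.setProperty("mapred.map.output.compression.codec",
                "com.hadoop.compression.lzo.LzoCodec");
            // compress the final output written by the sink tap
            props.setProperty("mapred.output.compress", "true");
            props.setProperty("mapred.output.compression.codec",
                "com.hadoop.compression.lzo.LzopCodec");
            return new HadoopFlowConnector(props);
        }
    }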

A suggestion here would be to create keys under your bucket where each key corresponds to a date; this will speed up input filtering, e.g. one prefix per day such as s3n://bucket/yyyy-MM-dd/.
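
With a layout like that, the job can point a plain Hfs tap straight at the day's prefix instead of globbing and filtering the whole folder, for example (hypothetical path, same Cascading types as in the question):

    // hypothetical layout: one prefix per day, so no glob or filename filter is needed
    String today = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
    Tap source = new Hfs(new TextLine(), "s3n://my-bucket/input/" + today + "/");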

The type of node used for the EMR cluster can be another deciding factor for file size. If the nodes are powerful, like r3.8xlarge, then the input files can be larger. On the other hand, if it is m1.medium, the files have to be smaller to make proper use of your cluster.


Note that listing the files in S3 for a glob may take a long time with s3n://.

You should experiment with s3distcp, which can copy, merge, compress, etc. the data, and which does the listing a lot faster.
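
For example, an S3DistCp step along these lines can merge one day's many small files into fewer, larger LZO-compressed files before the job runs; the jar location, bucket names, and date below are placeholders:

    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src s3n://my-bucket/input/ \
      --srcPattern '.*2015-06-01.*' \
      --dest hdfs:///input/2015-06-01/ \
      --groupBy '.*(2015-06-01).*' \
      --targetSize 512 \
      --outputCodec lzo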
