
I want to ask: what if I set the HDFS block size to 1 GB and upload a file that is almost 1 GB in size? Would MapReduce process it faster? I think that with a larger block size there will be fewer container requests to the ResourceManager (fewer map tasks) than with the default, so it should decrease the latency of initializing containers and also reduce network latency.
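
For example, this is roughly what I mean (just a sketch; the paths and file names are made up):

    # Upload a ~1 GB file with a 1 GB block size. On Hadoop 2.x the property
    # is dfs.blocksize (dfs.block.size is the deprecated alias).
    hdfs dfs -D dfs.blocksize=1073741824 -put /local/data/bigfile.dat /user/kenny/input/

    # Check how many blocks the file actually got
    hdfs fsck /user/kenny/input/bigfile.dat -files -blocks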

So, what do you all think?

Thanks

Kenny Basuki

2 Answers


There are a number of things that this impacts. Most obviously, a file will have fewer blocks if the block size is larger. This can potentially make it possible for a client to read/write more data without interacting with the Namenode, and it also reduces the size of the Namenode's metadata, reducing Namenode load (this can be an important consideration for extremely large file systems).

With fewer blocks, the file may potentially be stored on fewer nodes in total; this can reduce total throughput for parallel access, and make it more difficult for the MapReduce scheduler to schedule data-local tasks.

When using such a file as input for MapReduce (and not constraining the maximum split size to be smaller than the block size), it will reduce the number of tasks, which can decrease overhead. But having fewer, longer tasks also means you may not gain maximum parallelism (if there are fewer tasks than your cluster can run simultaneously), it increases the chance of stragglers, and if a task fails, more work needs to be redone. Increasing the amount of data processed per task can also cause additional read/write operations (for example, if a map task changes from having only one spill to having multiple and thus needing a merge at the end). See the sketch below for one way to keep parallelism up even with large blocks.
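
If larger blocks leave you with too few map tasks, one way to get the parallelism back (a sketch only, assuming your input format honors the standard split-size settings and your driver uses ToolRunner/GenericOptionsParser; the jar, class, and paths are examples) is to cap the split size at job submission time:

    # Cap splits at 128 MB even though the file's blocks are larger, so the
    # job still gets multiple map tasks per block.
    hadoop jar myjob.jar com.example.MyJob \
        -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
        /user/kenny/input /user/kenny/output

FileInputFormat computes the split size as max(minSplitSize, min(maxSplitSize, blockSize)), so lowering the maximum below the block size yields more, smaller splits.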

Usually, it depends on the input data. If you want to maximize throughput for a very large input file, using very large blocks (128 MB or even 256 MB) is best. For smaller files, using a smaller block size is better. Note that you can have files with different block sizes on the same file system by changing the dfs.block.size parameter when the file is written, e.g. when uploading using the command-line tools: "hdfs dfs -D dfs.block.size=xxxxxxx -put localpath dfspath" (the -D generic option has to come before the command's own arguments).
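
If you want to confirm what block size a particular file was actually written with, a quick check looks like this (command sketch; the file path is an example):

    # %o prints the block size, %b the file length, %n the file name
    hdfs dfs -stat "%o %b %n" /user/kenny/input/bigfile.dat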

Source: http://channel9.msdn.com/Forums/TechOff/Impact-of-changing-block-size-in-Hadoop-HDFS

Useful link to read:

Change block size of dfs file

How Mappers get assigned.

Sandeep Singh
  • Hello, I've tried setting the block size larger than the default 128 MB, including 256 MB, but the job runs slower than before. In my tests, the best block size is 128 MB. Can you explain why this is, and why Hadoop chose 128 MB as the default and not bigger? Thanks a lot. – Kenny Basuki May 23 '15 at 09:38
  • Mappers are assigned based on the number of input splits/blocks, so increasing the block size decreases the mapper count. Job performance depends on the number of mappers and reducers allocated to your job. A 128 MB or 256 MB block is ideal for very large data sets on a large cluster. For small datasets, 64 MB (the default size) is ideal. – Sandeep Singh May 23 '15 at 09:59

The answer above is right. You can't judge whether a Hadoop system is good or bad just by adjusting the block size.

But according to my tests with different block sizes in Hadoop, 256 MB is a good choice.

gwgyk
  • Hello, I've tried setting the block size larger than the default 128 MB, including 256 MB, but the job runs slower than before. In my tests, the best block size is 128 MB. Can you explain why this is, and why Hadoop chose 128 MB as the default and not bigger? Thanks a lot. – Kenny Basuki May 23 '15 at 09:35
  • I forgot to mention that my Hadoop cluster uses SSD disks, so 256 MB is a good choice for me. But as I said, a Hadoop developer has to find the suitable block size through testing. I don't know your cluster configuration, so I can't say why it is slower for you. – gwgyk May 23 '15 at 10:24