1

What is your recommendation about disks for Hadoop?

Do you recommend using SAS, or just attaching disks over SATA? Or maybe something else? What are the pros and cons of each option?

(The decision about disk size has already been made; there will be about 5-6 2TB disks in each server.)

wlk
  • 1,713
  • 3
  • 14
  • 19
  • There's no way we can make a recommendation based on the information provided. I would suggest buying server-grade disks, of a size and speed appropriate to your application. Only you can determine what "a size and speed appropriate to your application" means... – voretaq7 Dec 08 '12 at 03:52

4 Answers

3

Modern Hadoop installations typically go for several consumer-grade SATA drives per box.

Exactly how many disks per node depends a lot on what your application is. At Yahoo, for instance, they are mostly bound by disk size, so lots of disks per node make sense. I have seen stealth technology that can saturate a large number of drive channels, so multiple backplanes with lots of disks make sense there.

If you are just starting, I would recommend either 6 x 2TB SATA or 12 x 2TB SATA. There are some nice Supermicro boxes that give you four nodes in a single 2U chassis with 12 drives on the front, which is nice and compact, but having only 2 x 2TB drives per node can be kind of limiting. That same 2U form factor can also host one or two nodes with the same 12 drives on the faceplate. Since the chassis itself costs money, this can make a difference.
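
To put rough numbers on those disk counts, here is a back-of-envelope sizing sketch in Java. The ten-node cluster size and the 25% space reserve are assumptions for illustration; replication factor 3 is the HDFS default.

    // Back-of-envelope usable-capacity estimate for a small Hadoop cluster.
    public class CapacityEstimate {
        public static void main(String[] args) {
            int nodes = 10;           // hypothetical cluster size
            int disksPerNode = 6;     // 6 x 2TB per node, as discussed above
            double diskTb = 2.0;
            int replication = 3;      // HDFS default replication factor
            double reserve = 0.25;    // assumed reserve for temp/shuffle data and OS

            double rawTb = nodes * disksPerNode * diskTb;
            double usableTb = rawTb * (1.0 - reserve) / replication;
            System.out.printf("Raw: %.0f TB, usable HDFS capacity: ~%.0f TB%n",
                    rawTb, usableTb);
        }
    }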

Another consideration is that many data centers are limited by power per square foot. Power in a Hadoop cluster gets divided two ways: some goes to CPUs and memory, and a large portion goes to keeping the drives spinning. Since these limits are likely to keep you from filling a rack with super-compact four-node boxes, you might rather go ahead and get single-node boxes so that you can add drives later as you see fit.
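
As a hedged sketch of that power-budget reasoning (every wattage and the per-rack budget below are placeholder assumptions; substitute measured numbers from your own hardware):

    // Rough check of how many nodes fit under a per-rack power cap.
    public class RackPowerBudget {
        public static void main(String[] args) {
            double rackBudgetWatts = 6000;  // assumed usable power per rack
            double boardWatts = 250;        // assumed dual-socket board under load
            double wattsPerDrive = 8;       // assumed draw of a spinning 3.5" drive
            int drivesPerNode = 12;

            double perNodeWatts = boardWatts + wattsPerDrive * drivesPerNode;
            int nodesPerRack = (int) (rackBudgetWatts / perNodeWatts);
            System.out.printf("~%.0f W per node => about %d nodes per rack%n",
                    perNodeWatts, nodesPerRack);
        }
    }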

If you aren't limited by disk space, you should consider total network bandwidth. Having more NICs per drive is good here, so the quad-node boxes are nice.

In a similar vein, what are your memory requirements? 24 GB of RAM for a dual quad-core machine is pretty standard lately, but you might need more or be able to get away with less. Having a larger aggregate amount of memory across the same number of drives might be good for your application.

Ted Dunning
  • 306
  • 1
  • 6
1

Well, since you are using Hadoop, the redundancy is in the application, so you shouldn't need to think about storage redundancy on each node. This should of course be backed by good routines for bringing a node back online in the event of a storage failure.

I think 2 x SATA disks in RAID0 should do it, but I don't really know if you will gain anything performance-wise with Hadoop; it may only add complexity.
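
For reference, a minimal sketch of where that redundancy actually lives in Hadoop: the HDFS replication factor, which can be set cluster-wide or per file. The file path below is hypothetical, and this assumes a standard Hadoop client classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // HDFS block replication, not local RAID, provides the redundancy.
    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);   // cluster default is 3

            FileSystem fs = FileSystem.get(conf);
            // Raise replication for a hypothetical hot data set.
            fs.setReplication(new Path("/data/hot/important.log"), (short) 5);
            fs.close();
        }
    }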

tore-
  • 1,396
  • 2
  • 10
  • 18
1

In this situation the only performance-related concern I'd have is that SAS disks generally behave better in high-load scenarios, but only you know your anticipated load.

What I would say is that you want to pick enterprise-class disks whichever way you go. Hadoop can be quite intensive throughout a 24-hour period, and you want disks designed for 24/365 operation; many of the cheaper disks simply won't do this reliably.

WD's WD2003FYYS is highly regarded.

Chopper3
  • 101,299
  • 9
  • 108
  • 239
1

Design with failure in mind and Hadoop will impress. I run all my clusters with non-enterprise drives and have had no failures in my 24/7 operations. The cost savings well outweigh any potential failures; furthermore, most disks come with five-year warranties, so you just send them to get RMA'd and move on.

In my experience I usually end up upgrading drives before they die, but YMMV.

All DataNode disks should run ext2; do not use journaling or any RAID whatsoever... Hadoop is your RAID, through how you set replication levels.
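
A minimal sketch of the JBOD side of that advice, assuming one mount point per physical disk (the mount points are hypothetical; in practice this property lives in hdfs-site.xml rather than in client code):

    import org.apache.hadoop.conf.Configuration;

    // Point the DataNode at each disk's own mount point instead of a RAID volume.
    // The property is dfs.data.dir on 1.x-era Hadoop (dfs.datanode.data.dir later).
    public class DataDirExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("dfs.data.dir",
                    "/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data");
            System.out.println(conf.get("dfs.data.dir"));
        }
    }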

SysEngAtl
  • 176
  • 1