
It's easy to find well-regarded references stating that HDFS should not be stretched across data centers [1], while Kafka should be stretched [2].

What specific issues make HDFS ill-suited to being stretched?

I'm considering stretching HDFS across two DCs that are less than 50km apart, with an average latency of less than 1ms. I'm planning on running a soak test spanning a couple of weeks, with representative read and write workloads, but with volumes of a few hundred GB - orders of magnitude less than the cluster will store in a few years.

If the tests succeed, how much confidence does that give that a stretched HDFS cluster will hold up in production? Specifically, are issues caused by the relatively long inter-host latency likely to stay hidden at this scale and only surface at far larger volumes, e.g. a couple of hundred TB?

Finally, if the inter-DC latency spikes, e.g. to 10ms for a few minutes, what issues am I likely to encounter?
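To frame the latency question, here is a minimal sketch of the arithmetic I have in mind. It assumes a hypothetical operation that must wait for a fixed number of strictly sequential cross-DC round trips; the function names and the round-trip count are illustrative assumptions and are not taken from HDFS internals, which pipeline much of their traffic.

```python
def serial_ops_per_second(rtt_ms: float, round_trips_per_op: int = 1) -> float:
    """Upper bound on strictly sequential operations per second when each
    operation waits for `round_trips_per_op` cross-DC round trips.

    Hypothetical model: ignores pipelining, batching and processing time.
    """
    if rtt_ms <= 0 or round_trips_per_op <= 0:
        raise ValueError("rtt_ms and round_trips_per_op must be positive")
    return 1000.0 / (rtt_ms * round_trips_per_op)


def slowdown_factor(baseline_rtt_ms: float, spike_rtt_ms: float) -> float:
    """How much slower a purely latency-bound operation gets during a spike."""
    if baseline_rtt_ms <= 0 or spike_rtt_ms <= 0:
        raise ValueError("latencies must be positive")
    return spike_rtt_ms / baseline_rtt_ms


if __name__ == "__main__":
    # Figures from the question: roughly 1ms normally, 10ms during a spike.
    for rtt in (1.0, 10.0):
        print(f"RTT {rtt} ms: at most {serial_ops_per_second(rtt):.0f} serial ops/s")
    print(f"Slowdown during spike: {slowdown_factor(1.0, 10.0):.0f}x")
```

Under this model anything that is purely latency-bound slows down by the ratio of the two round-trip times; what I don't know is which HDFS operations, if any, actually behave this way.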

[1] Tom White: Hadoop: The Definitive Guide

[2] https://www.confluent.io/blog/design-and-deployment-considerations-for-deploying-apache-kafka-on-aws/

OneCricketeer
Paul Carey
  • See sections on MapReduce and data locality – OneCricketeer Jul 22 '17 at 00:38
  • Also, the Confluent blog only says availability zones within one region, which are geographically nearby in AWS, are recommended ... Stretching across a network to different regions is discouraged for you will be network bound – OneCricketeer Jul 22 '17 at 00:41
  • @cricket_007 availability zones are up to a few hundred km apart - significantly further than the 50km I'm considering. I had understood the objections to stretching HDFS were more fundamental than concerns regarding data locality. In any case, any MR jobs / Spark apps will typically run on a relatively small dataset of recently written data. – Paul Carey Jul 22 '17 at 00:56
  • Ah, I was under the impression that availability zones in regions were very inter-connected. In theory you could have 2 hdfs replicas on-site and 1 off, but it really just becomes the bottleneck from everything I've learned – OneCricketeer Jul 22 '17 at 06:02

0 Answers