
I have installed HDFS from Universe on my DC/OS cluster of 10 CoreOS machines (3 master nodes, 7 agent nodes). My HA HDFS config has 2 name nodes, 3 journal nodes, and 5 data nodes. Now, my question is: shouldn't HDFS be resilient to machine restarts? If I restart a machine where a data node is installed, the data node gets rebuilt as a mirror of the others (but only after restarting the HDFS service from the DC/OS UI). If I restart a machine hosting a journal node or a name node, the node is simply marked as lost and never rebuilt.

Andrea T. Bonanno

2 Answers


The problem turned out to be a buggy version of the Universe HDFS package for DC/OS. A completely new HDFS package for DC/OS will be released on Universe in the next few weeks.

https://dcos-community.slack.com/archives/data-services/p1485717889001709

https://dcos-community.slack.com/archives/data-services/p1485801481001734

Andrea T. Bonanno

A quick summary of the HDFS resiliency model for an HA deployment like yours:

  • The two NameNodes form an active/standby pair. If the machine hosting the active NameNode restarts, the system detects the failure and the standby takes over as the new active. Once the machine completes its restart, the NameNode process runs again and becomes the new standby. There is no downtime unless both NameNodes are down simultaneously. The data on the host (e.g. the fsimage metadata file) is typically preserved across restarts. If that is not the case in your environment, you'll need additional recovery steps to re-establish the standby, such as running the hdfs namenode -bootstrapStandby command.
  • The 3 JournalNodes form a quorum. If one of their machines restarts, the NameNode can continue writing its edit log transactions to the remaining 2 JournalNodes. Once the machine completes its restart, the JournalNode process runs again, catches up on transactions it may have missed, and the NameNode resumes writing to all 3. There is no downtime unless 2 or more JournalNodes are down simultaneously. If the data (e.g. the edits files) is not preserved across restarts, the restarted JournalNode catches up by copying from a running JournalNode.
  • DataNodes are mostly disposable. If a machine restarts, clients are rerouted to other running DataNodes for their reads and writes (assuming the typical replication factor of 3). Once the machine completes its restart, the DataNode process runs again and can resume serving read/write traffic. There is no downtime unless a mass simultaneous failure event (extremely unlikely, and probably correlated with bigger data center problems) takes down all the DataNodes hosting replicas of a particular block at once. If the data (the block file directory) is not preserved across restarts, the restarted process registers as a whole new DataNode coming online. If this causes cluster imbalance, it can be remedied by running the HDFS Balancer.
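The checks and recovery steps described above can be sketched as the following commands, to be run on a host with the HDFS client configured against your cluster. This is a sketch, not something specific to the DC/OS package: nn1 and nn2 are placeholder NameNode service IDs from a typical dfs.ha.namenodes setting in hdfs-site.xml, so substitute your own.

```shell
# Check which NameNode is active and which is standby
# (nn1/nn2 are placeholder service IDs from dfs.ha.namenodes.<nameservice>)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# If the standby lost its metadata (fsimage) across the restart,
# re-seed it from the active NameNode before starting it
hdfs namenode -bootstrapStandby

# Verify DataNode liveness and per-node capacity after the restart
hdfs dfsadmin -report

# If a rebuilt DataNode left the cluster unbalanced, move blocks until
# every node is within 10% of the mean utilization
hdfs balancer -threshold 10
```

These all require a live cluster, so they are meant as a reference for the recovery workflow rather than something to run as-is.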
Chris Nauroth
  • Thank you Chris, this is exactly how I expected it to work, but on DC/OS (where HDFS runs on Apache Mesos) only the Data Nodes are restarted after a machine restart, whereas the Journal Nodes and Name Nodes never get restarted. Mesos marks their tasks as lost and is not able to relaunch them. – Andrea T. Bonanno Feb 01 '17 at 11:19
  • Thank you for the further clarification. Unfortunately, I don't have any experience running HDFS on DC/OS or Mesos, so I can't provide further information on that. Hopefully your question will attract some DC/OS or Mesos expertise. – Chris Nauroth Feb 01 '17 at 17:17