I have been reading some papers on Hadoop and MapReduce. It seems that the current design enables Hadoop to tolerate failures like worker crashes, but doesn't provide much support for handling arbitrary faults (non-fail-silent ones, i.e. nodes that keep running but misbehave). Is this true? If so, does it mean we cannot always fully trust the correctness of a Hadoop job's output?
- What failures are you thinking of? – Thomas Jungblut Nov 14 '13 at 17:33
- Say, a node whose local state is corrupted and which sends arbitrary messages? – awesomeIT Nov 15 '13 at 10:29
- I think that is nothing the framework should care about, especially if you can buy things like ECC RAM. – Thomas Jungblut Nov 15 '13 at 10:31
- If we use ECC, we spend more money. If instead the framework itself could handle it (e.g. by having several machines work on the same task and taking the output by majority voting, as sketched after these comments), it might save that cost. Of course that approach is itself expensive, since it uses more workers. So it feels like it ends up being a trade-off over where to put the logic: in the framework or in the RAM. – awesomeIT Nov 15 '13 at 11:11
- I think buying new servers is much more expensive than buying ECC memory. Also, majority voting on output is a huge waste of resources. – Thomas Jungblut Nov 15 '13 at 11:19
- I agree. Yet there is a whole class of failures that, when they occur, still let the machine keep working: not just memory errors, but also network congestion, bad hard-disk sectors, or even bugs that produce correct results 99% of the time and fail in the remaining 1%. Wouldn't it be nice if the framework could detect such errors? Nowadays a big company may use a whole data centre to train a machine learning model, and that may take half a year. Are we confident that no non-fail-stop failures occurred at all during that time? If not, can we trust the correctness/precision of the final result? – awesomeIT Nov 15 '13 at 11:30
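For what it's worth, here is a minimal sketch of the majority-voting idea discussed in the comments, assuming you could somehow run the same task on several workers and collect their (comparable) outputs. The `MajorityVote` class below is purely hypothetical and is not part of Hadoop's API; it only illustrates the acceptance rule, not how the replicas would be scheduled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical helper: accept a task result only if a strict majority of
 * replicas agree on it. Hadoop does NOT do this; it merely re-runs tasks
 * whose workers crash or time out.
 */
public final class MajorityVote {

    /**
     * Returns the value produced by more than half of the replicas, or
     * Optional.empty() if no strict majority exists (a possible
     * non-fail-stop fault the caller should treat as a job error).
     */
    public static <T> Optional<T> ofReplicaOutputs(List<T> replicaOutputs) {
        Map<T, Integer> counts = new HashMap<>();
        for (T output : replicaOutputs) {
            counts.merge(output, 1, Integer::sum);
        }
        int quorum = replicaOutputs.size() / 2 + 1; // strict majority
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= quorum)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // Three replicas of the same (hypothetical) task; one is corrupted.
        List<String> outputs = List.of("sum=42", "sum=42", "sum=17");
        System.out.println(ofReplicaOutputs(outputs)); // Optional[sum=42]
    }
}
```

The cost objection raised above is visible here: tolerating f arbitrarily faulty workers per task requires running at least 2f+1 replicas of that task, which is why this kind of voting is rarely built into general-purpose frameworks.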