
In an Elastic MapReduce streaming job, what is going to happen if a mapper suddenly dies? Will the data that was already processed be replayed? If so, is there any option to disable that?

I am asking because I am using EMR to insert some data into a third-party database. Every mapper sends the incoming data over HTTP. In this case, if a mapper crashes I don't want the HTTP requests to be replayed; I need to continue from where it left off.
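To make the setup concrete, here is a minimal sketch of the kind of streaming mapper I mean; the endpoint URL and the assumption that each input line is one record are just placeholders:

    #!/usr/bin/env python
    # Hypothetical streaming mapper: every input record is pushed to a
    # third-party service over HTTP; nothing useful is emitted to stdout.
    import sys
    import urllib.request

    ENDPOINT = "http://third-party.example.com/ingest"  # placeholder URL

    for line in sys.stdin:
        record = line.strip()
        if not record:
            continue
        req = urllib.request.Request(
            ENDPOINT,
            data=record.encode("utf-8"),
            headers={"Content-Type": "text/plain"},
        )
        urllib.request.urlopen(req)  # POST is implied because a request body is set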

Vame
  • What do you mean by streaming? In the context of Hadoop, streaming is support for writing MR programs in any language without using the Java Hadoop bindings. Is this what you are looking for, or real-time processing of data? – Praveen Sripati Apr 29 '14 at 10:14
  • please see Sudarshan's answer and my comment below. – Vame Apr 29 '14 at 12:00

1 Answer


MR is a fault-tolerant framework. When a map task fails, the behavior is the same whether you use the streaming API or the Java API.

Once the job tracker is notified that the task has failed, it will try to reschedule the task. The temporary output generated by the failed task is deleted.

A more detailed discussion of how failures are handled in MR can be found here.

For your particular case, I think you need to query the external source in your setup() method to find out which records have already been processed, then use this information in your map() method to decide whether a particular record should be processed or not.
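Since this is a streaming job, a rough Python equivalent of that idea is to load the set of already-processed record keys once when the script starts (the setup() step) and skip them while reading the split (the map() step). The status URL, the key lookup, and the assumption that the first tab-separated field identifies a record are placeholders for whatever your external source actually supports:

    #!/usr/bin/env python
    # Hypothetical streaming mapper that skips records the third-party
    # system has already stored, so a replayed task does not resend them.
    import json
    import sys
    import urllib.request

    ENDPOINT = "http://third-party.example.com/ingest"  # placeholder
    KEYS_URL = "http://third-party.example.com/keys"     # placeholder

    def fetch_processed_keys():
        # setup() equivalent: ask the external source which keys it already has.
        with urllib.request.urlopen(KEYS_URL) as resp:
            return set(json.load(resp))

    processed = fetch_processed_keys()

    for line in sys.stdin:
        key = line.rstrip("\n").split("\t")[0]  # assumed: first field is the record key
        if key in processed:
            continue  # sent before the task failed; do not replay the HTTP request
        req = urllib.request.Request(
            ENDPOINT,
            data=line.encode("utf-8"),
            headers={"Content-Type": "text/plain"},
        )
        urllib.request.urlopen(req)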

Sudarshan
  • I don't care about the output. I just do some processing with the input, and I don't want it to be replayed. Is there a way to disable this fail-over functionality? I need my new mapper to start again from the point it left off. – Vame Apr 29 '14 at 12:00
  • I am not sure I get what you mean. Your processing failed midway, so the framework will discard any partial processing done by the failed task and start afresh; it cannot resume from the point where the failed task left off. – Sudarshan Apr 29 '14 at 12:38
  • What exactly I am using it for: each mapper hits a third-party app via HTTP using the data it gets from streaming. Let's say I am throwing data into multiple databases, using MapReduce to do it in a distributed manner. It helps me because it distributes the task to as many machines as I need. I don't know if MR is the right tool for my case, but I don't know of any alternatives. – Vame Apr 29 '14 at 14:21
  • So are you saying that while processing an input split, if half the records are processed and then the task fails, when the task is restarted by the MR framework you would expect it to process only the second (unprocessed) half of the file? – Sudarshan Apr 29 '14 at 17:00
  • I expect the whole batch to be replayed, but I want to continue only with the unprocessed part of it. – Vame Apr 30 '14 at 08:16
  • I think you need to query the external source in your setup() method to find out which records have already been processed, then use this information in your map() method to decide whether a particular record should be processed or not. Basically, this use case needs to be handled via application logic; nothing inherent in Hadoop can assist you here. – Sudarshan Apr 30 '14 at 09:00
  • Hmm, ok, this is what I also thought of. Too bad it cannot be done differently. Please add your comments to your answer and I will mark it as the answer. – Vame Apr 30 '14 at 09:30
  • I have edited my answer; you might want to edit your question with the additional details too :) – Sudarshan Apr 30 '14 at 09:34