I've been given a big text file with data taken from a company's taxis. The data is organised as one record per trip. Example:
- Driver's license(32421ALED), Fare(US$6)
- Driver's license(9167825HF), Fare(US$15)
The purpose of my Hadoop MapReduce program is to return the driver who collected the most money.
So:
- The Mapper processes the text file with a tokenizer so that, for each trip, it associates the driver's license with the fare.
- The Reducer takes the Mapper's output and computes each driver's total collected money by adding up all of their fares (rough sketch of both below).
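Roughly, my current Mapper and Reducer look like this. It's a simplified sketch: the parsing assumes exactly the line format shown above, and the class and method names are just placeholders I picked.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FareSum {

    // Emits (license, fare) for every trip line, e.g. "- Driver's license(32421ALED), Fare(US$6)".
    public static class TripMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int l1 = line.indexOf("license(");
            int l2 = line.indexOf(')', l1);
            int f1 = line.indexOf("US$");
            int f2 = line.indexOf(')', f1);
            if (l1 < 0 || l2 < 0 || f1 < 0 || f2 < 0) {
                return; // skip malformed lines
            }
            String license = line.substring(l1 + "license(".length(), l2);
            double fare = Double.parseDouble(line.substring(f1 + "US$".length(), f2));
            context.write(new Text(license), new DoubleWritable(fare));
        }
    }

    // Sums all fares for one driver.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }
}
```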
What now? I would need another reducer so that, once I have each driver's total, I only have to pick the one with the highest amount collected. This is the problem.
I searched through Stack Overflow and found two possible solutions:
Sharing a Conf variable through the job that holds the highest amount of money collected and its driver (it might need to be two values).
Use two jobs: the first to get each driver's total collected money, and the second to get the driver with the highest total (rough driver sketch below).
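For the second option, I imagine a driver class that chains two jobs: job 1 runs the Mapper/Reducer above, and job 2 reads job 1's output with a single reducer that keeps only the maximum. This is just what I have in mind, not tested; the class names, the constant "all" key, and the path arguments are my own placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopDriver {

    // Job 2 mapper: job 1 writes "license<TAB>total" lines; send every line to one
    // constant key so the single reducer sees all the totals together.
    public static class MaxMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text ALL = new Text("all");
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(ALL, value);
        }
    }

    // Job 2 reducer: keep only the driver with the highest total.
    public static class MaxReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestDriver = null;
            double bestTotal = Double.NEGATIVE_INFINITY;
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                double total = Double.parseDouble(parts[1]);
                if (total > bestTotal) {
                    bestTotal = total;
                    bestDriver = parts[0];
                }
            }
            context.write(new Text(bestDriver), new DoubleWritable(bestTotal));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: sum the fares per driver (Mapper/Reducer from the sketch above).
        Job sumJob = Job.getInstance(conf, "fare sum per driver");
        sumJob.setJarByClass(TopDriver.class);
        sumJob.setMapperClass(FareSum.TripMapper.class);
        sumJob.setReducerClass(FareSum.SumReducer.class);
        sumJob.setOutputKeyClass(Text.class);
        sumJob.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(sumJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(sumJob, new Path(args[1]));
        if (!sumJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: read job 1's output and keep only the driver with the highest total.
        Job maxJob = Job.getInstance(conf, "driver with max total");
        maxJob.setJarByClass(TopDriver.class);
        maxJob.setMapperClass(MaxMapper.class);
        maxJob.setReducerClass(MaxReducer.class);
        maxJob.setNumReduceTasks(1); // one reducer so it can compare every total
        maxJob.setMapOutputKeyClass(Text.class);
        maxJob.setMapOutputValueClass(Text.class);
        maxJob.setOutputKeyClass(Text.class);
        maxJob.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(maxJob, new Path(args[1]));
        FileOutputFormat.setOutputPath(maxJob, new Path(args[2]));
        System.exit(maxJob.waitForCompletion(true) ? 0 : 1);
    }
}
```

I think the single reducer in job 2 would not be a bottleneck, since it only receives one record per driver rather than one per trip, but I'm not sure this is the right way to do it.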
Which would be the best choice for my problem? Are there any others?