I've been given a big text file with data taken from a company's taxis. The data is organised as one record per trip. Example:
- Driver's license(32421ALED), Fare(US$6)
- Driver's license(9167825HF), Fare(US$15)
The purpose of my Hadoop MapReduce program is to return the driver who collected the most money.
So:
- The Mapper processes the text file with a tokenizer so that, for each trip, it associates the driver's license with the fare.
- The Reducer takes the Mapper's output and computes each driver's total collected money by adding up all of their fares (rough sketch of both below).
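Roughly, my current Mapper and Reducer look like this. It's a simplified sketch: the parsing assumes exactly the line format shown above, and the class and method names are just placeholders I picked.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FareSum {

    // Emits (license, fare) for every trip line, e.g. "- Driver's license(32421ALED), Fare(US$6)".
    public static class TripMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int l1 = line.indexOf("license(");
            int l2 = line.indexOf(')', l1);
            int f1 = line.indexOf("US$");
            int f2 = line.indexOf(')', f1);
            if (l1 < 0 || l2 < 0 || f1 < 0 || f2 < 0) {
                return; // skip malformed lines
            }
            String license = line.substring(l1 + "license(".length(), l2);
            double fare = Double.parseDouble(line.substring(f1 + "US$".length(), f2));
            context.write(new Text(license), new DoubleWritable(fare));
        }
    }

    // Sums all fares for one driver.
    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }
}
```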
What now? I would need another reducer so that, once I have each driver's total, I only have to pick the one with the highest amount collected. This is the problem.
I searched through Stack Overflow and found two possible solutions:
Sharing a Conf variable through the job that holds the highest amount of money collected and its driver (it might need to be two values).
Use two jobs: the first to get each driver's total collected money, and the second to get the driver with the highest total (rough driver sketch below).
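For the second option, I imagine a driver class that chains two jobs: job 1 runs the Mapper/Reducer above, and job 2 reads job 1's output with a single reducer that keeps only the maximum. This is just what I have in mind, not tested; the class names, the constant "all" key, and the path arguments are my own placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopDriver {

    // Job 2 mapper: job 1 writes "license<TAB>total" lines; send every line to one
    // constant key so the single reducer sees all the totals together.
    public static class MaxMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text ALL = new Text("all");
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(ALL, value);
        }
    }

    // Job 2 reducer: keep only the driver with the highest total.
    public static class MaxReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String bestDriver = null;
            double bestTotal = Double.NEGATIVE_INFINITY;
            for (Text v : values) {
                String[] parts = v.toString().split("\t");
                double total = Double.parseDouble(parts[1]);
                if (total > bestTotal) {
                    bestTotal = total;
                    bestDriver = parts[0];
                }
            }
            context.write(new Text(bestDriver), new DoubleWritable(bestTotal));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: sum the fares per driver (Mapper/Reducer from the sketch above).
        Job sumJob = Job.getInstance(conf, "fare sum per driver");
        sumJob.setJarByClass(TopDriver.class);
        sumJob.setMapperClass(FareSum.TripMapper.class);
        sumJob.setReducerClass(FareSum.SumReducer.class);
        sumJob.setOutputKeyClass(Text.class);
        sumJob.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(sumJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(sumJob, new Path(args[1]));
        if (!sumJob.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: read job 1's output and keep only the driver with the highest total.
        Job maxJob = Job.getInstance(conf, "driver with max total");
        maxJob.setJarByClass(TopDriver.class);
        maxJob.setMapperClass(MaxMapper.class);
        maxJob.setReducerClass(MaxReducer.class);
        maxJob.setNumReduceTasks(1); // one reducer so it can compare every total
        maxJob.setMapOutputKeyClass(Text.class);
        maxJob.setMapOutputValueClass(Text.class);
        maxJob.setOutputKeyClass(Text.class);
        maxJob.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(maxJob, new Path(args[1]));
        FileOutputFormat.setOutputPath(maxJob, new Path(args[2]));
        System.exit(maxJob.waitForCompletion(true) ? 0 : 1);
    }
}
```

I think the single reducer in job 2 would not be a bottleneck, since it only receives one record per driver rather than one per trip, but I'm not sure this is the right way to do it.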
Which would be the best choice for my problem? Are there any others?