I know that during the intermediate steps between the mapper and reducer, Hadoop will sort and partition the data on its way to the reducer.
Since the input to my mapper is already partitioned, is there a way to take advantage of that and possibly accelerate the intermediate processing, so that no further sorting or grouping takes place?
Adding some details:
Since I store my data on S3, let's say I have only two files in my bucket. The first file stores records for the lower half of the user IDs, the other stores records for the upper half. The data in each file is not necessarily sorted, but it is guaranteed that all records pertaining to a given user are located in the same file.
Such as:
s3://mybucket/file1
s3://mybucket/file2
File1 content:
User1,ValueX
User3,ValueY
User1,ValueZ
User1,ValueAZ
File2 content:
User9,ValueD
User7,ValueB
User7,ValueD
User8,ValueB
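If the sort itself can't be skipped, I imagine I could at least preserve this existing split on the reduce side with a custom Partitioner. A minimal sketch of what I have in mind, where `SPLIT_POINT` and the numeric-suffix parsing are just assumptions for this toy example:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner mirroring the input layout: records for the
// lower half of user IDs go to reducer 0, the upper half to reducer 1.
public class UserRangePartitioner extends Partitioner<Text, Text> {

    // Assumed boundary between the "lower half" and "upper half" of user IDs.
    private static final int SPLIT_POINT = 5;

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Keys look like "User1", "User9"; parse the numeric suffix.
        int userId = Integer.parseInt(key.toString().replace("User", ""));
        return (userId <= SPLIT_POINT ? 0 : 1) % numPartitions;
    }
}
```

This would be wired in with `job.setPartitionerClass(UserRangePartitioner.class)`, but it only keeps the partitioning consistent; it doesn't remove the sort.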
From what I've read, I can use a streaming job with two mappers, and each mapper will consume one of the two files in its entirety. Is this true?
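For illustration, with the Java API (rather than streaming) I imagine the whole-file-per-mapper behavior could be forced by making the input non-splittable; a minimal sketch, with a class name of my own:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Returning false tells Hadoop never to split a file into multiple
// input splits, so each file is processed by exactly one mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

It would be enabled with `job.setInputFormatClass(WholeFileTextInputFormat.class)`.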
Next, let's say the mapper will output each unique key only once, with the associated value being the number of occurrences of that key. (I realize this is more of a reducer's responsibility, but it serves the example here.)
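Here's a rough sketch of the mapper I have in mind, counting in memory and flushing in cleanup(), assuming the per-file counts fit in memory:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts occurrences of each key inside the mapper and emits every
// unique key exactly once, in cleanup(), with its total count.
public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // Input lines look like "User1,ValueX"; the key is the part before the comma.
        String user = line.toString().split(",", 2)[0];
        counts.merge(user, 1, Integer::sum);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```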
Can the sorting and partitioning of the mapper's output keys be disabled, letting the records flow straight through to the reducer(s)?
Or, to give another example: imagine all my input data contains just one line per unique key, and I don't need the final reducer output to be sorted. I just want to hash the value for each key. Can I disable the sorting and partitioning step before the reducer?
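From what I understand, setting the number of reduce tasks to zero makes the job map-only, skipping the shuffle, sort, and partition phases entirely, so mapper output is written straight to the output files. This is the kind of driver I have in mind; `HashingMapper` and the paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only hash");

        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(HashingMapper.class);   // hypothetical mapper that hashes each value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Zero reducers: no shuffle, sort, or partitioning happens at all,
        // and mapper output goes directly to the output files.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path("s3://mybucket/"));
        FileOutputFormat.setOutputPath(job, new Path("s3://mybucket/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

But I don't know whether there is any supported way to keep the reducers while dropping only the sort. Is there?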