"Hadoop: The Definitive Guide" says:
The client running the job calculates the splits for the job by calling getSplits(), then sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
We know that getLocations() returns an array of hostnames.
Question 1: How does the client know which hostnames to return? Isn't that the job of the jobtracker?
Question 2: Can two different InputSplit objects return the same hostname? How are the hostnames decided, and who decides them?
My guess is that the client contacts the namenode to get all the hostnames that hold the file's blocks (replicas included), then does some arithmetic to arrive at the set of locations for each InputSplit. Is that true?
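
To make that guess concrete, here is a minimal sketch of what I imagine the client doing. It is plain Java with no Hadoop dependency: Block, Split, computeSplits and blockContaining are hypothetical names I made up for illustration, and the block list stands in for whatever the namenode would report (in Hadoop I would expect something like FileSystem.getFileBlockLocations() to supply it, but that is my assumption). Each split simply takes the hosts of the block that contains its start offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitLocationSketch {

    // Hypothetical stand-in for what the namenode reports about one block:
    // its byte range in the file and the hosts that store a replica.
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Hypothetical stand-in for an InputSplit: a byte range plus hostnames.
    static final class Split {
        final long start;
        final long length;
        final String[] locations;

        Split(long start, long length, String[] locations) {
            this.start = start;
            this.length = length;
            this.locations = locations;
        }
    }

    // Cut the file into splitSize-byte ranges and give each range the hosts
    // of the block that contains its first byte (an assumed policy).
    static List<Split> computeSplits(List<Block> blocks, long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            Block block = blockContaining(blocks, start);
            splits.add(new Split(start, length, block.hosts));
        }
        return splits;
    }

    // Linear search for the block whose byte range covers the given offset.
    static Block blockContaining(List<Block> blocks, long offset) {
        for (Block b : blocks) {
            if (offset >= b.offset && offset < b.offset + b.length) {
                return b;
            }
        }
        throw new IllegalArgumentException("offset " + offset + " is outside the file");
    }

    public static void main(String[] args) {
        // Made-up block report: two 64-byte blocks, three replicas each.
        List<Block> blocks = Arrays.asList(
                new Block(0, 64, "host1", "host2", "host3"),
                new Block(64, 64, "host2", "host3", "host4"));
        for (Split s : computeSplits(blocks, 128, 64)) {
            System.out.println(s.start + "+" + s.length + " -> " + Arrays.toString(s.locations));
        }
    }
}
```

If this picture is right, then two splits could list the same hostname whenever their blocks have a replica on the same machine (host2 and host3 in the made-up data above), which is what Question 2 is asking about.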