
I just want to clarify the quote "Code moves near data for computation":

  1. Does this mean that all the Java MR code written by a developer is deployed to all the servers in the cluster?

  2. If 1 is true, when someone changes an MR program, how is it distributed to all the servers?

Thanks

realnumber
  • Can someone please explain the concept of "Code moves near data" in terms of design? Can this be understood without prior knowledge of Hadoop? – Sankalp Nov 15 '16 at 04:22

2 Answers

  1. Hadoop puts the MR job's jar into HDFS, its distributed file system. The task trackers that need it take it from there, so the jar is distributed to some nodes and then loaded on demand by the nodes that actually need it. Usually that need means the node is going to process local data (see the driver sketch below for where the jar is specified).
  2. A Hadoop cluster is "stateless" with respect to jobs: each job is treated as something new, and the "side effects" of previous jobs are not reused.
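
To make this concrete, here is a minimal sketch of the classic WordCount driver, assuming the `org.apache.hadoop.mapreduce` API of the Hadoop 1.x era (class and argument names are just for illustration). The relevant call is `setJarByClass()`: it tells Hadoop which jar to ship, and submitting the job is what triggers the upload to HDFS and the on-demand download by the task nodes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) on Hadoop 2+

        // The jar containing this class is what the client uploads to the
        // job staging area in HDFS; task nodes pull it from there on demand.
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submission is what triggers jar distribution; rerunning after a
        // code change ships the newly built jar, no manual copying needed.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would typically launch it with something like `hadoop jar wordcount.jar WordCount <input> <output>`; after a code change, rebuilding the jar and resubmitting is all that is needed, since Hadoop ships the new jar on the next run.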

Indeed, when only a small number of files (or splits, to be precise) is to be processed on a large cluster, an optimization that sends the jar only to the few hosts where the data actually resides might somewhat reduce job latency. I do not know if such an optimization is planned.

David Gruzman
  • Thanks for the clarification! Do you know of any articles that describe how to download a jar dynamically and execute it on a remote server? That sounds interesting. – realnumber Jul 23 '12 at 00:13
  • Hadoop automatically takes care of distributing the jars to all the computing nodes. Check out the Hadoop [documentation](http://www.thecloudavenue.com/p/hadoopresources.html). – Praveen Sripati Jul 23 '12 at 00:47

In a Hadoop cluster you use the same nodes for data and computation. That means your HDFS DataNode runs on the same machines used by the TaskTracker for computation. So when you execute MR jobs, the JobTracker looks at where your data is stored. In other computation models the data is not stored in the same cluster, and you may have to move it to some compute node while you are doing your computation.

After you start a job, each map task gets a split of your input file. The map tasks are scheduled so that their split of the input file is close to them, in other words on the same node or at least the same rack. This is what we mean by computation being done closer to the data.
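
Not part of the original answer, just an illustration of where those locality hints live: each `InputSplit` reports the hosts that hold its HDFS blocks, and the scheduler uses that list to place map tasks on or near them. A small diagnostic sketch (the class name `SplitLocations` is hypothetical), assuming the new-API `TextInputFormat`:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        // Set up a throwaway job just to compute the splits of an input path.
        Job job = new Job(new Configuration(), "split locations");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // getSplits() is what the framework calls at submission time;
        // getLocations() returns the hostnames that store each split's
        // blocks, which the scheduler uses to run the map task nearby.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> "
                    + Arrays.toString(split.getLocations()));
        }
    }
}
```

Run against a path in HDFS, the printed hosts are the DataNodes holding each block, which is why co-locating the DataNode and TaskTracker lets the computation run next to the data.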

So to answer your question: every time you run an MR job, its code is copied to all the nodes. If you change the code, the new code is copied to all the nodes the next time you run the job.

Animesh Raj Jha