  • AFAIK, passwordless SSH is needed so that the master node can start the daemon processes on each slave node. Apart from that, does passwordless SSH serve any other purpose in Hadoop's operation?

  • How are the user code JARs and data chunks transferred to the slave nodes? I want to know the mechanism and the protocol used.

  • Should passwordless SSH be configured only for master-slave pairs, or also among the slaves themselves?


2 Answers


You are correct. If SSH is not passwordless, you have to log in to each individual machine and start all the processes there manually. As for your second question: all communication in HDFS happens over TCP/IP, and HTTP is used for data movement (for example, during the MapReduce shuffle). The mechanism goes like this:

A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol.
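
To make the client side of this concrete, here is a minimal sketch in Java against the standard org.apache.hadoop.fs.FileSystem API. The NameNode address hdfs://master:9000 and the file path are placeholders for illustration: metadata calls such as exists() travel to the NameNode over the RPC channel described above, while the bytes returned by open() are streamed from the DataNodes that hold the blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; use your cluster's fs.defaultFS value.
            conf.set("fs.defaultFS", "hdfs://master:9000");
            FileSystem fs = FileSystem.get(conf);

            // exists() is a metadata operation: an RPC to the NameNode.
            Path path = new Path("/user/hadoop/sample.txt");
            if (fs.exists(path)) {
                // open() asks the NameNode for block locations over RPC,
                // then reads the actual bytes directly from the DataNodes.
                try (FSDataInputStream in = fs.open(path)) {
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        System.out.write(buf, 0, n);
                    }
                    System.out.flush();
                }
            }
            fs.close();
        }
    }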

And as for the third question: it is not necessary to have passwordless SSH among the slave nodes.
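
Note that the slaves do exchange data with each other (for example, reducers pull map outputs during the shuffle), but that traffic runs over Hadoop's own TCP/HTTP channels rather than SSH. As a loose sketch of the idea, not Hadoop's real shuffle endpoint, imagine one worker fetching another worker's map output with a plain HTTP GET (the URL below is a hypothetical placeholder):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Loose illustration: a worker pulling another worker's output over
    // plain HTTP, the way the shuffle moves data without any SSH session.
    // The URL format is a made-up placeholder, not Hadoop's actual one.
    public class ShuffleFetchSketch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://slave2:8080/mapOutput?map=m_000001&reduce=0");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (InputStream in = conn.getInputStream()) {
                byte[] buf = new byte[8192];
                long total = 0;
                int n;
                while ((n = in.read(buf)) > 0) {
                    total += n; // a real reducer would merge these bytes
                }
                System.out.println("fetched " + total + " bytes");
            } finally {
                conn.disconnect();
            }
        }
    }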

  • Can you please elaborate on why passwordless SSH is not required among slave/worker nodes? How could data be exchanged in a shuffle without passwordless SSH among the slave/worker nodes? – sherminator35 Jul 25 '22 at 22:43
  • I could not find documentation on the Hadoop website explaining why passwordless SSH is required. If you have seen what role passwordless SSH plays in Hadoop's operation, please share. – vinayakshukre Oct 12 '22 at 15:22
  • I did not find anything in Hadoop's documentation either. My assumption was that passwordless SSH is required for communication among slave/worker nodes. I was wondering if there exists some other protocol for data transfer (e.g., in a shuffle) that works without passwordless SSH. – sherminator35 Oct 14 '22 at 16:02

Answer to the first question:

The Hadoop core uses SSH to launch the server processes on the slave nodes. This requires a password-less SSH connection between the master and all the slave and secondary machines.

We need password-less SSH in a fully-distributed environment because, when the cluster is live and running, communication is very frequent: the JobTracker should be able to send a task to a TaskTracker quickly.
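
As a rough illustration of that launch step, the sketch below mimics in Java what the master-side control scripts (such as start-dfs.sh) effectively do: read the list of slave hostnames and start a daemon on each one over SSH. The slaves-file path and the daemon command are illustrative assumptions, not the actual script logic.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    // For each slave host, run the daemon start command over SSH. With
    // passwordless SSH every invocation proceeds unattended; otherwise
    // each host would block on an interactive password prompt.
    public class StartSlaveDaemons {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Hypothetical slaves-file location; adjust for your cluster.
            List<String> slaves = Files.readAllLines(Paths.get("/etc/hadoop/conf/slaves"));
            for (String host : slaves) {
                if (host.trim().isEmpty()) continue;
                ProcessBuilder pb = new ProcessBuilder(
                        "ssh", host, "hadoop-daemon.sh", "start", "datanode");
                pb.inheritIO(); // show the remote command's output locally
                int exit = pb.start().waitFor();
                System.out.println(host + " -> exit code " + exit);
            }
        }
    }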
