I'm having trouble getting my HDFS setup to work in Docker Swarm. To understand the problem, I've reduced the setup to the minimum:
- 1 physical machine
- 1 namenode
- 1 datanode
This setup works fine with docker-compose, but it fails with docker swarm using the same compose file.
Here is the compose file:
version: '3'
services:
  namenode:
    image: uhopper/hadoop-namenode
    hostname: namenode
    ports:
      - "50070:50070"
      - "8020:8020"
    volumes:
      - /userdata/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=hadoop-cluster
  datanode:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    volumes:
      - /userdata/datanode:/hadoop/dfs/data
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:8020
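For reference, bringing this up with docker-compose needs nothing beyond the file above; roughly (the mkdir just pre-creates the two bind-mount directories from the volumes section so the daemon does not create them as root):

# pre-create the host directories used as bind mounts
sudo mkdir -p /userdata/namenode /userdata/datanode
# start both services in the background and check they are running
docker-compose up -d
docker-compose ps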
To test it, I have installed a Hadoop client on my host (physical) machine, with only this simple configuration in core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://0.0.0.0:8020</value>
  </property>
</configuration>
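A quick way to check the client-to-namenode connection on its own (both commands only use the 8020 RPC port configured above):

# list the HDFS root - only needs the namenode RPC port
hdfs dfs -ls /
# ask the namenode for its view of the cluster, including live datanodes
hdfs dfsadmin -report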
Then I run the following command:
hdfs dfs -put test.txt /test.txt
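test.txt is nothing special, just a small throwaway file (echo "hello" > test.txt); reading it back afterwards is enough to confirm the write:

# read the file back to confirm the write went through
hdfs dfs -cat /test.txt
# and list it to see size and replication
hdfs dfs -ls /test.txt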
With docker-compose (just running docker-compose up) it works and the file is written to HDFS.
With docker swarm, I run:
docker swarm init
docker stack deploy --compose-file docker-compose.yml hadoop
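"All services up" here just means that both services report 1/1 replicas; the usual swarm commands to check that (service names are prefixed with the stack name, hence hadoop_namenode and hadoop_datanode):

# one line per service with the number of running replicas
docker service ls
# per-task status of the whole stack, useful to spot restart loops
docker stack ps hadoop
# logs of a single service, e.g. the datanode
docker service logs hadoop_datanode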
Then, once all services are up, I try to put my file on HDFS and it fails like this:
INFO hdfs.DataStreamer: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/x.x.x.x:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:259)
at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1692)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1648)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
18/06/14 17:29:41 WARN hdfs.DataStreamer: Abandoning BP-1801474405-10.0.0.4-1528990089179:blk_1073741825_1001
18/06/14 17:29:41 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[10.0.0.6:50010,DS-d7d71735-7099-4aa9-8394-c9eccc325806,DISK]
18/06/14 17:29:41 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
If I look at the web UI, the datanode seems to be up and no issue is reported...
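(The same information is available outside the browser: the namenode web UI also serves its metrics as JSON through the standard /jmx servlet, so NumLiveDataNodes there should match what the UI reports. The bean name below is the Hadoop 2.x one, adjust if your version differs.)

# same data the web UI shows, as JSON (NumLiveDataNodes, NumDeadDataNodes, ...)
curl 'http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'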
Update: it seems that depends_on is ignored by swarm, but it does not seem to be the cause of my problem: I restarted the datanode once the namenode was up, and it did not work any better.
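(In swarm mode the datanode service can be recreated without redeploying the whole stack, e.g.:)

# force swarm to recreate the datanode task
docker service update --force hadoop_datanode
# or scale it to zero and back up
docker service scale hadoop_datanode=0
docker service scale hadoop_datanode=1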
Thanks for your help :)