
I am trying to set up a Spark + HDFS deployment on a small cluster using Docker Swarm as a stack deployment. I have it generally working, but I ran into an issue that is preventing Spark from taking advantage of data locality.

In an attempt to enable data locality, I made a single "worker node" container on each server that runs both the Spark worker and the HDFS datanode. The idea is that, because they run in the same container, both should have the same IP address on the stack's overlay network. However, they do not: the container gets one VIP on the overlay network, and the service defined in the stack's compose file gets another VIP.

It turns out that the HDFS datanode process binds to the container's VIP while the Spark worker process binds to the service's VIP (as best I can determine). As a result, Spark doesn't know that the Spark worker and the HDFS datanode are actually on the same machine and only schedules tasks with ANY locality.
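For context on the Spark side: Spark supports pinning the address a daemon binds to and advertises via the SPARK_LOCAL_IP (or SPARK_LOCAL_HOSTNAME) environment variables in spark-env.sh. I have not confirmed this resolves the mismatch, but forcing the worker to identify itself by the container hostname would look something like this:

    # conf/spark-env.sh (untested sketch; worker-node2 is this container's
    # hostname as set in the compose file)
    export SPARK_LOCAL_HOSTNAME=worker-node2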

I am sure I am missing something, but I (of course) don't know what.

The Docker stack compose file entry that I use for defining each worker node service looks like this:

version: '3.4'
services:

    ...

    worker-node2:
        image: master:5000/spark-hdfs-node:latest
        hostname: "worker-node2"
        networks:
            - cluster_network
        environment:
            - SPARK_PUBLIC_DNS=10.1.1.1
            - SPARK_LOG_DIR=/data/spark/logs
        depends_on:
            - hdfs-namenode
        volumes:
            - type: bind
              source: /mnt/data/hdfs
              target: /data/hdfs
            - type: bind
              source: /mnt/data/spark
              target: /data/spark
        deploy:
            mode: replicated
            replicas: 1
            placement:
                constraints:
                    - node.hostname == slave1
            resources:
                limits:
                    memory: 56g

    ...

networks:
    cluster_network:
        attachable: true
        ipam:
            driver: default
            config:
                - subnet: 10.20.30.0/24
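For reference, Swarm assigns every service a VIP by default, but the deploy section supports endpoint_mode: dnsrr, which makes the embedded DNS resolve the service name directly to the task containers' IPs instead of a separate VIP. I have not verified that this fixes the binding mismatch, but it would look something like this (untested sketch based on the worker-node2 entry above):

    worker-node2:
        image: master:5000/spark-hdfs-node:latest
        deploy:
            endpoint_mode: dnsrr
            ...

Note that services in dnsrr mode cannot use ingress-mode published ports, so any port mappings would need to use mode: host.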

The Hadoop HDFS-site.xml configuration looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hdfs/datanode</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>The default replication factor of files on HDFS</description>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property> 
        <name>dfs.block.size</name>
        <value>64m</value>
        <description>The default block size of files saved to HDFS</description>
    </property>
    <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.datanode.use.datanode.hostname</name>
        <value>true</value>
    </property>

    <property>
        <name>dfs.namenode.rpc-bind-host</name>
        <value>0.0.0.0</value>
        <description>
            controls what IP address the NameNode binds to. 
            0.0.0.0 means all available.
        </description>
    </property>
    <property>
        <name>dfs.namenode.servicerpc-bind-host</name>
        <value>0.0.0.0</value>
        <description>
            controls what IP address the NameNode binds to. 
            0.0.0.0 means all available.
        </description>
    </property>
    <property>
        <name>dfs.namenode.http-bind-host</name>
        <value>0.0.0.0</value>
        <description>
            controls what IP address the NameNode binds to. 
            0.0.0.0 means all available.
        </description>
    </property>
    <property>
        <name>dfs.namenode.https-bind-host</name>
        <value>0.0.0.0</value>
        <description>
            controls what IP address the NameNode binds to. 
            0.0.0.0 means all available.
        </description>
    </property>

</configuration>
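For completeness, HDFS also has a dfs.datanode.hostname property, which pins the hostname a datanode advertises to the namenode. Combined with dfs.datanode.use.datanode.hostname above, it may force the datanode to be known by the container hostname rather than whichever address it happens to bind. A hypothetical fragment (untested; worker-node2 is this container's hostname from the compose file):

    <property>
        <name>dfs.datanode.hostname</name>
        <value>worker-node2</value>
    </property>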

My full setup can be viewed here on GitHub.

Does anyone have any ideas what I am doing wrong that is preventing the Spark worker and HDFS datanode processes in the same Docker container from binding to the same IP address?

kamprath
  • An exemplary question! Very seldom these days! – Peter Mortensen Nov 09 '19 at 21:04
  • What host operating system is your docker engine running on? – Z4-tier Nov 15 '19 at 04:15
  • Ubuntu server 18.04.3 – kamprath Nov 16 '19 at 06:19
  • By default, service discovery assigns a virtual IP address (VIP) and DNS entry to each service in the swarm, making it available by its service name to containers on the same network. You can configure the service to use DNS round-robin instead of a VIP. Source: https://stackoverflow.com/questions/38812357/whats-the-purpose-of-binding-vip-addr-in-every-container-of-a-service-in-docker – redhatvicky Nov 18 '19 at 10:10
  • Docker uses embedded DNS to provide service discovery for containers running on a single Docker Engine and tasks running in a Docker Swarm. Docker Engine has an internal DNS server that provides name resolution to all of the containers on the host in user-defined bridge, overlay, and MACVLAN networks. Source: https://stackoverflow.com/questions/44724497/what-is-overlay-network-and-how-does-dns-resolution-work – redhatvicky Nov 18 '19 at 10:11
  • you said you are dockerizing both these components together so that you can make use of the "data locality". does that mean that you need to leverage the fact that both these components have a shared file system? – Kusal Hettiarachchi May 14 '20 at 14:20
  • The key issue is that the HDFS data node and the spark executor need to share the same IP address for Spark to schedule tasks to take advantage of data locality. – kamprath May 15 '20 at 02:15
  • I think your inclusion of Spark and HDFS made most docker users skip this question. The root cause seems to be mostly in that you don't know how to control docker to give out the right IP addresses. Maybe you could ask or search for more 'pure' docker question without service specific distractions. – Dennis Jaheruddin Jul 31 '20 at 09:14
  • @DennisJaheruddin probably, but realize this question is inextricably linked to the fact that Spark and HDFS are binding to IP addresses in a way I don't readily understand, hence my question. I do understand how to set a Docker service IP address (or at least the IP address range), but that's not really relevant here because one app binds to the container IP address and the other binds to the service IP address, which are different. – kamprath Aug 03 '20 at 21:40
  • you can run multiple replicas on same node? – stackit Sep 25 '20 at 14:57
  • @stackit I am not sure what you are asking. Do you mean HDFS replication? – kamprath Sep 27 '20 at 07:22
  • No basically I am asking if we can run multiple Spark workers on the same machine – stackit Sep 27 '20 at 07:24

1 Answer


Isn't it linked to the use of this:

    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
    </property>

Using the hostname means being bound to the container and not the service itself, if I'm correct.

Faeeria