
(The query section is below, towards the middle.) Cross-posted at https://developer.jboss.org/message/982355

Environment: Infinispan 9.1.3, embedded cache in a cluster with JGroups, single file store, using JGroups inside Docker services on a single Docker host/daemon (not in AWS yet).

Infinispan.xml below:

<jgroups><stack-file name="external-file" path="${path.to.jgroups.xml}"/></jgroups>

Application = 2 webapps + database

Issue:

When I deploy the 2 webapps in separate Tomcats directly on a machine (not Docker yet), the Infinispan manager initializing the cache (in each webapp) forms a cluster using JGroups (i.e. it works). But with the exact same configuration (and the same channel name in JGroups), when deploying the webapps as services in Docker, they don't join the same cluster - rather they are separate and each has just one member in its view (logs below).

The services are Docker containers built from images (Linux + Tomcat + webapp) and are launched using Docker Compose v3.

I have tried the instructions at https://github.com/belaban/jgroups-docker (a container with JGroups and a couple of demos). It suggests either using --network=host mode for the Docker services (this does work, but we cannot use it because the config files would need separate ports if we scale), or passing external_addr=docker_host_IP_address in jgroups.xml (this is NOT working, and the query is how to make it work).

It's not a timing issue: I also tried putting a significant delay before starting the 2nd service in the stack, but each app's Infinispan cluster still has just one member in its view (that container itself). Calling cacheManager.getMembers() also shows just one entry inside each app (it should show 2).

Log showing just one member in first app:

org.infinispan.remoting.transport.jgroups.JGroupsTransport.receiveClusterView ISPN000094: Received new cluster view for channel CHANNEL_NAME: [FirstContainerId-6292|0] (1) [FirstContainerId-6292].

org.infinispan.remoting.transport.jgroups.JGroupsTransport.startJGroupsChannelIfNeeded ISPN000079: Channel CHANNEL_NAME local address is FirstContainerId-6292, physical addresses are [10.xx.yy.zz:7800]

Log showing just one member in second app:

org.infinispan.remoting.transport.jgroups.JGroupsTransport.receiveClusterView ISPN000094: Received new cluster view for channel CHANNEL_NAME: [SecondContainerId-3502|0] (1) [SecondContainerId-3502]

29-Apr-2018 11:47:42.357 INFO [localhost-startStop-1] org.infinispan.remoting.transport.jgroups.JGroupsTransport.startJGroupsChannelIfNeeded ISPN000079: Channel CHANNEL_NAME local address is 58cfa4b95c16-3502, physical addresses are [10.xx.yy.zz:7800]

The Docker Compose v3 file is below and shows the overlay network:

version: "3"
services:
  app1:
    image: app1:version
    ports:
      - "fooPort1:barPort"
    volumes:
      - "foo:bar"
    networks:
      - webnet

  app2:
    image: app2:version
    ports:
      - "fooPort2:barPort"
    volumes:
      - "foo:bar"
    networks:
      - webnet

volumes:
  dbdata:

networks:
  webnet:

Deployed using: $ docker stack deploy --compose-file docker-compose.yml OurStack
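Since bind_addr and bind_port in the jgroups.xml below resolve from the `jgroups.tcp.address`/`jgroups.tcp.port` system properties, one way to override the loopback default per container is to pass them through the Compose file. This is only a sketch: it assumes the image's entrypoint forwards CATALINA_OPTS to the Tomcat JVM, and that eth0 is the interface on the overlay network.

```yaml
# Sketch only: CATALINA_OPTS reaching the JVM and eth0 being the
# overlay-network interface are assumptions, not confirmed details.
services:
  app1:
    image: app1:version
    environment:
      # jgroups.tcp.address also accepts symbolic values like match-interface:eth0
      CATALINA_OPTS: "-Djgroups.tcp.address=match-interface:eth0 -Djgroups.tcp.port=7800"
    networks:
      - webnet
```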

The JGroups.xml has the relevant config part below:

<TCP external_addr="${ext-addr:docker.host.ip.address}"
     bind_addr="${jgroups.tcp.address:127.0.0.1}"
     bind_port="${jgroups.tcp.port:7800}"
     enable_diagnostics="false"
     thread_naming_pattern="pl"
     send_buf_size="640k"
     sock_conn_timeout="300"
     bundler_type="sender-sends-with-timer"
     thread_pool.min_threads="${jgroups.thread_pool.min_threads:1}"
     thread_pool.max_threads="${jgroups.thread_pool.max_threads:10}"
     thread_pool.keep_alive_time="60000"/>

<MPING bind_addr="${jgroups.tcp.address:127.0.0.1}"
       mcast_addr="${jgroups.mping.mcast_addr:228.2.4.6}"
       mcast_port="${jgroups.mping.mcast_port:43366}"
       ip_ttl="${jgroups.udp.ip_ttl:2}"/>
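Based on the fix reported in the comment thread at the end (set a non-loopback TCP bind_addr and remove the MPING bind_addr), a working configuration might look like the sketch below; `match-interface:eth0` is an assumption about which interface sits on the Docker overlay network.

```xml
<!-- Sketch: bind the transport to the overlay-network interface
     instead of the 127.0.0.1 default (eth0 is an assumption). -->
<TCP bind_addr="match-interface:eth0"
     bind_port="${jgroups.tcp.port:7800}"/>

<!-- MPING with no explicit bind_addr no longer forces discovery
     onto the loopback interface. -->
<MPING mcast_addr="${jgroups.mping.mcast_addr:228.2.4.6}"
       mcast_port="${jgroups.mping.mcast_port:43366}"
       ip_ttl="${jgroups.udp.ip_ttl:2}"/>
```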

The code is similar to:

DefaultCacheManager manager = new DefaultCacheManager(jgroupsConfigFile.getAbsolutePath());

// Cache is an interface, so there is no constructor to call; withFlags() already returns a usable Cache.
Cache<Object, Object> someCache = manager.getCache("SOME_CACHE").getAdvancedCache().withFlags(Flag.IGNORE_RETURN_VALUES);

Query: How do we deploy with docker-compose (as two services in Docker containers) and the jgroups.xml above so that the Infinispan caches in the two webapps join and form a cluster - so both apps can access the same data that each reads/writes in the cache? Right now they connect to the same channel name, yet each becomes a cluster with one member, even when we point JGroups at external_addr.

Tried so far:

  • Putting a delay in the second service's startup so the first has enough time to advertise itself.
  • Running JGroups in Docker: the belaban/jgroups containers, deployed as two services in a stack using Docker Compose, are able to form a cluster (chat.sh inside the container shows a 2-member view).
  • Tried --network=host, which works but is infeasible for us. Tried external_addr=docker.host.ip in jgroups.xml, which would be the ideal solution, but it's not working (the logs above are from that attempt).

Thanks! Will try to provide any specific info if required.

PKM

1 Answer


Apparently external_addr="${ext-addr:docker.host.ip.address}" does not resolve (or resolves to null), so the bind_addr of 127.0.0.1 is used. Is docker.host.ip.address set by you (e.g. as an environment variable)?

The external_addr attribute should point to a valid IP address.
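Note that in `${ext-addr:docker.host.ip.address}` the text after the colon is only the *default value* for the `ext-addr` system property, so unless `-Dext-addr=...` is passed to the JVM, JGroups is left with the literal string `docker.host.ip.address`, which is not a resolvable address. A minimal sketch of supplying a real value (the IP is a placeholder, and CATALINA_OPTS reaching the Tomcat JVM is an assumption):

```shell
# Placeholder IP: substitute the Docker host's actual address.
# Assumes the Tomcat startup scripts forward CATALINA_OPTS to the JVM.
export CATALINA_OPTS="-Dext-addr=10.0.0.5 $CATALINA_OPTS"
```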

Bela Ban
  • I corrected the logs (sorry, I think I took them without adding the external_addr). Even when external_addr="10.xx.yy.zz" was added to jgroups.xml (corrected in the question's logs above), the issue was still seen: both apps showed they joined a cluster with one view member, with the logical address containerId-something and physical address 10.xx.yy.zz:7800 (which I believe Docker's internal routing should resolve to the respective container). – PKM Apr 30 '18 at 13:19
  • Also one update - I tested today with infinispan-core-9.1.3.Final.jar/default-configs/default-jgroups-udp.xml (with no changes to it), and the containers were able to form a cluster (with 2 view members). Despite this workaround we are hoping to get it to work over TCP for reliability and the default-jgroups-tcp.xml is still showing the same issue. Thanks for responding, btw! – PKM Apr 30 '18 at 13:21
  • MPING is the protocol responsible for the initial discovery; can you double check if its bind_addr is set correctly? E.g. 127.0.0.1 won't work unless both processes run on the same physical box... – Bela Ban May 01 '18 at 17:58
  • This is the TCP configuration that I use to successfully run multiple instances on the same physical box: – Bela Ban May 01 '18 at 18:03
  • Yes, now TCP also works. The minimal changes I had to make were to set the TCP bind_addr and remove the MPING bind_addr field in my xml file - and now all nodes join the cluster and the caches are replicating. Thanks! – PKM May 02 '18 at 05:52
  • I was hoping to request your advice on a follow-up question: I am now trying to deploy the same setup on 2 separate Docker hosts that are in a Docker swarm cluster. Each of the 2 Docker hosts has a private IP in the local network, and I know their default gateway. What changes would need to be made to the TCP bind_addr above to allow containers across both Docker hosts to discover each other? At present the containers work if they are on the same single Docker host. Thanks! – PKM Sep 19 '18 at 08:43
  • Docker swarm does not (yet) support IP multicasting, as detailed in https://github.com/docker/libnetwork/issues/552. I suggest using a Docker plugin such as Weave, which supports multicasting, or switching the transport/discovery to e.g. TCP:TCPGOSSIP, TCP:TCPPING, or other cloud-based discovery protocols such as NATIVE_S3_PING, FILE_PING, GOOGLE_PING, etc. – Bela Ban Sep 28 '18 at 11:20
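Following the TCPPING suggestion in the last comment, a multicast-free discovery sketch could replace MPING in jgroups.xml; the host names and ports below are placeholders for the actual member addresses.

```xml
<!-- Sketch: static discovery over TCP, no multicast needed.
     initial_hosts lists known members (placeholder addresses). -->
<TCPPING initial_hosts="swarm-node-1[7800],swarm-node-2[7800]"
         port_range="1"/>
```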