With Ignite 2.7.6, when I try to bring up an embedded Ignite server node (in a Spring Boot app) on a Docker bridge network with a simple configuration, server startup fails with the error below:

[10:16:16] Ignite node started OK (id=e7276b83)
[10:16:16] >>> Ignite cluster is not active (limited functionality available). Use control.(sh|bat) script or IgniteCluster interface to activate.
[10:16:16] Topology snapshot [ver=1, locNode=e7276b83, servers=1, clients=0, state=INACTIVE, CPUs=1, offheap=0.1GB, heap=0.4GB]
mediation-service - [INFO ] 10:16:16.981 [main] com.**.**.perfmon.common.spring.EmbeddedIgnite    - ====>>> Activating Ignite Cluster
mediation-service - [WARN ] 10:16:17.383 [exchange-worker-#49] org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager     - Started write-ahead log manager in NONE mode, persisted data may be lost in a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.
[10:16:17] Started write-ahead log manager in NONE mode, persisted data may be lost in a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.
mediation-service - [ERROR] 10:16:21.982 [tcp-disco-srvr-#3] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi        - Failed to accept TCP connection.
java.net.SocketTimeoutException: Accept timed out
        at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5845)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServerThread.body(ServerImpl.java:5763)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
mediation-service - [WARN ] 10:16:21.982 [RMI TCP Accept-19887] sun.rmi.transport.tcp   - RMI TCP Accept-19887: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=19887] throws
java.net.SocketTimeoutException: Accept timed out
        at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
        at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [WARN ] 10:16:21.982 [RMI TCP Accept-0] sun.rmi.transport.tcp       - RMI TCP Accept-0: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=33254] throws
java.net.SocketTimeoutException: Accept timed out
        at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
        at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
        at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [ERROR] 10:16:21.984 [tcp-disco-srvr-#3]    - Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.net.SocketTimeoutException: Accept timed out]]

Below is the relevant config.

Ignite config xml snippet:

....
....
<property name="discoverySpi">
            <bean
                class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder"/>
                </property>
            </bean>
</property>
....
....

docker-compose snippet:

services:
  ***-mediation-service:
    image: ***/mediation-service:latest
    build: .
    environment:
    - PERCENTAGE_OF_RAM_FOR_HEAP=80.0
    - SERVICE_NAME=mediation-service
    - SERVICE_PORT=9887
    - IGNITE_TCP_DISCOVERY_ADDRESSES=localhost
    - JAVA_TOOL_OPTIONS=-Dcom.sun.management.jmxremote=true
      -Dcom.sun.management.jmxremote.rmi.port=19887
      -Dcom.sun.management.jmxremote.port=19887
      -Dcom.sun.management.jmxremote.local.only=false
      -Dcom.sun.management.jmxremote.authenticate=false
      -Dcom.sun.management.jmxremote.ssl=false
      -Djava.rmi.server.hostname=$HOST_IP
      -Djava.net.preferIPv4Stack=true
      -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=29887
    ...
    ...
    networks:
      - something-mediation-network

networks:
  something-mediation-network:
    driver: bridge
    ipam:
      driver: default
      config:
      - subnet: 186.30.240.0/24

Does anyone know what's going on here?

Thanks, Muthu

UPDATE (11/13/2020): I tried the same with 2.9.0, as suggested by @alamar, but got the same result. Please see below:

mediation-service - [ERROR] 01:03:16.871 [tcp-disco-srvr-[:47500]-#3-#50] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi   - Failed to accept TCP connection.
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:6620)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServerThread.body(ServerImpl.java:6543)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
mediation-service - [WARN ] 01:03:16.871 [RMI TCP Accept-19887] sun.rmi.transport.tcp   - RMI TCP Accept-19887: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=19887] throws
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
    at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [WARN ] 01:03:16.871 [RMI TCP Accept-0] sun.rmi.transport.tcp   - RMI TCP Accept-0: accept loop for ServerSocket[addr=0.0.0.0/0.0.0.0,localport=33351] throws
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:394)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:366)
    at java.base/java.lang.Thread.run(Thread.java:834)
mediation-service - [ERROR] 01:03:16.876 [tcp-disco-srvr-[:47500]-#3-#50]   - Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.net.SocketTimeoutException: Accept timed out]]
java.net.SocketTimeoutException: Accept timed out
    at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:458)
    at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
    at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:6620)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServerThread.body(ServerImpl.java:6543)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
mediation-service - [WARN ] 01:03:17.271 [tcp-disco-srvr-[:47500]-#3-#50] org.apache.ignite.internal.processors.cache.CacheDiagnosticManager    - Page locks dump:

UPDATE (11/18/2020):

I have another update: if I use Java 8 instead of Java 11, I don't see this issue during cluster activation and things work.

So I suspect this has something to do with the underlying Java library or its dependencies.

lmk
  • Note that I initially tried specifying 'localhost' directly in the config XML file, with the same result. – lmk Nov 13 '20 at 10:45
  • Have you tried `2.9.0`? – alamar Nov 13 '20 at 10:59
  • @alamar Not yet, do you think this has a fix in 2.9.0? – lmk Nov 13 '20 at 11:04
  • It's worth checking – alamar Nov 13 '20 at 12:00
  • @alamar Checked and got the same result. Updated the description above. – lmk Nov 14 '20 at 01:57
  • @lmk have you tried setting -Djava.net.preferIPv4Stack=true? – Semyon Danilov Nov 14 '20 at 11:15
  • @SemyonDanilov Unfortunately I already had that setting. Updated the snippet above to show that. – lmk Nov 15 '20 at 01:15
  • @lmk do you, by any chance, have iptables (and netfilter) up and running on your machine? It could be meddling with Docker's ports. – Semyon Danilov Nov 16 '20 at 23:04
  • @SemyonDanilov Thank you, I checked that. I don't have any iptables or firewalld, and I also see the same issue on all my peers' machines. – lmk Nov 17 '20 at 02:13
  • @lmk, ok, just to make sure: the problem only arises while using docker? – Semyon Danilov Nov 17 '20 at 05:44
  • @SemyonDanilov I suspect so, but sorry, I haven't checked that yet. I have another update to report, though: if I use Java 8 instead of the Java 11 I was using, it works fine, so it must have something to do with the behavior of an underlying library dependency. – lmk Nov 18 '20 at 08:31
  • Oh, in that case: have you exported all the needed modules and prohibited TLSv1.3? Like this: https://ignite.apache.org/docs/latest/quick-start/java#running-ignite-with-java-11-or-later – Semyon Danilov Nov 18 '20 at 11:59
  • @SemyonDanilov Thanks once again for that one. I don't know how I missed that; I will try it out. On TLS 1.3: no SSL is used, and the issue is seen on a single node. – lmk Nov 18 '20 at 19:52
  • @lmk any luck fixing it? – Semyon Danilov Nov 23 '20 at 08:21
  • Hi @SemyonDanilov, sorry, I got pulled into another work item, but I will get to this later this week and will make sure to update you on this thread. – lmk Nov 23 '20 at 21:01

1 Answer

The error means that the socket has a timeout set, and no incoming message was received during the timeout.

The funny thing is that the socket that Ignite creates has no timeout! Which suggests a bug somewhere...

... and this time it's in Java: JDK-8237858. The bug description says that the accept can be interrupted by a signal (which is expected), and that causes Java to throw the error (which is the bug).

According to the OpenJDK Jira, this doesn't affect Java 8. It is fixed in Java 16, and it also doesn't affect Java 13 with default settings.

I don't see mentions of fixes in Java 11 maintenance releases though.

UPDATE: There is a fix for this in Ignite 2.12 (https://issues.apache.org/jira/browse/IGNITE-15767). Basically, Ignite had to embed a workaround for the bug in its own code.
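
To illustrate the general idea of such a workaround, here is a minimal sketch. It is not the actual IGNITE-15767 patch; the class name `AcceptRetrySketch` and the helper `acceptIgnoringSpuriousTimeout` are made up for this example. It assumes that a `SocketTimeoutException` thrown by `accept()` while `SO_TIMEOUT` is 0 (meaning "block forever") can only be spurious, so the call is simply retried:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not code taken from Ignite.
public class AcceptRetrySketch {
    /**
     * Accepts one connection, retrying when accept() reports a timeout
     * even though no SO_TIMEOUT is configured on the server socket.
     */
    public static Socket acceptIgnoringSpuriousTimeout(ServerSocket server) throws IOException {
        while (true) {
            try {
                return server.accept();
            } catch (SocketTimeoutException e) {
                // getSoTimeout() == 0 means "no timeout", so a timeout here
                // cannot be genuine. Treat it as spurious and retry.
                if (server.getSoTimeout() != 0) {
                    throw e; // A real timeout was configured: propagate it.
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            // Connect a client first; the connection waits in the backlog until accepted.
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                 Socket accepted = acceptIgnoringSpuriousTimeout(server)) {
                System.out.println("SO_TIMEOUT = " + server.getSoTimeout());
                System.out.println("accepted = " + accepted.isConnected());
            }
        }
    }
}
```

In your own code you could wrap an accept loop like this. For Ignite's discovery server socket you can't do that from the outside, which is why the fix had to land in Ignite itself.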

Stanislav Lukyanov
  • Thank you @Stanislav Lukyanov. Unfortunately we are using Java 11 and can't upgrade :( I need to check on the maintenance fixes. – lmk Jul 30 '21 at 00:52
  • Thank you very much @Stanislav Lukyanov, this was a difficult one to catch. Do you know of a workaround? Although upgrading is an option for us, new JDK versions don't seem to be fully supported by Ignite. – lujop Mar 03 '22 at 06:00
  • @lujop There is a fix for this in 2.12: https://issues.apache.org/jira/browse/IGNITE-15767. Basically, Ignite had to embed a workaround for the bug in its own code. – Stanislav Lukyanov Mar 03 '22 at 13:54