3

When an exception occurs in our Infinispan cluster (version 9.4.8.Final), the node that has the exception sends this information to other nodes in the cluster. This seems to be by design.

This activity can cause so much traffic that it causes timeout exceptions which, in turn, make the nodes want to communicate their timeout exceptions to the other nodes. In production, our 3 node Infinispan cluster completely saturated a 20 Gb/s link.

For example, in a 2 node QA cluster we observe the following:

Node 1:

ISPN000476: Timed out waiting for responses for request 7861 from node2

Node 2:

ISPN000217: Received exception from node1, see cause for remote stack trace

Further down the stack trace printed on node 2, we see:

Timed out waiting for responses for request 7861 from node2

There are a large number of these. We took a packet capture during this and can see that there are 50 KB packets containing a list of the remote errors along with their entire java stack traces.

When this happens, it is a "perfect storm" of sorts. Every timeout produces an error that gets sent over the network. This increases congestion and timeouts. From there things worsen extremely quickly.

I understand that I need to troubleshoot the timeout issue - looking for GC collection pauses, etc, and perhaps consider increasing the timeout. However, I'd like to know if there is a way to stop this behavior from happening when these events occur. When you think about it, it seems odd for node 1 to timeout talking to node 2 and then send node 2 a copy of the error over the network telling it 'I timed out talking to you'.

Is there any way to avoid the transmission of these remote stack traces? Many thanks in advance for any insight or advice.

EDIT

Example stack trace:

2019-12-06 11:37:01,587 ERROR [org.keycloak.services.error.KeycloakErrorHandler] (default task-26) Uncaught server error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from ********, see cause for remote stack trace
        at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:28)
        at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
        at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
        at org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
        at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:35)
        at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:52)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1372)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1275)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1420)
        at org.jgroups.JChannel.up(JChannel.java:816)
        at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:133)
        at org.jgroups.stack.Protocol.up(Protocol.java:340)
        at org.jgroups.protocols.FORK.up(FORK.java:141)
        at org.jgroups.protocols.FRAG3.up(FRAG3.java:171)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:872)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1008)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:734)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:389)
        at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:590)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:131)
        at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:203)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:253)
        at org.jgroups.protocols.MERGE3.up(MERGE3.java:280)
        at org.jgroups.protocols.Discovery.up(Discovery.java:295)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1249)
        at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.jboss.as.clustering.jgroups.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:52)
        at java.lang.Thread.run(Thread.java:745)
        Suppressed: org.infinispan.util.logging.TraceException
                at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:41)
                at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
                at org.infinispan.cache.impl.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1918)
                at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1433)
                at org.infinispan.cache.impl.DecoratedCache.put(DecoratedCache.java:685)
                at org.infinispan.cache.impl.DecoratedCache.put(DecoratedCache.java:240)
                at org.infinispan.cache.impl.AbstractDelegatingCache.put(AbstractDelegatingCache.java:116)
                at org.infinispan.cache.impl.AbstractDelegatingCache.put(AbstractDelegatingCache.java:116)
                at org.infinispan.cache.impl.EncoderCache.put(EncoderCache.java:195)
                at org.infinispan.cache.impl.AbstractDelegatingCache.put(AbstractDelegatingCache.java:116)
                at org.keycloak.cluster.infinispan.InfinispanNotificationsManager.notify(InfinispanNotificationsManager.java:155)
                at org.keycloak.cluster.infinispan.InfinispanClusterProvider.notify(InfinispanClusterProvider.java:130)
                at org.keycloak.models.cache.infinispan.CacheManager.sendInvalidationEvents(CacheManager.java:206)
                at org.keycloak.models.cache.infinispan.UserCacheSession.runInvalidations(UserCacheSession.java:140)
                at org.keycloak.models.cache.infinispan.UserCacheSession$1.commit(UserCacheSession.java:152)
                at org.keycloak.services.DefaultKeycloakTransactionManager.commit(DefaultKeycloakTransactionManager.java:146)
                at org.keycloak.services.resources.admin.UsersResource.createUser(UsersResource.java:125)
                at sun.reflect.GeneratedMethodAccessor487.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:498)
                at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:139)
                at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:510)
                at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:400)
                at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$0(ResourceMethodInvoker.java:364)
                at org.jboss.resteasy.core.interception.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:355)
                at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:366)
                at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:338)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invokeOnTargetObject(ResourceLocatorInvoker.java:137)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invoke(ResourceLocatorInvoker.java:106)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invokeOnTargetObject(ResourceLocatorInvoker.java:132)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invoke(ResourceLocatorInvoker.java:106)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invokeOnTargetObject(ResourceLocatorInvoker.java:132)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invoke(ResourceLocatorInvoker.java:100)
                at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:439)
                at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:229)
                at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:135)
                at org.jboss.resteasy.core.interception.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:355)
                at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:138)
                at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:215)
                at org.jboss.resteasy.plugins.server.servlet.ServletContainerDispatcher.service(ServletContainerDispatcher.java:227)
                at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:56)
                at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:51)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:791)
                at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:74)
                at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:129)
                at org.keycloak.services.filters.KeycloakSessionServletFilter.doFilter(KeycloakSessionServletFilter.java:90)
                at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
                at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
                at io.undertow.servlet.handlers.FilterHandler.handleRequest(FilterHandler.java:84)
                at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62)
                at io.undertow.servlet.handlers.ServletChain$1.handleRequest(ServletChain.java:68)
                at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36)
                at org.wildfly.extension.undertow.security.SecurityContextAssociationHandler.handleRequest(SecurityContextAssociationHandler.java:78)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:132)
                at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46)
                at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64)
                at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60)
                at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77)
                at io.undertow.security.handlers.NotificationReceiverHandler.handleRequest(NotificationReceiverHandler.java:50)
                at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at org.wildfly.extension.undertow.security.jacc.JACCContextIdHandler.handleRequest(JACCContextIdHandler.java:61)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at org.wildfly.extension.undertow.deployment.GlobalRequestControllerHandler.handleRequest(GlobalRequestControllerHandler.java:68)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:292)
                at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:81)
                at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:138)
                at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:135)
                at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48)
                at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43)
                at org.wildfly.extension.undertow.security.SecurityContextThreadSetupAction.lambda$create$0(SecurityContextThreadSetupAction.java:105)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:272)
                at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:81)
                at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:104)
                at io.undertow.server.Connectors.executeRootHandler(Connectors.java:364)
                at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830)
                at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
                at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1982)
                at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
                at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1363)
                ... 1 more
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7865 from ********
        at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
        at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
        at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
Josh Harness
  • 377
  • 1
  • 5
  • 16

2 Answers2

7

We were able to work around the issue that caused the massive spike in network traffic. Details are below.

tl;dr:

We switched from a JGroups UDP stack to TCP stack which the ISPN docs mentioned can be more efficient for small clusters using distributed caches like ours.

Reproducing the Issue

To reproduce the issue, we did the following:

  • Configure the Keycloak background job that purges ISPN cache entries to run every 60 seconds so we won't have to wait for the job to run on its 15 minute interval (standalone-ha.xml):

    <scheduled-task-interval>60</scheduled-task-interval>

  • Generate a large volume of user sessions (we used jmeter). In our tests, we ended up generating about 100,000 sessions.

  • Configure the SSO Idle and Max time TTL in Keycloak to be extremely short so that all the sessions will expire (60 seconds)
  • Continue to place load on the system with jmeter (this part is important)

When the job to purge the cache would run, we would see the network flood with traffic (120 MB/s). When this happened, we'd see tons of the following errors on each node in the cluster:

ISPN000476: Timed out waiting for responses for request 7861 from node2

ISPN000217: Received exception from node1, see cause for remote stack trace

Pro-Tip: Configure passivation to a file store to persist your ISPN data. Shut down your cluster and save the ".dat" files elsewhere. Use these files to instantly restore the state of your ISPN cluster in between tests.

Fixing the Issue

Using the technique above, we were able to reproduce the issue upon demand. As such, we set out to resolve it using the approaches described below.

Change JGroups Stack to Use TCP

We changed the JGroups stack from UDP to TCP and also configured TCPPing for discovery. We did this after reading the description for the TCP stack in the following guide:

https://infinispan.org/docs/stable/titles/configuring/configuring.html#preconfigured_jgroups_stacks-configuring

Specifically:

"Uses TCP for transport and UDP multicast for discovery. Suitable for smaller clusters (under 100 nodes) only if you are using distributed caches because TCP is more efficient than UDP as a point-to-point protocol."

This one change completely eliminated our issue

Our Wildfly 16 config in standalone-ha.xml is as follows:

<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
        <channels default="ee">
            <channel name="ee" stack="tcp" cluster="ejb"/>
        </channels>
        <stacks>
            <stack name="udp">
                <transport type="UDP" socket-binding="jgroups-udp"/>
                <protocol type="PING"/>
                <protocol type="MERGE3"/>
                <protocol type="FD_SOCK"/>
                <protocol type="FD_ALL"/>
                <protocol type="VERIFY_SUSPECT"/>
                <protocol type="pbcast.NAKACK2"/>
                <protocol type="UNICAST3"/>
                <protocol type="pbcast.STABLE"/>
                <protocol type="pbcast.GMS"/>
                <protocol type="UFC"/>
                <protocol type="MFC"/>
                <protocol type="FRAG3"/>
            </stack>
            <stack name="tcp">
                <transport type="TCP" socket-binding="jgroups-tcp"/>
                <socket-protocol type="TCPPING" socket-binding="jgroups-tcp">
                  <property name="initial_hosts">HOST-X[7600],HOST-Y[7600],HOST-Z[7600]</property>
                  <property name="port_range">1</property>
                </socket-protocol>
                <protocol type="MERGE3"/>
                <protocol type="FD_SOCK"/>
                <protocol type="FD_ALL"/>
                <protocol type="VERIFY_SUSPECT"/>
                <protocol type="pbcast.NAKACK2"/>
                <protocol type="UNICAST3"/>
                <protocol type="pbcast.STABLE"/>
                <protocol type="pbcast.GMS"/>
                <protocol type="MFC"/>
                <protocol type="FRAG3"/>
            </stack>
        </stacks>
    </subsystem>

Tune JVM Garbage Collector Arguments

We followed some of the advice in the ISPN tuning guide:

https://infinispan.org/docs/stable/titles/tuning/tuning.html

In particular, we changed from JDK 8's default GC to using the CMS collector. Specifically, our Wildfly server now has the following JVM arguments supplied to it:

-Xms6144m -Xmx6144m -Xmn1536M -XX:MetaspaceSize=192M -XX:MaxMetaspaceSize=512m -Djava.net.preferIPv4Stack=true -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC

Misc Changes

We applied other changes particular to our environment:

  • Blocked IP Multicast + UDP at iptables level (since we wanted to be sure we were TCP-only)
  • Configured a bandwidth cap at the network level to prevent the ISPN cluster from saturating the network and impacting other hosts that utilize the same link.
Josh Harness
  • 377
  • 1
  • 5
  • 16
1

No, there is no way to disable the serialization of stack traces in Infinispan 9.4.x

Infinispan 10.0.0.Final does not include the stack trace in exception responses, but it was just a side effect of some other work, and I already opened ISPN-11022 to add back the remote stack traces.

Please add a comment in the issue and include a full stack trace example.

Dan Berindei
  • 7,054
  • 3
  • 41
  • 48
  • Thanks, Dan! I edited the original post and included the requested trace. We're using keycloak 6.0.0 with standalone wildfly (version 16). From looking at the pom files of wildfly, it is using infinispan 9.4.8.Final – Josh Harness Dec 09 '19 at 13:51