For a project we are sending events to Kafka using spring-kafka 2.6.2.
Because we use spring-vault, we have to restart/kill the application before the end of the credentials lease (the application is automatically restarted by Kubernetes). Our problem is that when we call applicationContext.close() to perform a graceful shutdown, KafkaProducer gets an InterruptedException ("Interrupted while joining ioThread") inside its close() method. As a result, some pending events are not sent to Kafka before shutdown, because the producer is forced to close due to the error during destroy.
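For context, the events are published with a KafkaTemplate along these lines (a simplified, hypothetical sketch; the real class and method names differ):

import java.util.UUID;
import lombok.RequiredArgsConstructor;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class EventPublisher {

    private final KafkaTemplate<UUID, Object> kafkaTemplate;

    public void publish(Object event) {
        // sendDefault() uses spring.kafka.template.default-topic from the yaml below
        kafkaTemplate.sendDefault(UUID.randomUUID(), event)
                .addCallback(
                        result -> { /* sent successfully */ },
                        ex -> { /* record/handle the failure */ });
    }
}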
Here is the stack trace:
2020-12-18 13:57:29.007 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.s.b.w.e.tomcat.GracefulShutdown : Commencing graceful shutdown. Waiting for active requests to complete
2020-12-18 13:57:29.009 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.apache.catalina.core.StandardService : Stopping service [Tomcat]
2020-12-18 13:57:29.013 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.a.c.c.C.[Tomcat].[localhost].[/] : Destroying Spring FrameworkServlet 'dispatcherServlet'
2020-12-18 13:57:29.014 INFO [titan-producer,,,] 1 --- [tomcat-shutdown] o.s.b.w.e.tomcat.GracefulShutdown : Graceful shutdown complete
2020-12-18 13:57:29.020 WARN [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.a.c.loader.WebappClassLoaderBase : The web application [ROOT] appears to have started a thread named [kafka-producer-network-thread | titan-producer-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
java.base@11.0.9.1/sun.nio.ch.EPoll.wait(Native Method)
java.base@11.0.9.1/sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
java.base@11.0.9.1/sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
java.base@11.0.9.1/sun.nio.ch.SelectorImpl.select(Unknown Source)
org.apache.kafka.common.network.Selector.select(Selector.java:873)
org.apache.kafka.common.network.Selector.poll(Selector.java:469)
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:544)
org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325)
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240)
java.base@11.0.9.1/java.lang.Thread.run(Unknown Source)
2020-12-18 13:57:29.021 WARN [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.a.c.loader.WebappClassLoaderBase : The web application [ROOT] appears to have started a thread named [micrometer-kafka-metrics] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
java.base@11.0.9.1/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.9.1/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.base@11.0.9.1/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.base@11.0.9.1/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.base@11.0.9.1/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.base@11.0.9.1/java.lang.Thread.run(Unknown Source)
2020-12-18 13:57:29.046 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService 'applicationTaskExecutor'
2020-12-18 13:57:29.048 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.s.s.c.ThreadPoolTaskScheduler : Shutting down ExecutorService 'taskScheduler'
2020-12-18 13:57:29.051 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.a.k.clients.producer.KafkaProducer : [Producer clientId=titan-producer-1] Closing the Kafka producer with timeoutMillis = 30000 ms.
2020-12-18 13:57:29.055 ERROR [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.a.k.clients.producer.KafkaProducer : [Producer clientId=titan-producer-1] Interrupted while joining ioThread
java.lang.InterruptedException: null
at java.base/java.lang.Object.wait(Native Method)
at java.base/java.lang.Thread.join(Unknown Source)
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1205)
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1182)
at org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer.closeDelegate(DefaultKafkaProducerFactory.java:901)
at org.springframework.kafka.core.DefaultKafkaProducerFactory.destroy(DefaultKafkaProducerFactory.java:428)
at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:258)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1092)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1085)
at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1061)
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1030)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:170)
at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:979)
at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:68)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2020-12-18 13:57:29.055 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.a.k.clients.producer.KafkaProducer : [Producer clientId=titan-producer-1] Proceeding to force close the producer since pending requests could not be completed within timeout 30000 ms.
2020-12-18 13:57:29.056 WARN [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.s.b.f.support.DisposableBeanAdapter : Invocation of destroy method failed on bean with name 'kafkaProducerFactory': org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
2020-12-18 13:57:29.064 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] o.s.s.concurrent.ThreadPoolTaskExecutor : Shutting down ExecutorService
2020-12-18 13:57:29.065 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] c.l.t.p.zookeeper.ZookeeperManagerImpl : Closing zookeeperConnection
2020-12-18 13:57:29.197 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] org.apache.zookeeper.ZooKeeper : Session: 0x30022348ba6000b closed
2020-12-18 13:57:29.197 INFO [titan-producer,,,] 1 --- [d-1-EventThread] org.apache.zookeeper.ClientCnxn : EventThread shut down for session: 0x30022348ba6000b
2020-12-18 13:57:29.206 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] com.zaxxer.hikari.HikariDataSource : loadtest_fallback_titan_pendingEvents - Shutdown initiated...
2020-12-18 13:57:29.221 INFO [titan-producer,222efdd2a07966ce,222efdd2a07966ce,true] 1 --- [ scheduling-1] com.zaxxer.hikari.HikariDataSource : loadtest_fallback_titan_pendingEvents - Shutdown completed.
Here is my configuration class:
@Flogger
@EnableKafka
@Configuration
@RequiredArgsConstructor
@ConditionalOnProperty(
        name = "titan.producer.kafka.enabled",
        havingValue = "true",
        matchIfMissing = true)
public class KafkaConfiguration {

    @Bean
    DefaultKafkaProducerFactoryCustomizer kafkaProducerFactoryCustomizer(ObjectMapper mapper) {
        return producerFactory -> producerFactory.setValueSerializer(new JsonSerializer<>(mapper));
    }

    @Bean
    public NewTopic createTopic(TitanProperties titanProperties, KafkaProperties kafkaProperties) {
        TitanProperties.Kafka kafka = titanProperties.getKafka();
        String defaultTopic = kafkaProperties.getTemplate().getDefaultTopic();
        int numPartitions = kafka.getNumPartitions();
        short replicationFactor = kafka.getReplicationFactor();
        log.atInfo()
            .log("Creating Kafka Topic %s with %s partitions and %s replicationFactor", defaultTopic, numPartitions, replicationFactor);
        return TopicBuilder.name(defaultTopic)
                .partitions(numPartitions)
                .replicas(replicationFactor)
                .config(MESSAGE_TIMESTAMP_TYPE_CONFIG, LOG_APPEND_TIME.name)
                .build();
    }
}
And my application.yaml:
spring:
  application:
    name: titan-producer
  kafka:
    client-id: ${spring.application.name}
    producer:
      key-serializer: org.apache.kafka.common.serialization.UUIDSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      properties:
        max.block.ms: 2000
        request.timeout.ms: 2000
        delivery.timeout.ms: 2000 # must be greater than or equal to request.timeout.ms + linger.ms
    template:
      default-topic: titan-dev
Our Vault configuration, which executes applicationContext.close() from a scheduled task. The restart instant is chosen somewhat randomly because we run multiple replicas of the app in parallel and want to avoid all of them being killed at the same time.
@Flogger
@Configuration
@ConditionalOnBean(SecretLeaseContainer.class)
@ConditionalOnProperty(
        name = "titan.producer.scheduling.enabled",
        havingValue = "true",
        matchIfMissing = true)
public class VaultConfiguration {

    @Bean
    public Lifecycle scheduledAppRestart(Clock clock, TitanProperties properties, TaskScheduler scheduler, ConfigurableApplicationContext applicationContext) {
        Instant now = clock.instant();
        Duration maxTTL = properties.getVaultConfig().getCredsMaxLease();
        Instant start = now.plusSeconds(maxTTL.dividedBy(2).toSeconds());
        Instant end = now.plusSeconds(maxTTL.minus(properties.getVaultConfig().getCredsMaxLeaseExpirationThreshold()).toSeconds());
        Instant randomInstant = randBetween(start, end);
        return new ScheduledLifecycle(scheduler, applicationContext::close, "application restart before lease expiration", randomInstant);
    }

    private Instant randBetween(Instant startInclusive, Instant endExclusive) {
        long startSeconds = startInclusive.getEpochSecond();
        long endSeconds = endExclusive.getEpochSecond();
        long random = RandomUtils.nextLong(startSeconds, endSeconds);
        return Instant.ofEpochSecond(random);
    }
}
The ScheduledLifecycle class we use to run the scheduled task:
import lombok.extern.flogger.Flogger;
import org.springframework.context.SmartLifecycle;
import org.springframework.scheduling.TaskScheduler;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ScheduledFuture;

@Flogger
public class ScheduledLifecycle implements SmartLifecycle {

    private ScheduledFuture<?> future = null;
    private Duration delay = null;

    private final TaskScheduler scheduler;
    private final Runnable command;
    private final String commandDesc;
    private final Instant startTime;

    public ScheduledLifecycle(TaskScheduler scheduler, Runnable command, String commandDesc, Instant startTime) {
        this.scheduler = scheduler;
        this.command = command;
        this.commandDesc = commandDesc;
        this.startTime = startTime;
    }

    public ScheduledLifecycle(TaskScheduler scheduler, Runnable command, String commandDesc, Instant startTime, Duration delay) {
        this(scheduler, command, commandDesc, startTime);
        this.delay = delay;
    }

    @Override
    public void start() {
        if (delay != null) {
            log.atInfo().log("Scheduling %s: starting at %s, running every %s", commandDesc, startTime, delay);
            future = scheduler.scheduleWithFixedDelay(command, startTime, delay);
        } else {
            log.atInfo().log("Scheduling %s: execution at %s", commandDesc, startTime);
            future = scheduler.schedule(command, startTime);
        }
    }

    @Override
    public void stop() {
        if (future != null) {
            log.atInfo().log("Stop %s", commandDesc);
            // cancel(true) interrupts the task if it is currently running
            future.cancel(true);
        }
    }

    @Override
    public boolean isRunning() {
        boolean running = future != null && !future.isDone() && !future.isCancelled();
        log.atFine().log("is %s running? %s", commandDesc, running);
        return running;
    }
}
Is this a bug in spring-kafka? Any ideas?
Thanks