0

I've encountered strange behaviour in Hazelcast Jet. I'm starting many jobs at once (~30, some are triggered slightly before other). However, when my Hazelcast Jet job count hits 26 (magic number?), all processing got stuck.

In the threadumps I see following info:

"hz._hzInstance_1_jet.cached.thread-1" #37 prio=5 os_prio=0 cpu=1093.29ms elapsed=393.62s tid=0x00007f95dc007000 nid=0x6bfc in Object.wait()  [0x00007f95e6af4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(java.base@11.0.2/Native Method)
    - waiting on <no object reference available>
    at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
    - waiting to re-lock in wait() <0x00000007864b7040> (a com.hazelcast.internal.util.SimpleCompletableFuture)
    at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
    at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
    at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
    at com.hazelcast.map.impl.proxy.MapProxySupport.clearInternal(MapProxySupport.java:1016)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.clearInternal(MapProxyImpl.java:109)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.clear(MapProxyImpl.java:698)
    at com.hazelcast.jet.impl.JobRepository.clearSnapshotData(JobRepository.java:464)
    at com.hazelcast.jet.impl.MasterJobContext.tryStartJob(MasterJobContext.java:233)
    at com.hazelcast.jet.impl.JobCoordinationService.tryStartJob(JobCoordinationService.java:776)
    at com.hazelcast.jet.impl.JobCoordinationService.lambda$submitJob$0(JobCoordinationService.java:200)
    at com.hazelcast.jet.impl.JobCoordinationService$$Lambda$634/0x00000008009ce840.run(Unknown Source)

and also:

"hz._hzInstance_1_jet.async.thread-2" #81 prio=5 os_prio=0 cpu=0.00ms elapsed=661.98s tid=0x0000025bb23ef000 nid=0x43bc in Object.wait()  [0x0000005d492fe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(java.base@11/Native Method)
    - waiting on <no object reference available>
    at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
    - waiting to re-lock in wait() <0x0000000725600100> (a com.hazelcast.internal.util.SimpleCompletableFuture)
    at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
    at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
    at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
    at com.hazelcast.map.impl.proxy.MapProxySupport.removeAllInternal(MapProxySupport.java:619)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.removeAll(MapProxyImpl.java:285)
    at com.hazelcast.jet.impl.JobRepository.deleteJob(JobRepository.java:332)
    at com.hazelcast.jet.impl.JobRepository.completeJob(JobRepository.java:316)
    at com.hazelcast.jet.impl.JobCoordinationService.completeJob(JobCoordinationService.java:576)
    at com.hazelcast.jet.impl.MasterJobContext.lambda$finalizeJob$13(MasterJobContext.java:620)
    at com.hazelcast.jet.impl.MasterJobContext$$Lambda$783/0x0000000800b26840.run(Unknown Source)
    at com.hazelcast.jet.impl.MasterJobContext.finalizeJob(MasterJobContext.java:632)
    at com.hazelcast.jet.impl.MasterJobContext.onCompleteExecutionCompleted(MasterJobContext.java:564)
    at com.hazelcast.jet.impl.MasterJobContext.lambda$invokeCompleteExecution$6(MasterJobContext.java:544)
    at com.hazelcast.jet.impl.MasterJobContext$$Lambda$779/0x0000000800b27840.accept(Unknown Source)
    at com.hazelcast.jet.impl.MasterContext.lambda$invokeOnParticipants$0(MasterContext.java:242)
    at com.hazelcast.jet.impl.MasterContext$$Lambda$726/0x0000000800a1c040.accept(Unknown Source)
    at com.hazelcast.jet.impl.util.Util$2.onResponse(Util.java:172)
    at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:256)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11/Thread.java:834)
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)

I don't have any other info how to reproduce this issue, however I hope someone will know how to fix this or my question will help someone else :)

My setup: - Java 11 - Hazelcast 3.12 Snapshot - Hazelcast Jet 3.0 Snapshot (I can't revert to previous version, it will break my logic; I need n:m joins which will be added in 3.1) - CPU cores: 4 - RAM: 7 GB - Jet mode: server, connects to other cluster as a client to insert final data.

Did anyone encounter similar issue? The problem is, it cannot be simply reproduces, thus it's hard to create an issue for Hazelcast Team. Only threaddumps and general behaviour can give a hint what is going on.

T. Gawęda
  • 15,706
  • 4
  • 46
  • 61
  • Note: I know I don't have MVCE, but, despite that, Hazelcast Team asked me to post this on StackOverflow, because solution may help other users. I assume they will have some ideas how to solve it :) – T. Gawęda Mar 20 '19 at 18:31
  • 1
    Could you create a GitHub issue instead as this is clearly a bug and not a question? A full thread dump would be great too if possible. It seems likely it's due to some deadlock as there's some blocking code is executed in the callback. One thing to try would be to increase the number of operation and partition threads by changing the group properties `PARTITION_OPERATION_THREAD_COUNT` and `GENERIC_OPERATION_THREAD_COUNT` for now. – Can Gencer Mar 21 '19 at 07:47
  • I reproduced the issue easily by submitting jobs in parallel. – Oliv Mar 21 '19 at 08:01
  • 1
    @CanGencer Done: https://github.com/hazelcast/hazelcast-jet/issues/1339 :) – T. Gawęda Mar 21 '19 at 09:40

1 Answers1

1

This was an issue in 3.0-SNAPSHOT during development and was fixed in the 3.0 release.

Oliv
  • 10,221
  • 3
  • 55
  • 76