1

I have N sets of 1 producer to 1 consumer. The consumer will write to Chronicle Queue. I just found out today that there is an error I have not previously seen before.

Exception in thread "TaskConsumer2" Exception in thread "TaskConsumer0" net.openhft.chronicle.wire.UnrecoverableTimeoutException: Couldn't acquire write lock after 15000 ms for the lock file:./chronicle/roll/metadata.cq4t. Lock was held by me. You can manually unlock with net.openhft.chronicle.queue.main.UnlockMain
        at net.openhft.chronicle.queue.impl.single.TableStoreWriteLock.lock(TableStoreWriteLock.java:96)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.prepareAndReturnWriteContext(StoreAppender.java:430)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.writingDocument(StoreAppender.java:406)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.writingDocument(StoreAppender.java:394)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.writeBytes(StoreAppender.java:194)
        at service.producerconsumer.TaskConsumer.runTask(TaskConsumer.java:80)
        at service.producerconsumer.TaskConsumer.run(TaskConsumer.java:142)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Couldn't acquire write lock after 15000 ms for the lock file:./chronicle/roll/metadata.cq4t. Lock was held by me. You can manually unlock with net.openhft.chronicle.queue.main.UnlockMain
        ... 8 more
net.openhft.chronicle.wire.UnrecoverableTimeoutException: Couldn't acquire write lock after 15000 ms for the lock file:./chronicle/roll/metadata.cq4t. Lock was held by me. You can manually unlock with net.openhft.chronicle.queue.main.UnlockMain
        at net.openhft.chronicle.queue.impl.single.TableStoreWriteLock.lock(TableStoreWriteLock.java:96)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.prepareAndReturnWriteContext(StoreAppender.java:430)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.writingDocument(StoreAppender.java:406)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.writingDocument(StoreAppender.java:394)
        at net.openhft.chronicle.queue.impl.single.StoreAppender.writeBytes(StoreAppender.java:194)
        at service.producerconsumer.TaskConsumer.runTask(TaskConsumer.java:80)
        at service.producerconsumer.TaskConsumer.run(TaskConsumer.java:142)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Couldn't acquire write lock after 15000 ms for the lock file:./chronicle/roll/metadata.cq4t. Lock was held by me. You can manually unlock with net.openhft.chronicle.queue.main.UnlockMain
        ... 8 more

Is this due to multi-thread access to chronicle queue? I reuse the chronicle queue and use thread local appenders and the following is a sample of how I use the class.

public class TaskConsumer implements Runnable {
    private final ChronicleQueue QUEUE;
    private CustomQueueClass queue;
    private ExcerptAppender APPENDER;
    //other code

    public TaskConsumer(ChronicleQueue queue) {
        this.QUEUE= queue;
        //instantiate queue
        //other code
    }

    private long millis;
    private long nanos;
    private ByteBuffer buffer;
    private InetAddress remoteAdd;
    private int remotePort;
    private String ni;
    private int remaining;
    private int senderId;
    private long seqNum;
    private MoldUdpHeader moldUdpHeader = new MoldUdpHeader();
    private final PrimitiveIntPair pair = new PrimitiveIntPair(0, 0);
    private final WriteBytesMarshallable marshallable = (bytes) -> bytes.writeLong(this.millis)
        .writeLong(this.nanos)
        .write(this.remoteAdd.getAddress())
        .writeInt(this.remotePort)
        .writeUtf8(this.ni)
        .writeInt(this.remaining)
        .writeInt(this.senderId)
        .writeLong(this.seqNum)
        .write(this.buffer.array(), 0, this.remaining);  //sbe-style writes seqNum, remoteAddress, and the ByteBuffer

    private void runTask() {
        LOGGER.debug(logMarker, "{} {} {} {} {} | senderId: {} seqNum: {} msgCnt: {}",
            () -> ZonedDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneId.of("Asia/Hong_Kong")),
            () -> remoteAdd.getHostName(), () -> remotePort, () -> ni,
            () -> remaining, () -> moldUdpHeader.getSenderId(), () -> moldUdpHeader.getSeqNum(),
            () -> moldUdpHeader.getMsgCnt());
        this.APPENDER.writeBytes(marshallable);  //<--error thrown here
    }

    public void run() {
        this.APPENDER = QUEUE.acquireAppender();
        TaskHolder task = null;
        while (true) {
            if (Thread.currentThread().isInterrupted()) {
                closeAppender();
                break;
            }
            if (task == null) {
                try {
                    task = queue.getForConsuming(TIMEOUT, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }

            if (task != null) {
                buffer = task.getByteBuffer();
                if (task.getUdpChannel() != null) {
                    remoteAdd = task.getUdpChannel().getGROUP_ADDRESS();
                    remotePort = task.getUdpChannel().getPORT();
                    millis = task.getMillis();
                    nanos = task.getNanos();
                    ni = task.getUdpChannel().getNETWORK_INTERFACE().getName();
                    remaining = buffer.remaining();

                    if (DECODING.equals("TRUE")) {
                        moldUdpHeader = (MoldUdpHeader) moldUdpHeader.decode(buffer, 0);
                    }
                    senderId = moldUdpHeader.getSenderId();
                    seqNum = moldUdpHeader.getSeqNum();

                    pair.setId(moldUdpHeader.getSenderId());
                    pair.setIndex(getIndex(task.getUdpChannel()));

                    triplesHashmap.computeIfAbsent(pair.copy(), k -> (DECODING.equals("TRUE")) ?
                        new Triple<>(new MutableLong(Long.MIN_VALUE), new LongArrayVisitationCounter(10000000), new PacketStats()) :
                        new Triple<>(new MutableLong(Long.MIN_VALUE), new LongArrayVisitationCounter(10), new PacketStats()));  //using a supplier to lazily instantiate
                    runTask();  //<--- error thrown here
                    synchronized (triplesHashmap.get(pair).getType3()) {
                        if (DECODING.equals("TRUE")) {
                            checkReadValueSequence(triplesHashmap.get(pair), pair, moldUdpHeader.getSeqNum());
                        } else {
                            PacketStats stats = triplesHashmap.get(pair).getType3();
                            stats.incrementPacketsRead();
                            stats.incrementBytesReadBy(remaining);
                        }
                    }
                }
                task.clearValues();
                queue.incrementReadIndex();
                task = null;
            }
        }
    }

    //other code
}

What is weird is that I have deployed the jar onto multiple servers, but it is just this one server that has this error. The other servers are working as per normal. I could use try-catch to just ignore the error and recursion on the catch to retry running the task, but I would like to know what is causing this and how to avoid it

chunkynuggy
  • 147
  • 8

1 Answers1

2

It sounds like you are going the right thing. You can share the Chronicle Queue, a thread-local appender and tailer shouldn't have a problem. Some older versions had problems with clean-up of resources esp if a thread died. We have better control over that now.

I suggest you try 5.23.37 or 5.24ea17.

BTW DECODING.equals("TRUE") is expensive to do every time. I suggest caching this in a local variable outside the loop.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • @experimentunit1998X it's relatively expensive esp as it's easy to put into a local variable – Peter Lawrey Jun 04 '23 at 11:49
  • would using the 5.24ea17 version prevent the error from happening entirely? I think i better not use recursion since it might be possible to recurse out of control. i ll just use a do-while loop with try catch instead. – chunkynuggy Jun 05 '23 at 06:14
  • @chunkynuggy the assumption is that you will write a single message while holding the lock. You won't try to write a message in the middle of writing a message in the same thread, also you should try to write the message in less than 10 microseconds typically. 15 seconds is a staggingly long amount of time for a computer and anything this long should be avoided. Even full GCs or updating a database on the other side of the world shouldn't take this long, though you should avoid doing either while writing a message. – Peter Lawrey Jun 07 '23 at 05:12
  • @chunkynuggy I would check if debug is enabled as your debug line generates a lot of garbage around 10 objects probably. – Peter Lawrey Jun 07 '23 at 05:16
  • The debug is not on. its on info only or above. I was also referring to the timeout in the error log `Couldn't acquire write lock after 15000 ms for the lock file`. Meaning that while the consumer is trying to write to the queue, and is trying to acquire the write lock, the producer is also filling up a "task queue" that the consumer will read from. So when the TimeOutException occurs 15s later, Does it might mean that there could be data loss since the task queue might have been filled, but the appender is still trying to acquire the lock? – chunkynuggy Jun 07 '23 at 07:58
  • @chunkynuggy The queue is only full when you have run out of disk space. In this case, it is trying to acquire a lock it has already acquired. This is a problem because the lock is not relevant ie you should be trying to acquire a lock while you already have one. – Peter Lawrey Jun 16 '23 at 09:01