I want to understand the internal locking that Apache Ignite uses for cache updates and Partition Map Exchange (PME):

  • Scenario 1: Persistent cache

    A persistent cache requires a lock during both checkpointing and put operations. I am trying to understand at what level Ignite takes this lock and whether it is a read lock or a write lock. Does it cover the whole cache, all caches, or only the objects being modified by the checkpoint and the cache update?

  • Scenario 2: In-memory cache

    An in-memory cache has no checkpointing, so locking happens only during cache updates. Is this lock taken at the cache level, across all caches, or only on the objects being updated? Is it a read lock or a write lock?

Since PME requires a write lock across all caches, I am trying to get clarity on the locking caused by cache operations.

Any pointers on the above points would be helpful.

Lokesh

1 Answer

Checkpoint lock

The checkpoint lock is central to how Ignite operates with persistence enabled, and you ask about it specifically, so let's look at it in detail.

Ignite will take a checkpoint read lock on a put operation, and on a very large number of other operations. Any operation that needs to access data or metadata will first acquire a checkpoint read lock.

The checkpoint write lock is acquired by only one thread, the one that starts the checkpointing process, every 3 minutes by default (the checkpointFrequency property). It holds the write lock only while it builds the list of pages to be checkpointed, which normally takes milliseconds at most.

The purpose of all this is to make sure that the checkpoint process can safely find the dirty pages that need to be written to disk.
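
The pattern described above can be sketched with a plain java.util.concurrent read/write lock. This is a simplified model for illustration only, not Ignite's actual implementation: the class name, the integer page IDs, and the dirty-page set are all hypothetical. It shows updates sharing the read lock while the checkpointer holds the write lock only long enough to swap out the dirty set, leaving the slow "disk write" outside the lock.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Simplified, hypothetical model of a checkpoint lock; not Ignite's real code. */
final class CheckpointLockModel {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();

    // Hypothetical dirty-page set: IDs of pages modified since the last checkpoint.
    private Set<Integer> dirtyPages = ConcurrentHashMap.newKeySet();

    // Stand-in for pages that have been written to disk.
    private final Set<Integer> flushedPages = ConcurrentHashMap.newKeySet();

    /** A put-like operation: takes the shared read lock, so many updates can run at once. */
    void update(int pageId) {
        checkpointLock.readLock().lock();
        try {
            dirtyPages.add(pageId);
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    /** A checkpoint: the exclusive lock is held only while the dirty set is swapped. */
    Set<Integer> checkpoint() {
        Set<Integer> toWrite;
        checkpointLock.writeLock().lock();
        try {
            toWrite = dirtyPages;
            dirtyPages = ConcurrentHashMap.newKeySet();
        } finally {
            checkpointLock.writeLock().unlock();
        }
        // The slow part (page writes and fsync in a real system) runs without the lock.
        flushedPages.addAll(toWrite);
        return toWrite;
    }

    public static void main(String[] args) {
        CheckpointLockModel model = new CheckpointLockModel();
        model.update(1);
        model.update(2);
        model.update(1); // same page again, still one dirty entry
        System.out.println(model.checkpoint().size()); // prints 2
        System.out.println(model.checkpoint().size()); // prints 0
    }
}
```

In this model the write lock is global rather than per cache or per entry, which matches the description above of a single lock that every data-touching operation shares in read mode.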

Other locks

There are actually many different locks you need to acquire when doing reads and writes: segment locks, page locks, entry locks, and so on. Ignite is a highly concurrent, multithreaded system, and because of that it requires a lot of locking. Basically, every component introduces its own concurrency and locking to the whole system.

If you really want to learn about the details, I suggest you start with the design documents on the Ignite wiki. Here are a few of particular interest:

Stanislav Lukyanov
  • Thanks @Stanislav. I have been going through the Ignite wiki and it's useful. Based on the above details about checkpoint locking: if the "db-checkpoint-runner" thread (the one that acquires the write lock for checkpointing) is slow because of many updates, or if disk performance is poor, then the write lock will be held much longer, which implies that cache put operations will also be blocked. Is this understanding correct? – Lokesh Jan 24 '23 at 02:15
  • Not exactly. The write lock is held only while the checkpointer thread collects the list of dirty pages to be written to disk. It is really just iterating over some maps in Java memory, which should be very fast. You can actually see the timings of the various checkpoint stages in the "Checkpoint started" and "Checkpoint finished" messages. Normally, only writing pages to disk and doing an fsync at the end take significant time. – Stanislav Lukyanov Jan 24 '23 at 03:28
  • Thanks. In our logs we see this: "Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=db-checkpoint-thread, threadName=db-checkpoint-thread-#95%pv-ib-valuation%, blockedFor=1481s]". So db-checkpoint-thread is blocked. If it only has to collect the list of dirty pages, then I believe it must be struggling to get the write lock? Does that mean Ignite cache updates are taking longer and the read lock is not being released? Any suggestions here? – Lokesh Jan 24 '23 at 08:31
  • Also, can you please confirm which thread writes dirty pages to disk, and whether this thread also acquires any lock? In summary, I am evaluating whether slow disk writes can slow down my grid because of locking. – Lokesh Jan 24 '23 at 10:57
  • A slow disk may slow down the grid because, in the general case, any operation might need to access the disk. Say you read a key: the read obtains a read lock and then needs to fetch the page containing the value from disk while still holding the read lock. – Stanislav Lukyanov Jan 24 '23 at 13:01
  • The problem you see may have several causes, but generally yes, someone is not releasing the read lock. You should analyze a thread dump and/or heap dump to find which thread is holding the lock and why it is not making progress. Sometimes it's a bug in the user code, sometimes it's a bug in the platform, sometimes it's too much load, etc. – Stanislav Lukyanov Jan 24 '23 at 13:03
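
The situation discussed in these comments, where the checkpointer waits because some operation has not released the read lock, can be reproduced in miniature with a standard ReentrantReadWriteLock. This is only a sketch under assumptions: the class, method, and thread names are hypothetical, and the sleep stands in for whatever keeps a real operation from finishing (slow disk access, a bug, heavy load).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical demo: a stuck read-lock holder keeps the writer from acquiring the lock. */
final class BlockedCheckpointerDemo {

    /**
     * Starts a reader that holds the read lock for readerHoldMs, then reports whether
     * a writer could acquire the write lock within timeoutMs.
     */
    static boolean writerAcquires(long readerHoldMs, long timeoutMs) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch readerHasLock = new CountDownLatch(1);

        Thread reader = new Thread(() -> {
            lock.readLock().lock();
            try {
                readerHasLock.countDown();
                Thread.sleep(readerHoldMs); // stands in for a slow or stuck operation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        }, "slow-cache-operation");
        reader.start();

        try {
            readerHasLock.await();
            // The "checkpointer" gives up after timeoutMs instead of waiting forever.
            boolean acquired = lock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.writeLock().unlock();
            }
            reader.interrupt(); // wake the reader so the demo ends promptly
            reader.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writerAcquires(2000, 100)); // reader still holds the lock: false
        System.out.println(writerAcquires(10, 2000));  // reader finishes quickly: true
    }
}
```

In a thread dump of such a process (taken with jstack, for example), the thread named slow-cache-operation would show where the read-lock holder is stuck, which is the kind of analysis suggested above.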