Apache Ignite cache write is visible to other client after a delay

Question

We have an 8-node Ignite cluster in production. Below is the configuration for one of the caches.

<bean id="cache-template-bean" abstract="true"
      class="org.apache.ignite.configuration.CacheConfiguration">
  <property name="name" value="inputDataCacheTemplate*"/>
  <property name="cacheMode" value="PARTITIONED"/>
  <property name="backups" value="1"/>
  <property name="atomicityMode" value="ATOMIC"/>
  <property name="dataRegionName" value="dr.prod.input"/>
  <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
  <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
  <property name="statisticsEnabled" value="true"/>
  <property name="affinity">
    <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
      <property name="partitions" value="256"/>
    </bean>
  </property>
  <property name="expiryPolicyFactory">
    <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
      <constructor-arg>
        <bean class="javax.cache.expiry.Duration">
          <constructor-arg value="DAYS"/>
          <constructor-arg value="7"/>
        </bean>
      </constructor-arg>
    </bean>
  </property>
</bean>

We are seeing some strange behaviour. The sequence is as follows:

1. Application A writes a record to the cache.
2. Application B tries to read that record.
3. Application B is unable to find the record in the cache, so it inserts a new one, thereby wiping the data entered by Application A.

Step 3 happens very rarely: there are about 1000 such cache misses out of roughly 50M events we receive daily.

The gap between steps 1 and 2 is at least 20 ms.

We tried adding code to Application B so that, on the first cache miss, it waits for about 20 ms and then retries. This reduced the misses by a great margin, but some still remained. The fact that app B can read, after a delay, the same record it could not find at first means that app A is not failing to insert the record, that no other network factor is affecting the inserts, and that the cause is not eviction or expiry. We have also verified that the key used for the put in step 1 and the get in step 2 is the same.

What could be going on here? Please help.

score 1 · Answer 1 · answered Dec 08 '21 at 15:28

I think it's more likely that you have a race condition, with the operations interleaving like this:

1. Application B tries to read that record.
2. Application A writes a record to the cache.
3. Application B is unable to find the record in the cache, so it inserts a new one, thereby wiping the data entered by Application A.

Clients generally go to the primary partition to retrieve data, so it's incredibly unlikely that applications A and B are seeing different data.

The traditional way of dealing with this is with transactions, which would also work in Ignite.

Better might be using different APIs. For example, there's IgniteCache#getAndPutIfAbsent() and IgniteCache#putIfAbsent(), both of which do the check and write atomically without needing transactions.
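
As a rough illustration of the difference between a get-then-put and an atomic put-if-absent, here is a minimal sketch. It deliberately uses java.util.concurrent.ConcurrentHashMap as a local stand-in for the Ignite cache so that it runs without a cluster; that substitution, the class name PutIfAbsentSketch, the method names, and the sample keys and values are all assumptions made for the example. ConcurrentMap#putIfAbsent returns the previous value (or null), which is the same contract IgniteCache#getAndPutIfAbsent() documents, so readOrCreate should translate fairly directly.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Local stand-in for the Ignite cache (an assumption for this sketch, not Ignite itself).
class PutIfAbsentSketch {

    // Racy check-then-act: another writer can insert between get() and put(),
    // and the put() then overwrites that writer's value.
    static String racyReadOrCreate(ConcurrentMap<String, String> cache, String key, String fresh) {
        String existing = cache.get(key);
        if (existing == null) {
            cache.put(key, fresh);
            return fresh;
        }
        return existing;
    }

    // Atomic variant: the check and the write happen as one operation,
    // so an existing value is never overwritten.
    static String readOrCreate(ConcurrentMap<String, String> cache, String key, String fresh) {
        String previous = cache.putIfAbsent(key, fresh);
        return previous != null ? previous : fresh;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        cache.put("order-1", "written-by-A");

        // The key already exists, so A's value is kept.
        System.out.println(readOrCreate(cache, "order-1", "written-by-B"));
        // The key is absent, so B's value is stored.
        System.out.println(readOrCreate(cache, "order-2", "written-by-B"));
    }
}
```

With this shape, application B never wipes a record that application A has already written, regardless of how the two calls interleave.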

Thanks for the answer, but it's actually not a race. As I mentioned, application B's read happens ONLY after A's write. It's like an order being placed only after the payment is made, so app B cannot be invoked until A is done with its write. We have observed that there is always a gap of around 20 ms. - Shades88, Dec 09 '21 at 10:43
