We have an 8-node Ignite cluster in production. Below is the configuration for one of the caches.
<bean id="cache-template-bean" abstract="true"
      class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="inputDataCacheTemplate*"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <property name="backups" value="1"/>
    <property name="atomicityMode" value="ATOMIC"/>
    <property name="dataRegionName" value="dr.prod.input"/>
    <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
    <property name="writeSynchronizationMode" value="PRIMARY_SYNC"/>
    <property name="statisticsEnabled" value="true"/>
    <property name="affinity">
        <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
            <property name="partitions" value="256"/>
        </bean>
    </property>
    <property name="expiryPolicyFactory">
        <bean class="javax.cache.expiry.CreatedExpiryPolicy" factory-method="factoryOf">
            <constructor-arg>
                <bean class="javax.cache.expiry.Duration">
                    <constructor-arg value="DAYS"/>
                    <constructor-arg value="7"/>
                </bean>
            </constructor-arg>
        </bean>
    </property>
</bean>
We are seeing some strange behaviour. The sequence is as follows:
1. Application A writes a record to the cache.
2. Application B tries to read that record.
3. Application B is unable to find the record in the cache, so it inserts a new one, thereby wiping out the data written by Application A.
Step 3 happens very rarely: there are 1000 such cache misses out of about 50M events we receive daily.
The gap between steps 1 and 2 is at least 20 ms.
We tried adding code to Application B so that, on the first cache miss, it waits about 20 ms and then reads again. This reduced the misses by a large margin, but some still remained. The fact that Application B can read, after a delay, the same record it initially could not find suggests that Application A is not failing to insert the record, that no other network factor is preventing the insert, and that the miss is not caused by eviction or expiry. We have also verified that the key used for the put in step 1 and the get in step 2 is the same.
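For reference, the workaround in Application B looks roughly like the sketch below. It is a simplified illustration, not our production code: `ReadWithRetry` and `getWithOneRetry` are made-up names, and a `ConcurrentHashMap` stands in for the Ignite cache so the snippet runs on its own (in the real application the reader would be the cache's `get`).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ReadWithRetry {

    /**
     * Reads a key; on a miss, waits once and reads again.
     * Returns null only if both attempts miss.
     */
    static <K, V> V getWithOneRetry(Function<K, V> reader, K key, long delayMillis) {
        V value = reader.apply(key);
        if (value != null) {
            return value;
        }
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            // Keep the interrupt flag and go straight to the second read.
            Thread.currentThread().interrupt();
        }
        return reader.apply(key);
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the Ignite cache (assumption for this sketch only).
        Map<String, String> cache = new ConcurrentHashMap<>();

        // Simulates Application A's write becoming visible shortly after B starts reading.
        Thread appA = new Thread(() -> {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            cache.put("event-1", "written by A");
        });
        appA.start();

        // Application B: step 2 (read), with the 20 ms wait on the first miss.
        String value = getWithOneRetry(cache::get, "event-1", 20);
        appA.join();

        if (value == null) {
            // Step 3: still a miss, so B inserts its own record.
            cache.put("event-1", "written by B");
        }
        System.out.println(cache.get("event-1"));
    }
}
```

With this in place, a record is overwritten only when both reads miss, which is the remaining case we cannot explain.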
What could be going on here? Please help.