I am getting the exception below in our production environment. After it occurs, we get serialization exceptions continuously.

Heuristic completion: outcome state is mixed; nested exception is org.springframework.data.gemfire.GemfireTransactionCommitException: 
    Unexpected failure on commit of Cache local transaction; nested exception is com.gemstone.gemfire.cache.CommitIncompleteException:  
    Incomplete commit of transaction TXId: c11pcsssvc64(eisCacheServer_c11pcsssvc64.dswh.ds.adp.com:63400)<v2>:4723:70384.  Caused by the following exceptions:  
    From member: 100.99.18.94(eisCacheServer_c13pcsssvc696.dswh.ds.adp.com:77303)<v3>:23007 com.gemstone.gemfire.cache.query.IndexMaintenanceException: 
    com.gemstone.gemfire.cache.query.internal.index.IMQException, caused by com.gemstone.gemfire.cache.query.internal.index.IMQException
    at com.gemstone.gemfire.internal.cache.LocalRegion.txApplyPutPart2(LocalRegion.java:5090)
    at com.gemstone.gemfire.internal.cache.AbstractRegionMap.txApplyPut(AbstractRegionMap.java:3488)
    at com.gemstone.gemfire.internal.cache.LocalRegion.txApplyPut(LocalRegion.java:5058)
    at com.gemstone.gemfire.internal.cache.TXCommitMessage$RegionCommit.txApplyEntryOp(TXCommitMessage.java:1296)
    at com.gemstone.gemfire.internal.cache.TXCommitMessage$RegionCommit$FarSideEntryOp.process(TXCommitMessage.java:1566)
    at com.gemstone.gemfire.internal.cache.TXCommitMessage.basicProcessOps(TXCommitMessage.java:719)
    at com.gemstone.gemfire.internal.cache.TXCommitMessage.basicProcess(TXCommitMessage.java:655)
    at com.gemstone.gemfire.internal.cache.TXCommitMessage$CommitProcessMessage.basicProcess(TXCommitMessage.java:1737)
    at com.gemstone.gemfire.internal.cache.TXCommitMessage$CommitProcessForLockIdMessage.process(TXCommitMessage.java:1657)
    at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:305)
    at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:368)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
    at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
    at java.lang.Thread.run(Thread.java:745).,

After this, the data in the region gets corrupted, and we get a serialization exception whenever we hit any API that reads data from the corrupted region:

An IOException was thrown while deserializing; nested exception is com.gemstone.gemfire.SerializationException: 
An IOException was thrown while deserializing ,Cause=org.springframework.dao.DataAccessResourceFailureException: 
An IOException was thrown while deserializing; nested exception is com.gemstone.gemfire.SerializationException: 
An IOException was thrown while deserializing

Prod setup: 2 data nodes, an embedded Tomcat GemFire node, 2 locators, and 8 regions. The GemFire version is 7.0.2 (we cannot upgrade to the latest version).

After analysing, we found that this might be happening during data synchronization between the 2 data nodes. The problem is that we couldn't reproduce this issue in any lower environment; it only happens in prod, intermittently. Does anyone have any idea about this issue?

Keyur Potdar
  • Quick question about your configuration... is your Region `Index Maintenance` set to synchronous or asynchronous? (Regardless of how you set this using SDG, i.e. whether in XML or JavaConfig using the RegionAttributesFactoryBean, it translates to the GemFire RegionFactory.setIndexMaintenanceSynchronous(:boolean) call (http://gemfire-92-javadocs.docs.pivotal.io/org/apache/geode/cache/RegionFactory.html#setIndexMaintenanceSynchronous-boolean-)). Also, what type of Index are you using? Note, I am not saying either of these are the problem, just collecting additional information at this point. – John Blum Jan 31 '18 at 19:26
  • Unfortunately, the Javadoc for the "IndexMaintenanceQueryException", or `o.a.g.cache.query.internal.index.IMQException` (https://github.com/apache/geode/blob/rel/v1.3.0/geode-core/src/main/java/org/apache/geode/cache/query/internal/index/IMQException.java) does not shed any light into when or why this Exception would be thrown. I would have to find all "usages" of this Exception in the Apache Geode codebase, and specifically at `.LocalRegion.txApplyPutPart2(LocalRegion.java:5090)` to better understand when/why this Exception is occurring. It appears to be happening during a TX though, huh? – John Blum Jan 31 '18 at 19:36
  • Unfortunately, I don't have access to, nor a copy of, the old GemFire 7.0.2 codebase on my machine presently. I will follow up internally. – John Blum Jan 31 '18 at 19:41
  • @John Blum: Regions are set to update indexes synchronously, and the index type is functional. You cannot access the codebase of this version, as it's a licensed version. We have the jar and decompiled it to check the source code. We also raised this with Pivotal, but they declined to comment, saying they do not support this version any more. – Ashish Repal Feb 02 '18 at 05:51
  • Well, I used to be on the GemFire engineering team and I had the 7.0.2 codebase on my machine. But, as you know, 7.0.2 is quite dated and reached EOL sometime ago. Since that time, I have deleted the codebase and later versions of GF are under new source control (using Git). GF 7.0.2 was managed in an SVN repo that I no longer have access to. – John Blum Feb 02 '18 at 06:24
  • When I spoke to the engineers about this internally, they mentioned... "1.) They have heterogenous data where the field is sometimes a number and sometimes a string which causes some comparison issue. 2.) Low probability/unlikely - low memory during indexing 3.) We've fixed many bugs since 7.0.2 - none appear tx related but maybe it's race condition related? However, if the data is corrupt, there isn't anything the index can do about it... It really sounds more like a data corruption issue than an indexing one." – John Blum Feb 02 '18 at 06:25
  • Additionally, "they are syncing data, I wonder if it's through wan or what they mean by that. Another possible thing might be incompatible session objects- if they are indexing those. Tomcat hopefully is the same across all their systems". – John Blum Feb 02 '18 at 06:26
  • My advice, upgrade! You can (possibly) try out a newer version of Pivotal GemFire by just using the OSS version, Apache Geode (http://geode.apache.org/). Since GemFire 9.0, GF has been based on Apache Geode. If you can setup a test environment with data from production, perhaps you can determine whether the problem still exists. Good luck. – John Blum Feb 02 '18 at 06:28
  • @John Blum: Thanks a lot for your comments. So far we haven't figured out the cause of this exception, and we are also unable to reproduce it. The order of the exceptions is: first we get the IndexMaintenanceException, and after that we start getting serialization exceptions. Maybe the IndexMaintenanceException is causing the region data to be corrupted. The stack trace never contains a single class name from our code base. Anyway, if this doesn't get solved, we will have to upgrade to a newer version of GF. – Ashish Repal Feb 03 '18 at 11:18
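
One hypothesis quoted in the comments above is heterogeneous data, where an indexed field is sometimes a number and sometimes a string. The Java sketch below is only an analogy for that idea, not GemFire's index code: it uses a plain `TreeMap` as a stand-in for a sorted index and shows that natural ordering cannot compare an `Integer` with a `String`. The class name `MixedTypeIndexSketch` and the methods `tryIndex` and `distinctTypes` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; this is NOT GemFire's index implementation.
public class MixedTypeIndexSketch {

    // Stand-in for a sorted index: indexed field value -> keys of matching entries.
    private final TreeMap<Object, List<String>> index = new TreeMap<>();

    // Tries to index one entry; returns false if the value cannot be ordered
    // against the values already present in the index.
    public boolean tryIndex(Object fieldValue, String entryKey) {
        try {
            index.computeIfAbsent(fieldValue, v -> new ArrayList<>()).add(entryKey);
            return true;
        } catch (ClassCastException e) {
            // Natural ordering cannot compare, for example, a String with an Integer.
            return false;
        }
    }

    // Number of distinct field values currently indexed.
    public int size() {
        return index.size();
    }

    // Reports the distinct runtime types seen for a field across sample values.
    public static Set<String> distinctTypes(List<?> fieldValues) {
        Set<String> types = new LinkedHashSet<>();
        for (Object v : fieldValues) {
            types.add(v == null ? "null" : v.getClass().getSimpleName());
        }
        return types;
    }

    public static void main(String[] args) {
        MixedTypeIndexSketch sketch = new MixedTypeIndexSketch();
        System.out.println("Integer 42 indexed: " + sketch.tryIndex(42, "k1"));
        System.out.println("Integer 7 indexed:  " + sketch.tryIndex(7, "k2"));
        System.out.println("String \"42\" indexed: " + sketch.tryIndex("42", "k3"));
        System.out.println("Types seen: " + distinctTypes(Arrays.asList(42, 7, "42")));
    }
}
```

If this hypothesis applies, running a check like `distinctTypes` over the indexed field of a copy of the production data might show whether more than one runtime type is present. Whether GemFire 7.0.2 fails in the same way for mixed types is an assumption that the thread does not confirm.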

0 Answers