
We have a 3-node GridGain server cluster and 3 client nodes deployed in GCP Kubernetes Engine. Native persistence is enabled on the cluster, the shutdown policy is set via <property name="shutdownPolicy" value="GRACEFUL"/>, and there is one backup for each cache. After an automatic cluster restart we get partition loss and have to reset the lost partitions by executing control commands.

Can you provide a proper solution for this? We have around 60 GB of persistent data.
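
The manual reset we run after each incident is roughly the following (a minimal sketch; the cache name is just an example and client-config.xml stands in for our client configuration - the equivalent control command is control.sh --cache reset_lost_partitions):

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ResetLostPartitions {
    public static void main(String[] args) {
        // Start a node that connects to the running cluster (placeholder config path).
        Ignite ignite = Ignition.start("client-config.xml");

        // Acknowledge the loss and bring the cache back to a usable state.
        // Equivalent to: control.sh --cache reset_lost_partitions <cacheName>
        ignite.resetLostPartitions(Collections.singleton("patient_view_reserved_order_mc_85"));

        ignite.close();
    }
}
```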

Nuwan Sameera
  • Can you please provide the error message? Which cache has LOST partitions? – Andrei Aleksandrov Sep 28 '21 at 11:29
  • javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=patient_view_reserved_order_mc_85, part=0] This is the latest one. After resetting the partitions the applications work properly. Sometimes partitions of system caches are lost as well. Every time we need to reset them manually, which is very hard, so we need an automated solution. – Nuwan Sameera Sep 28 '21 at 11:35
  • Can you please provide the configuration of patient_view_reserved_order_mc_85? Also, how did you stop the k8s instances? – Andrei Aleksandrov Sep 28 '21 at 11:47
  • CacheConfiguration cacheConfig = new CacheConfiguration<>(); cacheConfig.setName("patient_view_reserved_order_mc_85"); cacheConfig.setBackups(1); cacheConfig.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); cacheConfig.setStatisticsEnabled(true); cacheConfig.setGroupName("CACHE_GROUP"); cacheConfig.setIndexedTypes(String.class, UnconfirmedEvent.class); IgniteCache cache = ignite.getOrCreateCache(cacheConfig); Stopping the Kubernetes instances is not in our control; it happens due to Google operations. – Nuwan Sameera Sep 28 '21 at 11:53
  • Could you share/upload the logs prior to the restart? Also what version are you running on? – Alexandr Shapkin Sep 28 '21 at 12:09
  • It looks like GKE is just killing the pod and isn't waiting for the GridGain process to stop. With the policy mentioned, GridGain will use a dedicated shutdown hook, but as I said, GKE is not waiting for it. This is why you lost your partitions. BTW, what is your version of GridGain? – Andrei Aleksandrov Sep 28 '21 at 12:12
  • Before this incident we used 8.8.4, and yesterday we updated to 8.8.8, the latest version. With native persistence all data is safe, right? If there is no data loss, is there a way to recover these lost partitions automatically? – Nuwan Sameera Sep 28 '21 at 12:15

1 Answer


<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:

  1. The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs: the default cache configuration in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work. (See the configuration sketch after this list.)

  2. There must be more than one baseline node (only baseline nodes store data!). See the baseline topology documentation.

  3. You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.

    • If you stop with a kill of the process, make sure it sends SIGTERM and not SIGKILL, because the latter always kills the process immediately.
    • If you stop with Ignite.close() this should just work
    • If you stop with Java's System.exit() it'll work, but Runtime.getRuntime().halt() won't (because halt() bypasses shutdown hooks and is not graceful).
    • If you use orchestrators such as Kubernetes, you need to make sure they stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish their graceful shutdown instead of killing them (see the manifest sketch after this list).
    • If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
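
As an illustration of points 1 and 3, a programmatic equivalent of the XML configuration could look roughly like this (a sketch only: the cache name and the persistence settings are examples, not a drop-in config):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.ShutdownPolicy;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class GracefulServerNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Same as <property name="shutdownPolicy" value="GRACEFUL"/>:
        // the node waits until its data is backed up elsewhere before leaving.
        cfg.setShutdownPolicy(ShutdownPolicy.GRACEFUL);

        // Native persistence, as in the cluster from the question.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        cfg.setDataStorageConfiguration(storage);

        // Point 1: PARTITIONED cache with at least one backup (example cache).
        CacheConfiguration<String, Object> cacheCfg =
            new CacheConfiguration<>("patient_view_reserved_order_mc_85");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(1);
        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);

        // Point 3: Ignite registers its own JVM shutdown hook, so a SIGTERM,
        // System.exit() or ignite.close() triggers the graceful shutdown.
    }
}
```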

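For the Kubernetes bullet above, a sketch of where terminationGracePeriodSeconds goes (all names, the image and the value itself are placeholders, not a recommendation):

```yaml
# Sketch only: names, image and the grace period value are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gridgain-cluster
spec:
  serviceName: gridgain-cluster
  replicas: 3
  selector:
    matchLabels:
      app: gridgain
  template:
    metadata:
      labels:
        app: gridgain
    spec:
      # Kubernetes sends SIGTERM first and waits this long before SIGKILL.
      # Make it long enough for the GRACEFUL shutdown (rebalancing backups) to finish.
      terminationGracePeriodSeconds: 1800
      containers:
        - name: gridgain-node
          image: gridgain/community:8.8.8   # placeholder image/tag
```
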
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see whether the "Invoking shutdown hook" message is there. If it isn't, your shutdown hook isn't getting called, and the problem is one of the points under 3. Otherwise, there should be other log messages explaining the situation.
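
For the GKE setup in the question, a quick way to check is something like this (the pod name is a placeholder, and IGNITE_QUIET=false has to be set in the server pod's environment):

```sh
# Placeholder pod name; look for the graceful shutdown hook message quoted above.
kubectl logs gridgain-cluster-0 | grep "Invoking shutdown hook"
```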

Stanislav Lukyanov
  • Thank you very much for the detailed answer. Is there a proper way to automate recovery of lost partitions? – Nuwan Sameera Sep 30 '21 at 09:45
  • Right now, not really. The problem is that after a partition has been lost, Ignite doesn't really know if you brought back all of the data you previously had - it is possible that some of the data is lost forever (say, because of a disk failure), and it should be manually acknowledged. That said, automated partition loss reset might be the most common request I hear from Ignite users, so it would be good to consider at least an opt-in automation, even if it has potentially dangerous corner cases. – Stanislav Lukyanov Sep 30 '21 at 09:59
  • I would also add that you should not need to automatically reset lost partitions in normal operation, because in normal operation you shouldn't get lost partitions. If the conditions I listed are met, you should never run into this. Partition loss should only happen in a properly configured environment when something goes really wrong - nodes crashing, etc. I agree that even in that case automation would still be nice sometimes, but this should be a rarity. – Stanislav Lukyanov Sep 30 '21 at 10:14
  • OK, thank you very much. I got a clear idea. – Nuwan Sameera Sep 30 '21 at 10:27
  • I've created the following ticket with an approach which might not be 100% bulletproof but is safe enough in practice: https://issues.apache.org/jira/browse/IGNITE-15653. I would be much more confident in this than in whatever scripts people currently write to call `resetLostPartitions()` externally. Feel free to contribute! :) – Stanislav Lukyanov Sep 30 '21 at 10:49
  • Does this mean that if (a) you get a non-graceful shutdown of just a single node - e.g. a hardware failure - and (b) you're running a partitioned cluster with multiple server nodes and backups >= 2, there's no way for the cluster to come back up again automatically without a partition loss? – Open Door Logistics Jun 10 '22 at 06:19
  • No, that's not correct. In your scenario there is no partition loss (a copy of every partition is still present after one server fails), the cluster will continue to serve all requests, and the failed node will join back normally after it's restarted. – Stanislav Lukyanov Jun 10 '22 at 10:06