Apache Ignite: Joining node has caches with data which are not presented on cluster, it could mean that they were already destroyed

Question

I have two remote servers with Apache Ignite 2.12.0. In "\config\default-config.xml" there is one data region, "Persistence_Region", with persistence enabled. I start Ignite on both servers using "\bin\ignite.bat". Since I enabled persistence for the region, I also have to manually run "./control.bat --set-state ACTIVE" to activate the cluster. If I terminate one of the nodes (close the console window, kill the process), I am not able to restart it. I see the following errors in the logs, and I believe they have the same cause:

Caused by: class org.apache.ignite.spi.IgniteSpiException: Joining node has caches with data which are not presented on cluster, it could mean that they were already destroyed, to add the node to cluster - remove directories with the caches[ignite-sys-cache]
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2108)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1206)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:474)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2210)
    at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)

Caused by: class org.apache.ignite.spi.IgniteSpiException: Node with set up BaselineTopology is not allowed to join cluster without one: 010d0fc0-e3c0-4061-b6b1-2083764a5af5
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2108)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1206)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:474)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2210)
    at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
    ... 13 more

Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (e46abd42-b188-4b40-9d2f-405358b955b6) is not compatible with BaselineTopology in the cluster. Branching history of cluster BlT ([763775804]) doesn't contain branching point hash of joining node BlT (-3589260343). Consider cleaning persistent storage of the node and adding it to the cluster again.
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2108)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1206)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:474)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2210)
    at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
    ... 13 more

The only way to avoid these errors is to clean the "\work\db" directories on both servers, which I believe is not a workable approach. I found the following article about baseline topology auto-adjustment, https://ignite.apache.org/docs/latest/clustering/baseline-topology#baseline-topology-autoadjustment, and tried to enable it, but the "auto_adjust" option still remains disabled:

PS D:\Apps\apache-ignite-2.12.0-bin\bin> ./control.bat --baseline auto_adjust enable --yes
Control utility [ver. 2.12.0#20220108-sha1:b1289f75]
2022 Copyright(C) Apache Software Foundation
User: OBevz
Time: 2022-02-22T10:57:39.669
Command [BASELINE] started
Arguments: --baseline auto_adjust enable --yes
--------------------------------------------------------------------------------
Cluster state: active
Current topology version: 39
Baseline auto adjustment disabled: softTimeout=300000

Current topology version: 39 (Coordinator: ConsistentId=1f66ee78-68b3-4fe0-9a0b-52239a169bf2, Address=AWS01-AIGNITE01.HTFS.Local/172.31.56.7, Order=38)

Baseline nodes:
    ConsistentId=1f66ee78-68b3-4fe0-9a0b-52239a169bf2, Address=AWS01-AIGNITE01.HTFS.Local/172.31.56.7, State=ONLINE, Order=38
    ConsistentId=adc6ee37-f001-4809-81fe-29a364357e5b, Address=AWS01-AIGNITE02.HTFS.Local/172.31.56.9, State=ONLINE, Order=39
--------------------------------------------------------------------------------
Number of baseline nodes: 2

Other nodes not found.
Command [BASELINE] finished with code: 0
Control utility has completed execution at: 2022-02-22T10:57:40.091
Execution time: 422 ms

What is the correct sequence of actions to start two Ignite nodes and join them into one cluster (start Ignite, activate the cluster, enable auto_adjust)? Is it possible to automate setting the cluster to the active state and enabling the "auto_adjust" option (in default-config.xml or as a flag for ignite.bat)? Sorry if I missed some important part of the Ignite docs.

Hello, any update on it? Should I provide more details? – Oleg Bevz, Mar 21 '22 at 08:55

score 0 · Answer 1 · answered Apr 10 '22 at 16:49

I wouldn't recommend using baseline auto-adjust here, or whenever you are not sure it's the right thing to do.

Generally, the following sequence of actions is guaranteed to work if not affected by something else:

1. Start two nodes and make sure they are connected (check in the logs or in control.sh).
2. Activate the cluster (--set-state active).
3. Stop one node.
4. Start it again.
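
As a rough illustration of the first two steps, here is a minimal Python sketch that only builds the control script command lines; it does not start the nodes. The helper names (control_command, first_start_commands) are made up for this example, and the script path is an assumption to adjust (control.bat on Windows). It assumes both server nodes are already running and have discovered each other.

```python
from typing import List


def control_command(script: str, *args: str) -> List[str]:
    """Build an argument list for the Ignite control script (control.sh / control.bat)."""
    return [script, *args]


def first_start_commands(script: str = "./control.sh") -> List[List[str]]:
    """Return the one-time activation commands in the order they would run.

    Assumption: both server nodes were started beforehand with
    ignite.sh / ignite.bat and are part of the same topology.
    """
    return [
        # Print the cluster state and the known nodes; both servers
        # should show up in the output before activating.
        control_command(script, "--baseline"),
        # Activate the cluster once; the server nodes present at this
        # moment become the baseline topology.
        control_command(script, "--set-state", "ACTIVE"),
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True). Activation is only needed once: after that, a restarted baseline node should rejoin without any further control script calls, as long as none of the complications listed next get in the way.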

What can (and will) complicate your life:

Manual activation after the first one (for example, stop both servers, start one, activate). Manual activation affects baseline metadata (the "branching" mentioned in one of the errors), and after some series of manipulations you may run into an error.
Baseline topology changes. Again, the "branching" happens on topology changes, and if different nodes become parts of different "branches" they won't join.
Baseline topology auto-adjust. Auto-adjust changes the baseline topology automatically, and we already established that even manual changes are complicated.
Creating and destroying caches on partial topology. Long story short, a node won't be able to join if its persistence has caches that the cluster doesn't have (e.g. when the cluster destroyed the cache while the node was offline). Try to avoid creating and destroying caches when some baseline nodes are offline.
Manual deletion of persistence files. If you delete some (but not all) of a database's files, you're essentially doing brain surgery. Deleting all files from one of the nodes is generally not that complicated, but it will require a baseline topology update afterwards.
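
To make the join checks behind the three errors in the question more concrete, here is a toy model in Python. It is not Ignite's implementation: the class names, the fields, and the order of the checks are assumptions made for illustration, and the hash values in the usage note are simply copied from the error message.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ClusterView:
    """What the running cluster knows (hypothetical, simplified)."""
    caches: Set[str] = field(default_factory=set)
    # None means the cluster has no baseline topology yet.
    branching_history: Optional[List[int]] = None


@dataclass
class NodeView:
    """What a joining node has in its persistence (hypothetical, simplified)."""
    persisted_caches: Set[str] = field(default_factory=set)
    # None means the node has never been part of a baseline topology.
    branching_point: Optional[int] = None


def join_rejection_reason(cluster: ClusterView, node: NodeView) -> Optional[str]:
    """Return why the node would be rejected, or None if it may join."""
    if node.branching_point is not None:
        if cluster.branching_history is None:
            return "node has a baseline topology, but the cluster has none"
        if node.branching_point not in cluster.branching_history:
            return "node baseline branched away from the cluster baseline"
    unknown = node.persisted_caches - cluster.caches
    if unknown:
        return "node has caches unknown to the cluster: " + ", ".join(sorted(unknown))
    return None
```

For example, with ClusterView(caches={"a"}, branching_history=[763775804]), a node whose branching point is -3589260343 is rejected with the "branched away" reason, while a node with branching point 763775804 and only cache "a" gets None and may join.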

Make no mistake - all of the actions above are perfectly fine. Experienced users do sometimes change their baseline topologies, use auto-adjust, and perform DB brain surgery. But these actions require you to switch off the auto-pilot and make sure you understand every step. When you are just starting with Ignite, it's best to avoid them and stay on auto-pilot.
