Tibco-Ems Failover Issue

Question

I have two TIBCO EMS servers running in a fault-tolerant setup. If the active server becomes unavailable, the standby server takes over as expected.

However, every now and then the new active server logs a strange error: "reconnect failed: connection unknown for id=XY"

This only happens when my client has an open connection. I would expect that connection to switch to the new active server as well. As I said, sometimes it works and sometimes it doesn't.

When I register for the EMS-Exceptions in my client, I get the error: "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host."

Stack trace:
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at TIBCO.EMS.LinkTcp._readEx(Byte[] buffer, Int32 offset, Int32 size)
   at TIBCO.EMS.LinkTcp._ReadWireMsg()
   at TIBCO.EMS.LinkTcp.LinkReader.Work()

I'm out of ideas at this point. Can anybody help me understand what exactly the problem is? Thanks in advance.

UPDATE (late): Even though I get the "reconnect failed" error, failover works as expected and the second server takes over.

score 4 · Accepted Answer · answered Oct 29 '14 at 03:44

Here's what's going on... An EMS server keeps track of the active client connections that it has, and keeps information about these connections in the meta.db store file. Upon fault-tolerant failover the new primary EMS instance is able to recover the client connections when the clients reconnect by matching information that the client provides with information stored in the meta.db store file.

There is a point in time when EMS cleans up client connections that have not reconnected. That time is governed by the ft_reconnect_timeout parameter in the tibemsd.conf configuration file. The default setting for this parameter is 60 seconds. Depending on your logging settings, when EMS cleans up "expired" connections you may see a message in your EMS logs indicating that it has "purged" a client connection.

There are times when the client eventually does attempt to reconnect after the EMS server has already purged the "expired" connection. This can happen when a network partition prevents the client from reconnecting to the EMS server until after the server has cleaned up the connection. When this happens you will see the "Reconnect failed: connection unknown..." message.

When a client is unable to "re-connect" due to this error, it simply attempts a connection as a "new" connection. This works and it is able to continue processing.
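
The sequence described above can be sketched as a small toy model. This is not TIBCO code: the class name, the `pending` set, and the "purged" log text are made up for illustration, and only the ft_reconnect_timeout parameter and the "reconnect failed" message come from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEmsServer:
    """Toy model of a newly active server (hypothetical, not the EMS implementation)."""
    ft_reconnect_timeout: float = 60.0           # seconds, as in tibemsd.conf
    pending: set = field(default_factory=set)    # connection ids recovered from the store
    activated_at: float = 0.0                    # time at which the server became active
    log: list = field(default_factory=list)

    def purge_expired(self, now: float) -> None:
        # After the timeout, forget every connection that has not reconnected yet.
        if now - self.activated_at >= self.ft_reconnect_timeout and self.pending:
            for conn_id in sorted(self.pending):
                self.log.append(f"purged connection id={conn_id}")  # hypothetical text
            self.pending.clear()

    def reconnect(self, conn_id: int, now: float) -> str:
        self.purge_expired(now)
        if conn_id in self.pending:
            self.pending.discard(conn_id)
            return "recovered"                   # existing connection state is reused
        # Unknown id: log the failure; the client then connects as a new connection.
        self.log.append(f"reconnect failed: connection unknown for id={conn_id}")
        return "new"


server = ToyEmsServer(ft_reconnect_timeout=60.0, pending={5, 6})
print(server.reconnect(5, now=10.0))   # reconnects within the timeout
print(server.reconnect(6, now=75.0))   # reconnects after the purge
print(server.log)
```

Starting the model with an empty `pending` set corresponds to the case from the other answer where the store is not shared: the new active server knows no connections, so every reconnect fails straight away.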

answered Oct 29 '14 at 03:44

nochum

sounds good in theory...but it does not work in my case. I always get the "Reconnect failed" message. So do you have any idea what I'm doing wrong? – DanielG Oct 29 '14 at 07:17
The only way you will get a "Reconnect failed" message is when a client attempts a re-connect passing reconnection parameters that the server has already cleaned up. That is almost always immediately followed by a successful connection that does not typically get logged unless you log every successful connection. If this is an issue for you then you can set the ft_reconnect_timeout parameter to a value greater than the default 60 seconds (the value is specified in seconds). Adding +CONNECT to your log_trace parameter should help you see more detail as to what is happening. – nochum Oct 29 '14 at 09:30
OK, thanks again, I will have a look. But what should I do after a successful reconnect? Should my connection work as before, without my doing anything on the client side? Maybe that's where I'm making a mistake. After the failover server has taken over, my connection is no longer valid! So do I need to create a new connection? I thought I didn't need to care about that. – DanielG Oct 29 '14 at 10:28
No, you really don't need to worry about it. As long as your clients are configured with a fault-tolerant connection url (e.g. tcp://host1:7222,tcp://host2:7222) the client failover and reconnection should be transparent. That said, there are client-level settings that affect how many times and how long the client will attempt to reconnect (alternating between the server urls in the ft connect list) before giving up. These are "ReconnAttemptCount", "ReconnAttemptDelay" (the interval between attempts), and "ReconnAttemptTimeout". – nochum Oct 29 '14 at 13:15
OK, then I have no idea what's going wrong. I set it up with a fault-tolerant connection URL and tried a lot of different parameter settings. It just does not work. – DanielG Oct 30 '14 at 07:12
Did you set your servers up to be fault-tolerant? The client settings by themselves won't do anything if the servers are not configured properly... – nochum Oct 30 '14 at 07:36
Yeah, the servers are also set up to be fault-tolerant. This is copied from another of my comments. Those are the configurations I did, as described in the user guide: - server: set this parameter to the same server name in the configuration files of both the primary server and the secondary server. - ft_active: in the configuration file of the primary server, set this parameter to the URL of the secondary server; in the configuration file of the secondary server, set it to the URL of the active server. – DanielG Nov 03 '14 at 15:10
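
The comments above mention the client-side settings ReconnAttemptCount, ReconnAttemptDelay, and ReconnAttemptTimeout alongside the server-side ft_reconnect_timeout. A rough way to compare the two sides is sketched below. The function names are hypothetical, the formula is only an approximate upper bound, and the assumption that the delay and timeout are given in milliseconds should be checked against the client documentation.

```python
def worst_case_retry_window_s(attempt_count: int,
                              attempt_delay_ms: int,
                              attempt_timeout_ms: int) -> float:
    """Rough upper bound, in seconds, on how long a client keeps retrying."""
    # Assumption: each attempt may wait up to the per-attempt timeout,
    # then pauses for the delay before the next attempt.
    return attempt_count * (attempt_delay_ms + attempt_timeout_ms) / 1000.0


def may_reconnect_after_purge(retry_window_s: float,
                              ft_reconnect_timeout_s: float) -> bool:
    """True if some attempts could arrive after the server purged the connection."""
    return retry_window_s > ft_reconnect_timeout_s


# Example values only; they are not defaults taken from the product.
window = worst_case_retry_window_s(attempt_count=100,
                                   attempt_delay_ms=1000,
                                   attempt_timeout_ms=1000)
print(window, may_reconnect_after_purge(window, ft_reconnect_timeout_s=60))
```

If the retry window is longer than ft_reconnect_timeout, late attempts can produce the "Reconnect failed" message and then continue as new connections, which matches the behaviour described in the answer.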

score 1 · Answer 2 · answered Mar 13 '18 at 12:51

We had the same issue. Our mistake was that the store (EMS db) wasn't shared between the active and the standby node, so when the active EMS failed, the new active EMS wasn't able to recover connections and messages.
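
A quick consistency check of the two tibemsd.conf files can catch this kind of mistake. The sketch below assumes the files use plain `name = value` lines with `#` comments, and all host names and paths in it are hypothetical. It only compares the settings named in this thread (server, ft_active, store). Identical store paths on two different machines still need to point at genuinely shared storage, which this check cannot verify.

```python
def parse_conf(text: str) -> dict:
    """Parse 'name = value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def ft_problems(primary: dict, secondary: dict) -> list:
    """Return a list of likely fault-tolerance misconfigurations."""
    problems = []
    if not primary.get("server") or primary.get("server") != secondary.get("server"):
        problems.append("server names are missing or differ")
    for label, conf in (("primary", primary), ("secondary", secondary)):
        if not conf.get("ft_active"):
            problems.append(f"{label}: ft_active is not set")
    if primary.get("store") != secondary.get("store"):
        problems.append("store locations differ, so state is probably not shared")
    return problems


# Hypothetical example files.
PRIMARY = """
server = EMS-SERVER
ft_active = tcp://host2:7222
store = /shared/ems/datastore
"""
SECONDARY = """
server = EMS-SERVER
ft_active = tcp://host1:7222
store = /local/ems/datastore   # not the shared location
"""

print(ft_problems(parse_conf(PRIMARY), parse_conf(SECONDARY)))
```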

answered Mar 13 '18 at 12:51

Aymeric Duché

score 0 · Answer 3 · answered Oct 27 '14 at 12:37

This happens when you are using client-side FT without server-level FT. At least, that was the underlying cause when we faced this issue.

If your clients use the FT URL server1:port,server2:port but the servers aren't truly in FT mode, you will hit this issue when the connection switches between the two servers: the connection moves to a different server, but the existing connection on the failed server can't be destroyed or taken over by the new active server because the FT setup is inconsistent.

In a true FT setup on the server side, the active server automatically assumes these connections to be active and continues to serve them. Please verify the server level configuration.

For us, providing the server level FT helped solve this issue.

answered Oct 27 '14 at 12:37

aadi

how do you do that? providing the server level FT? – DanielG Oct 27 '14 at 15:07
I just checked the user guide, and I did exactly what is described in "Configuring Fault-Tolerant Servers". When the servers start, Server A tells me that it is in 'active' state, and Server B tells me that it is in standby state for Server A. So I think it is all set up as server-level FT!? I set the server and ft_active parameters in the tibemsd.conf files as described in the user guide. What am I missing? – DanielG Oct 27 '14 at 15:12
Those are the configurations I did, as described in the user guide: - server: set this parameter to the same server name in the configuration files of both the primary server and the secondary server. - ft_active: in the configuration file of the primary server, set this parameter to the URL of the secondary server; in the configuration file of the secondary server, set it to the URL of the active server. – DanielG Oct 27 '14 at 15:20
Cool. That's exactly how you would do it. Now, do you see any other error messages in the log file apart from "reconnect failed: connection unknown for id= XY"? Also, to see all the errors, start the server from the command prompt on Windows rather than from the services panel. If you are in a Linux environment, start the server from the shell with ./tibemsd64 -config tibemsdconfigfilelocation and post any other errors you see. – aadi Oct 28 '14 at 05:54
Also, this is a typical case in which FT was configured while the service instances in the domain were running. To fix this issue, follow the steps below. 1. Stop all the instances that are using EMS. These typically include your BW processes, Java programs if any, TIBCO Administrator, and TIBCO Hawk. 2. Switch to the secondary by killing the primary server and observe the logs. 3. Bring up the primary and kill the secondary. Observe the logs. Your error should be gone by this time. 4. Finally, bring up all the instances of BW/Admin/Hawk and everything else. Your environment is now fully EMS FT proof. – aadi Oct 28 '14 at 07:58
First, thanks, your help is really appreciated :) But unfortunately, I tried all of your steps and it did not help. I should have mentioned that I'm working with the .NET libraries. My whole system only contains the 2 servers and one client. If no connection is established, the servers switch as expected when I kill the main server. But when there is a connection from the client, I get the error. I also checked the logs, but there is no other error. – DanielG Oct 28 '14 at 10:29
Here are the log messages:
2014-10-28 11:28:01.309 Server is in standby state for 'tcp://7222'.
2014-10-28 11:28:44.107 Connection to active server at 'tcp://7222' has been lost.
2014-10-28 11:28:44.107 Server activating on failure of 'tcp://7222'.
2014-10-28 11:28:44.107 Server rereading configuration.
2014-10-28 11:28:44.107 Recovering state, please wait.
2014-10-28 11:28:44.107 Server is now active.
2014-10-28 11:28:44.637 [adming@MyMachine]: reconnect failed: connection unknown for id=5
– DanielG Oct 28 '14 at 10:36
