I get socket error 110 (Connection timed out) when a MongoDB database (version 3.0.5) is replicated from the primary server to a slave, specifically at the moment the replication of that database is committed (the slave's log is below). My guess is that the database is large and the send operation that commits it takes too long.

How can I specify a different socket timeout for the MongoDB server? If that's not possible, is there another way to repair the replication?

I found such an option only for the MongoDB client (the connection string option socketTimeoutMS), but it doesn't help with the server.
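For reference, this is roughly what the client-side option looks like (a sketch; the host is a placeholder, and as noted it only affects the client's sockets, not the server's):

```shell
# Client-side only: raise the driver/shell socket timeout to 5 minutes
# (300000 ms). <host> is a placeholder; this does NOT change the server.
mongo "mongodb://<host>:27017/?socketTimeoutMS=300000"
```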

2016-04-26T13:36:34.693+0100 I INDEX    [rsSync] done building bottom layer, going to commit
2016-04-26T13:36:34.693+0100 I INDEX    [rsSync] build index done.  scanned 30980334 total records. 4072 secs
2016-04-26T13:36:34.772+0100 I REPL     [rsSync] initial sync cloning db: {skipped db name}
2016-04-26T13:36:34.823+0100 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out {skipped ip}:27017
2016-04-26T13:36:34.828+0100 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [{skipped ip}:27017]
2016-04-26T13:36:34.828+0100 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining

Update. I was asked in the comments for the output of rs.status():

{
        "set" : "<skippedsetname>",
        "date" : ISODate("2016-05-04T15:35:06.717Z"),
        "myState" : 5,
        "syncingTo" : "<skipped domain name of other server>:27017",
        "members" : [
                {
                        "_id" : 0,
                        "name" : "<skipped domain name of this server>:27017",
                        "health" : 1,
                        "state" : 5,
                        "stateStr" : "STARTUP2",
                        "uptime" : 29,
                        "optime" : Timestamp(0, 0),
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "syncingTo" : "<skipped domain name of other server>:27017",
                        "configVersion" : 9,
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "10.0.1.7:27017",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 26,
                        "lastHeartbeat" : ISODate("2016-05-04T15:35:05.859Z"),
                        "lastHeartbeatRecv" : ISODate("2016-05-04T15:35:06.347Z"),
                        "pingMs" : 3,
                        "configVersion" : 9
                },
                {
                        "_id" : 3,
                        "name" : "<skipped domain name of other server>:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 26,
                        "optime" : Timestamp(1462376105, 196),
                        "optimeDate" : ISODate("2016-05-04T15:35:05Z"),
                        "lastHeartbeat" : ISODate("2016-05-04T15:35:05.859Z"),
                        "lastHeartbeatRecv" : ISODate("2016-05-04T15:35:06.086Z"),
                        "pingMs" : 4,
                        "electionTime" : Timestamp(1461688501, 1),
                        "electionDate" : ISODate("2016-04-26T16:35:01Z"),
                        "configVersion" : 9
                }
        ],
        "ok" : 1
}

Update. I should have mentioned, but didn't, that the hosting used is Azure. The answer and explanation are easily found by googling "azure mongodb connection timeout". My bad.
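For anyone hitting the same thing: Azure's load balancer silently drops TCP connections that stay idle for a few minutes, and the commonly recommended fix is to lower the kernel's TCP keepalive interval on the MongoDB hosts so connections never look idle that long (a sketch; the 120-second value is the usual recommendation, verify it against the MongoDB and Azure documentation for your setup):

```shell
# Send TCP keepalive probes after 120 s of idleness instead of the
# Linux default of 7200 s, so Azure's idle-connection timeout never fires.
echo "net.ipv4.tcp_keepalive_time = 120" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p   # apply without a reboot
```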

  • You're asking for a solution based on your guess that "the database is big and the send operation to commit it takes too much time." I suspect that's not the case and that your problem is a network issue. Could you `telnet {skipped ip} 27017`? – Héctor Valverde May 03 '16 at 09:14
  • @HéctorValverdePareja Yes, I tried, and I can telnet to that IP and port. Moreover, as I said, the error happens after the replica instance has downloaded all the data of the database, as seen in the log. So it is strange that it could connect to the socket to download the data but afterwards could not connect to the same port to commit the operation. – boqapt May 04 '16 at 15:31
  • And what's the output of `rs.status()`? – Héctor Valverde May 04 '16 at 15:33
  • @HéctorValverdePareja it was added to the question – boqapt May 04 '16 at 15:40
  • I see that one of your nodes never finishes starting, the one with `"stateStr" : "STARTUP2"`. 1) Are you aware of that? 2) Could you check whether there is any network issue between the PRIMARY and the rest of the nodes? – Héctor Valverde May 04 '16 at 15:53
  • @HéctorValverdePareja Yes, I know. It is the server on which the mentioned problem happened. The replica set existed for a long period until synchronization was lost. After that I dropped the DB on the slave and restarted it, and it started receiving data again. – boqapt May 04 '16 at 16:06
  • I've dared to post an answer. – Héctor Valverde May 04 '16 at 16:11

2 Answers


Your assumption of the cause of the error is wrong.

  • Connection timed out: During the attempt to establish the TCP connection, no response came from the other side within a given time limit.

In other words, it is an issue in establishing the socket, not a question of how long the replication of the database takes.

Tuning the TCP timeout is a system setting, not something you do per application. On Linux, the settings are in the system-wide /etc/sysctl.conf, and you can play around with net.ipv4.tcp_syn_retries. However, you almost never change the timeout for establishing a socket (for any program, including mongo); the few times I have changed it, it was to make it shorter so the error surfaces faster, not to increase it. Increasing it is unlikely to be the right solution in any earthly application.
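To make that concrete, this is roughly what inspecting and (temporarily) changing that knob looks like (a sketch; as said above, you almost certainly should not raise it):

```shell
# Each SYN retry roughly doubles the wait; the default value gives on
# the order of two minutes before "Connection timed out" (errno 110).
sysctl net.ipv4.tcp_syn_retries

# Lower it so a dead peer fails fast (takes effect immediately;
# lost on reboot unless persisted in /etc/sysctl.conf):
sudo sysctl -w net.ipv4.tcp_syn_retries=3
```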

The problem is either a configuration problem (for example, bad IP addresses in your setup) or a networking problem, such as a bad firewall, a bad routing table, or a network switch that sometimes stops working for 60-120 seconds at a time.
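A quick way to rule networking in or out is to check connectivity in both directions between every pair of members (the hostname below is a placeholder for the redacted ones in the question):

```shell
# Run on EACH member, against EACH other member; a connection that
# works A->B can still fail B->A (firewall rules, routing, NAT).
nc -vz -w 5 other-member.example.com 27017
```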

Soren
  • Yep, I misunderstood the error code. But the error happens after the replica instance has downloaded all the data of the database, as seen in the log. It seems it could connect to the socket to download the data but afterwards could not connect to the same port to commit the operation. Also, that replica set existed for a long period until synchronization was lost. After that I dropped the DB on the slave, restarted it, and it started receiving data from the primary but failed to commit. So I doubt it's a networking or config problem. It seems the problem is connected to mongodb, not the network or the OS – boqapt May 04 '16 at 16:04
  • Just because you can connect from A to B does not imply that you can connect from B to A; the direction of connections changes depending on who is elected primary/leader node. Check whether `rs.status()` is the same on every node, and check whether you can nc (or telnet) to each of the other nodes' ports from every mongo machine. The error code clearly suggests a networking issue. – Soren May 04 '16 at 17:31

There are probably some files locking the filesystem on your slave. If I were you, I'd remove the node from the replica set, then wipe all files under dbPath, check that the mongo user can access that directory, and restart mongod. Once it's running, add it back to the RS and wait for it. See also: https://docs.mongodb.org/manual/tutorial/recover-data-following-unexpected-shutdown/#mongod-lock
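Sketched out, the resync procedure above looks roughly like this (the dbPath, hostname, and service name are placeholders for your setup; the `rs.*` commands are run in a mongo shell connected to the PRIMARY):

```shell
# On the PRIMARY (mongo shell): take the broken member out of the set.
#   rs.remove("slave.example.com:27017")

# On the slave: stop mongod, wipe its data directory, restart.
sudo service mongod stop
sudo rm -rf /var/lib/mongodb/*   # placeholder dbPath -- double-check yours!
sudo service mongod start

# Back on the PRIMARY: add the member again and watch the initial sync.
#   rs.add("slave.example.com:27017")
#   rs.status()
```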

Héctor Valverde
  • I understand how wiping the files under dbPath relates to locked files on the slave (I tried that). But how do you justify removing the slave from the replica set and adding it back again? – boqapt May 05 '16 at 14:22
  • I just see it as safe (I can't give you a reference, but I have a vague memory of having read it somewhere). The idea is to repair the node in isolation by cutting all communication with the rest of the replica set. – Héctor Valverde May 05 '16 at 14:28