I have a Linux server program. After two or three days of running, the server stops responding. When I look at the server log, I find "connection reset by peer" messages.

I have to restart the server. Everything works fine, and then after another two or three days the same problem occurs.

Any help would be appreciated.

Edit: Sorry for the sparse description.

It is a simple server I have written that accepts TCP connections. It is written in C# and runs on Mono. I start the server as follows: `nohup mono StartServer.exe &`

A file called nohup.out is created in the same directory and acts as the log. After two to three days, the mono process is still alive, but clients (Silverlight clients) cannot connect to the server: they get an 'accessdenied' message. The same 'accessdenied' message also appears when the server is down, so I guess the server is not responding.

When I look in the nohup.out file, I find 'connection reset by peer' error messages.

I initially planned to host it on a Windows server but found that it works fine on a Linux server as well. Since I am not very experienced with Linux, I wonder whether I have missed an obvious configuration setting.

What puzzles me is that everything works properly until around the three-day mark.

I load tested it by having one client open multiple connections to the server, and I got the same error within one day. The number of connections was around 30.

gtan
  • You can't get to that server with ping, www, ssh, telnet, ftp? – ott-- Oct 15 '11 at 17:02
  • Too vague to help. Which service are you running on this server? Do you have any monitoring system? What happens at that time? In what log file do you see this error? – quanta Oct 15 '11 at 17:08
  • What kind of connection is it? TCP? UDP? Is it idle for long periods of time? What is it connected to (who is this peer that's resetting the connection?)... Setting aside the network problems that could cause it, since you didn't name the server, is it some custom server? Could it have a bug preventing it from accepting new connections when a client disconnects? – DerfK Oct 15 '11 at 17:54
  • I have updated the description. @DerfK It is TCP. Even under load, after around 3 days, it stops. It is connected to silverlight clients. Yes, it is a custom server that I have written. – gtan Oct 15 '11 at 18:02
  • So it is a custom daemon that runs up to about 3 days and it is cleared up someplace. It sounds like there is resource starvation somewhere in the app - most likely a memory leak. Have you checked how much memory your daemon is using when it starts rejecting connections? – Rilindo Oct 15 '11 at 18:09
  • @Rilindo When I check the resource usage, it is more or less the same as when the app originally started. – gtan Oct 15 '11 at 18:13
  • How many clients? Are these persistent connections? I'm wondering if you're hitting a file descriptor limit or some internal counter mess-up. What does `netstat -ano | grep portnumber` say about the number of connections when the server isn't working? Also, did you confirm (`tail -f nohup.out` to watch what is happening as it happens) that attempting to connect with a client is directly related to the "connection reset by peer" error, and that "connection reset by peer" isn't its normal message for a client disconnect? – DerfK Oct 15 '11 at 18:22
  • Are the "connection reset by peer" messages being generated while the server is in the failed state? Or are they just the last messages in the log and no new messages are generated while it's failed? – David Schwartz Oct 16 '11 at 04:51
  • @DerfK I presume a maximum of 50 clients per day. They are not persistent. I will try your suggestions. – gtan Oct 16 '11 at 10:01
  • @DavidSchwartz when the server is in the failed state – gtan Oct 16 '11 at 10:40
  • @DerfK I tried some of your suggestions and found a fix. Some connections were not properly closed. – gtan Oct 21 '11 at 12:24

1 Answer

I would've run

# netstat -anp

and

# lsof -n

to see whether there is a connection or file-handle leak. I suspect that connections or files aren't being closed properly and that the process has exceeded 1024 open files after a while (the usual default per-process limit on open files, unless changed with ulimit), which in turn would prevent it from creating new connections.
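
One rough way to put a number on the suspected leak, assuming a Linux system with /proc mounted, is to compare the count of open descriptors for the process against its soft limit. The sketch below is only an illustration: `fd_usage` and `PROC_ROOT` are made-up names, and the 1024 figure mentioned above is whatever `/proc/<pid>/limits` actually reports on the machine.

```shell
#!/bin/sh
# Sketch: compare a process's open descriptors with its soft "open files" limit.
# fd_usage and PROC_ROOT are illustrative names, not part of any standard tool.
fd_usage() {
    pid=$1
    root=${PROC_ROOT:-/proc}    # overridable so the function can be tried on a fake tree

    # Every entry under <root>/<pid>/fd is one open descriptor (file, socket, pipe).
    open=$(ls "$root/$pid/fd" | wc -l | tr -d ' ')

    # Row format in /proc/<pid>/limits: "Max open files  <soft>  <hard>  files",
    # so field 4 is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "$root/$pid/limits")

    echo "open=$open limit=$limit"
}
```

Calling `fd_usage 12345` (where 12345 stands in for the PID of the mono process) as the same user or as root, and repeating it every few hours, would show whether `open` keeps climbing toward `limit`.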

Kvisle
  • I need to try these. – gtan Oct 16 '11 at 10:04
  • I used `netstat | grep portnumber`. Although I had not reached the limit, I found that there were many half-open connections. Closing them resolved the problem. – gtan Oct 21 '11 at 12:25
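
The `netstat | grep portnumber` check from the comments can be turned into a per-state tally, which makes a build-up of unclosed connections easier to spot. This is a minimal sketch, assuming the column layout printed by the Linux net-tools `netstat -ant` (local address in column 4, state in column 6). `tally_states` is a hypothetical helper, and the thread does not say which state the leftover connections were in.

```shell
#!/bin/sh
# Sketch: count TCP connection states for one local port.
# Reads `netstat -ant` style output on stdin; tally_states is an illustrative name.
tally_states() {
    awk -v port="$1" '
        $1 ~ /^tcp/ {
            n = split($4, a, ":")            # local address; the port is the last piece
            if (a[n] == port) count[$6]++    # column 6 holds the TCP state
        }
        END { for (s in count) print s, count[s] }
    ' | sort
}
```

Typical use would be `netstat -ant | tally_states 4530`, where 4530 is a placeholder for the server's port. A count that only ever grows in a state such as CLOSE_WAIT usually means the local program is not closing sockets whose peer has already disconnected.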