
I'm trying to understand why my server crashes. It reboots itself after a few minutes when I launch

stress-ng -d 9

The last logs I received are these ones:

[pid  1547] write(3, "Z\26\260\2273\0Z\346\251\232\311\273e\10\263\6  \376\325(\330O\fG\326\326\330w\344\214t"..., 65536 <unfinished ...>
[pid  1546] write(3, "eT\323a\304\314\300^\25\360\224\224\20\342\6\201!\323\314T\nV\10A\214\25c!\256[\300K"..., 65536 <unfinished ...>
[pid  1545] write(3, "\3135\271\370\264\366\20\307\354\260a\236\337\223,\233u\212\327 a~\37\251\\E\365\217wR\304\200"..., 65536 <unfinished ...>
[pid  1544] write(3, "\357\240\353\341/\345\257\324\205\202&\342\25`\2162\306R\306\275\367\0061\206,ex(T\247S|"..., 65536 <unfinished ...>
[pid  1543] write(3, "\31\345T[a\35\201F\341\343\5\243F\250\23\221r\301\0367\221\3\202\320\310\32\263-\204B\234\32"..., 65536 <unfinished ...>
[pid  1547] <... write resumed> )       = 65536
[pid  1546] <... write resumed> )       = 65536
[pid  1542] write(3, "f;\337\363\340\332)\32nS:\204\254ab\223A\233Z\2\265.j\254\244\324b!p\275Xz"..., 65536 <unfinished ...>
[pid  1541] write(3, "\356\327\\`*\4K\350\

(the server crashes in the middle of the last line!)

I checked the SMART data with smartctl and everything seems normal:

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   100   000    Old_age   Always       -       28
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       112
166 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3
169 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       33
170 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       2
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       111
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   050   100   000    Old_age   Always       -       50 (Min/Max 0/52)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       1
230 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       530
241 Total_LBAs_Written      0x0030   253   253   000    Old_age   Offline      -       489
242 Total_LBAs_Read         0x0030   253   253   000    Old_age   Offline      -       507

The speed of the disk seems OK too:

root@aaa:/home/customer# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 24102 MB in 2.00 seconds = 12063.26 MB/sec
Timing buffered disk reads: 968 MB in 3.00 seconds = 322.25 MB/sec
root@aaa:/home/customer# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 24290 MB in 2.00 seconds = 12156.88 MB/sec
Timing buffered disk reads: 968 MB in 3.00 seconds = 322.28 MB/sec

Any idea?

Kevin
  • Don't run `stress-ng -d 9`? – ewwhite Nov 12 '15 at 10:07
  • This is the minimum command to crash it quickly. I have the same problem when performing normal operations on my server after some hours. – Kevin Nov 12 '15 at 10:08
  • I mean, without details about your hardware, OS, etc... What are you asking? – ewwhite Nov 12 '15 at 10:08
  • What's on the console when it crashes? – MadHatter Nov 12 '15 at 10:08
  • The last logs I have when using strace are what I posted in the question. Without using strace (launching stress-ng directly), I don't have any log. I'm asking how I can find out where the problem comes from. I have an E3-1220 with a 240GB SSD. I'm on Debian 8, but tried before with Ubuntu 14.04 LTS and Proxmox 3.4 with the same result. – Kevin Nov 12 '15 at 10:11
  • I didn't ask you what was in the logs, I asked you what was on the screen. The kernel often doesn't log all the crash information to HDD, particularly if it goes down because of FS or VM problems - it risks corrupting the FSes if it writes to them at such times. A photographic image of the contents of the console screen can be invaluable in such cases. – MadHatter Nov 12 '15 at 10:25
  • Thank you for your help. I don't have access to a "screen" on this server. – Kevin Nov 12 '15 at 10:31
  • Are you telling us that it's virtualised? If so, what's the virtualisation technology? Or is it containerised? If neither, why do you not have access to the console? – MadHatter Nov 12 '15 at 10:38
  • I have access to the console. The last messages I received when launching `strace stress-ng -d 9` are the ones I posted in my question (the last line is `[pid 1541] write(3, "\356\327\\*\4K\350\ ` ). Does that answer your question? – Kevin Nov 12 '15 at 11:04
  • Sorry to belabour the point, but are you saying that the kernel logged **nothing** to the console when it crashed? It just hung, hard? And to confirm that we're talking about the **physical** console, here - the screen on which the BIOS messages appear at boot time? – MadHatter Nov 12 '15 at 11:07
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/31442/discussion-between-kevin-and-madhatter). – Kevin Nov 12 '15 at 11:09
  • In chat, you have confirmed that you don't have access to the console. My feeling is that without seeing what's on the console, you're not going to be able to understand your server's failure. You're going to try to secure some kind of console access via remote-hands-and-eyes, and report back. – MadHatter Nov 12 '15 at 11:28
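A side note on the console-access problem discussed above: when there is no physical or out-of-band console, the kernel's netconsole module can stream console messages, including oops and panic output, over UDP to another machine on the same network. A minimal sketch, where the interface name, IP addresses, and MAC address are placeholder assumptions to adjust for your network:

```shell
# On a second machine on the same LAN: listen for kernel messages on UDP 6666
# (netcat flag syntax varies slightly between netcat implementations)
nc -u -l 6666

# On the crashing server: load netconsole, pointing at the receiver.
# Format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac -- all values are examples.
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff

# Raise the console log level so all kernel messages are forwarded
dmesg -n 8
```

Because the messages leave the machine over the network rather than being written to disk, this often captures the final panic lines that never make it into the on-disk logs.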

1 Answer


It may be worth factoring that particular drive out of the test, e.g. by re-running the test on an externally mounted drive, just to see whether it is a generic kernel issue or an issue with that particular drive. The -d HDD stress-ng stressor just hammers the file system with a lot of generic read/write patterns, so it is surprising that it is causing this kind of hang. I therefore suspect it may be an issue with that particular drive.
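As a hypothetical illustration of the suggestion above (the device name and mount point are assumptions, adjust to your setup): the hdd stressor writes its scratch files under the current working directory, so pointing the test at another disk is just a matter of running it from a directory on that disk. Bounding the runtime with --timeout lets the test stop on its own if the machine survives:

```shell
# Mount an external or secondary drive (assumed to be /dev/sdb1 -- adjust)
mkdir -p /mnt/testdisk
mount /dev/sdb1 /mnt/testdisk

# Run the same hdd stressor from that disk, with a bounded runtime
# and a brief metrics summary at the end
cd /mnt/testdisk
stress-ng -d 9 --timeout 300s --metrics-brief
```

If the machine survives this but still crashes when the stressor runs on the original SSD, that points strongly at the drive (or its controller/cabling) rather than at the kernel.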

Colin King