
I'm running some servers on Hetzner (AX101) and have been experiencing random reboots for a while now. All my investigations have led absolutely nowhere.

Environment: Ubuntu 22.04 (Ubuntu 5.15.0-58.64-generic 5.15.74)

From the system's standpoint, it looks like nothing is happening. The log goes straight from routine UFW entries to a fresh boot:

Feb  6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:08:00 SRC=185.156.73.150 DST=138.201.121.186 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=26829 PROTO=TCP SPT=53764 DPT=5492 WINDOW=1024 RES=0x00 SYN URGP=0
Feb  6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:86:dd SRC=240b:4005:0018:3b00:88cd:89dd:7daf:c400 DST=2a01:04f8:0172:24e2:0000:0000:0000:0002 LEN=60 TC=0 HOPLIMIT=245 FLOWLBL=0 PROTO=TCP SPT=35153 DPT=20000 WINDOW=65535 RES=0x00 SYN URGP=0
Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version 5.15.0-58-generic (buildd@lcy02-amd64-101) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 (Ubuntu 5.15.0-58.64-generic 5.15.74)
Feb  6 10:46:18 server4 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-58-generic root=UUID=76ab4da2-200e-48f1-8831-51fcf6935563 ro consoleblank=0 systemd.show_status=true nomodeset consoleblank=0
Feb  6 10:46:18 server4 kernel: [    0.000000] KERNEL supported cpus:
Feb  6 10:46:18 server4 kernel: [    0.000000]   Intel GenuineIntel
Feb  6 10:46:18 server4 kernel: [    0.000000]   AMD AuthenticAMD
Feb  6 10:46:18 server4 kernel: [    0.000000]   Hygon HygonGenuine
Feb  6 10:46:18 server4 kernel: [    0.000000]   Centaur CentaurHauls
Feb  6 10:46:18 server4 kernel: [    0.000000]   zhaoxin   Shanghai
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Feb  6 10:46:18 server4 kernel: [    0.000000] signal: max sigframe size: 3376
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-provided physical RAM map:
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
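
A minimal sketch of how such reboot boundaries could be located in a longer log, assuming the classic syslog format shown above (wall-clock stamp, hostname, `kernel:`, then the kernel uptime in brackets). The names `find_reboots` and `LINE_RE` and the `year` parameter are illustrative only; syslog lines carry no year, so it has to be supplied. A reboot is assumed wherever the kernel uptime stamp goes backwards.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
#   "Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version ..."
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\]"
)


def find_reboots(lines, year=2023):
    """Return a list of (last_seen, boot_start, last_uptime_seconds) tuples.

    One tuple is produced each time the kernel uptime stamp decreases,
    which is taken here as the sign of a reboot (an assumption).
    """
    reboots = []
    prev = None  # (wall-clock time, uptime) of the previous kernel line
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # non-kernel lines and wrapped continuations are skipped
        # Collapse the padded day ("Feb  6") before parsing.
        stamp = " ".join(m.group("ts").split())
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        uptime = float(m.group("uptime"))
        if prev is not None and uptime < prev[1]:
            reboots.append((prev[0], ts, prev[1]))
        prev = (ts, uptime)
    return reboots
```

Applied to the excerpt above, this reports one reboot: the last line before it is stamped 10:44:37 and the first line of the new boot 10:46:18, a gap of 101 seconds, after about 256,110 seconds of uptime.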

Everything works as expected, until it doesn't. The server goes down for about two minutes and then just reappears, booting the system.

The NVMe disks look perfectly fine:

smartctl -A /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    55,954,348 [28.6 TB]
Data Units Written:                 76,540,527 [39.1 TB]
Host Read Commands:                 993,043,774
Host Write Commands:                1,875,329,624
Controller Busy Time:               1,396
Power Cycles:                       5
Power On Hours:                     4,902
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               49 Celsius
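
To compare these counters across a reboot rather than eyeballing them, the output could be saved before and after and diffed. The following is only a sketch that parses the `Name: value` layout shown above; `parse_smart_attributes` and `changed_counters` are hypothetical helper names, and the default choice of "Power Cycles" and "Unsafe Shutdowns" as the fields to watch is an assumption, not a recommendation from smartmontools.

```python
def parse_smart_attributes(text):
    """Parse 'Name: value' lines from `smartctl -A` NVMe output into a dict.

    Values are kept as the raw strings printed by smartctl (e.g. "4,902").
    """
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # banner lines and headers have no "name: value" pair
        # Collapse repeated spaces, e.g. "Warning  Comp. Temperature Time".
        attrs[" ".join(name.split())] = value.strip()
    return attrs


def changed_counters(before, after, keys=("Power Cycles", "Unsafe Shutdowns")):
    """Return {name: (old, new)} for watched fields that differ."""
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }
```

How a change (or the lack of one) should be interpreted depends on the drive and its firmware, so the result is a hint for further digging rather than a diagnosis.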

I also ran memtest, which found no issues.

From a software standpoint, there is nothing special running there: PostgreSQL and node exporter, and that's basically it.

I contacted Hetzner about this problem and they even replaced all the hardware, but the problem persists, which makes me think it is likely a software issue (I doubt it is power surges).

Any suggestions on where I can dig into this problem further?

Danny
  • Have you ever figured this out? I started experiencing the same on my AX41-NVME dedicated server. – user0103 Jun 04 '23 at 22:25
  • @user0103 Unfortunately not. Generally, I'd say the most efficient way is to contact Hetzner support; they can perform a full hardware check. If the problem persists, they can replace the server. Basically, it's the only thing that helped with those reboots. – Danny Jun 06 '23 at 06:53
  • Thanks for taking the time to reply. I requested a full hardware check; they did it and reported that no errors were found. I suspect that it's a PSU issue: they check RAM and disks, stress-test the CPU, etc., but it definitely looks like a power issue. I had something similar on my desktop PC before I replaced the PSU. In the end, I got two such hard resets in one day yesterday and migrated to another server. No issues so far despite using the same software and OS (Ubuntu 22.04), so it seems that it is indeed a hardware issue that can be resolved only by replacing the server. – user0103 Jun 06 '23 at 15:11
  • @user0103 I also suspect PSU issues, but I haven't found any confirmation of it, either from Hetzner or from the system side. In my experience, the best thing to do now is to wait patiently until the server fails again; once that happens, just reopen the ticket and say that the issue persists. – Danny Jun 08 '23 at 07:08
