
Everything was running fine on our old machine running Debian 10. We recently moved to a new server, also running Debian 10, on a 1st-gen Xeon Scalable CPU. My processes start up fine, but after a bit over an hour (for example, when running IPython or Jupyter) the process mysteriously segfaults. I ran this simple test program through gdb:

import time
for i in range(36000):
    print('\r {0}'.format(i), end='')
    time.sleep(1)

and got the following backtrace:

(gdb) run test-seg-fault.py 
Starting program: /home/breitsbw/anaconda3/bin/python test-seg-fault.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
 7791
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) backtrace
#0  0x0000000000000000 in ?? ()
#1  0x000055555566e900 in PyEval_RestoreThread (tstate=0x5555558c36d0) at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:270
#2  0x00005555556dbbcc in pysleep (secs=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Modules/timemodule.c:1844
#3  time_sleep (self=<optimized out>, obj=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Modules/timemodule.c:371
#4  0x00005555556b857d in _PyMethodDef_RawFastCallKeywords (method=0x555555875120 <time_methods+288>, self=0x7ffff77b3d70, args=0x7ffff785b5d0, nargs=<optimized out>, kwnames=<optimized out>) at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:648
...

You can see that this time the program ran for a bit over 2 hours (7791 seconds) before segfaulting. I have no idea how to investigate #0 0x0000000000000000 in ?? () from the gdb backtrace any further.

I have already tried reinstalling Anaconda several times, including the 2019.07 version. I have verified dozens of times that this segmentation fault keeps happening, whether I run Jupyter, IPython, or just the simple Python script above. Outside of gdb, the script usually runs for about 3800 seconds. I'm not sure why running under gdb delayed the segfault.
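To pin down the timing more precisely, a variant of the test script along the following lines might help. This is only a sketch: the function name run_heartbeat and its parameters are my own, not anything from Anaconda or CPython. It records elapsed wall-clock time with time.monotonic() and enables the standard-library faulthandler module, which dumps the Python-level traceback to stderr when a fatal signal such as SIGSEGV arrives. I would not expect that traceback to explain the null jump in frame #0, but it does confirm which Python call was active at the time.

```python
import faulthandler
import sys
import time


def run_heartbeat(n_ticks, interval=1.0, out=sys.stdout):
    """Sleep n_ticks times, printing the tick count and elapsed time after each sleep."""
    start = time.monotonic()
    ticks = 0
    while ticks < n_ticks:
        time.sleep(interval)
        ticks += 1
        elapsed = time.monotonic() - start
        # flush=True so the last line printed reflects the last completed sleep
        print('\r {0} ticks, {1:.1f} s elapsed'.format(ticks, elapsed),
              end='', file=out, flush=True)
    return ticks


if __name__ == '__main__':
    # On SIGSEGV (and similar fatal signals), write the Python traceback
    # of all threads to stderr before the process dies.
    faulthandler.enable()
    run_heartbeat(36000)
```

Running this both with and without gdb, and comparing the last elapsed time printed, should show whether the crash follows wall-clock time or the number of sleep calls.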

Since posting this question, we have found that others are also experiencing the segfault when running extended processes (taking longer than 1 hour) using other programs (e.g. Fortran executables). We believe there is a hardware-software interaction that is causing the issue. (Now it might make sense to move this question to a new community -- Could a moderator please advise?)

Any tips on what to investigate or how to debug this further would be greatly appreciated.

Brian
  • It turns out this might be related to the Lustre filesystem: https://jira.whamcloud.com/browse/LU-13137 – Brian Jan 15 '20 at 18:45
