
I am trying to run KNIME 2.11.3 on a Scientific Linux cluster (with Sun Grid Engine), submitting the job with qsub and requesting 4 GB of RAM.
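
For context, the submission looks roughly like the sketch below; the script name, paths and KNIME batch options are simplified placeholders rather than my exact command line:

    #!/bin/bash
    # run_knime.sh -- hypothetical job script; paths and flags are placeholders
    #$ -cwd
    #$ -l h_vmem=4G    # request 4 GB of memory from SGE

    # run the workflow headless in KNIME batch mode
    ./knime_2.11.3/knime -nosplash \
        -application org.knime.product.KNIME_BATCH_APPLICATION \
        -workflowDir="$HOME/workflows/my_workflow"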

Java used:

java version "1.8.0_73"
Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)

The problem: KNIME starts the workflow properly, but then crashes, most likely while loading the Weka machine learning nodes. The error I get is as follows:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00002b2774bf2c4c, pid=115080, tid=47451179185920
    #
    # JRE version: Java(TM) SE Runtime Environment (7.0_60-b19) (build 1.7.0_60-b19)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.60-b09 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libc.so.6+0x7fc4c]  cfree+0x1c

What could be happening? (Here is more detail from the log:)

#  SIGSEGV (0xb) at pc=0x00002b2774bf2c4c, pid=115080, tid=47451179185920
#
# JRE version: Java(TM) SE Runtime Environment (7.0_60-b19) (build 1.7.0_60-b19)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.60-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x7fc4c]  cfree+0x1c
#
# Core dump written. Default location: /exports/eddie3_homes_local/pgrabows/core or core.115080
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00002b2820007800):  JavaThread "KNIME-TableIO-1" daemon [_thread_in_vm, id=115118, stack(0x00002b28169df000,0x00002b2816ae0000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0xfffffffffffffff7

Registers:
RAX=0x0000000000000000, RBX=0x00002b277ff37010, RCX=0x00002b2816adf700, RDX=0x0000000000000001
RSP=0x00002b2816addd28, RBP=0x00002b2774970130, RSI=0x0000000000000001, RDI=0xffffffffffffffff
R8 =0x0000000000000020, R9 =0x0101010101010101, R10=0x0000000000000022, R11=0x00002b2774bfbf1e
R12=0x00002b2816addd50, R13=0x00002b2775d2e860, R14=0x00002b27d40b00e0, R15=0x00002b2816adddd0
RIP=0x00002b2774bf2c4c, EFLAGS=0x0000000000010286, CSGSFS=0x0000000000000033, ERR=0x0000000000000005
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00002b2816addd28)
0x00002b2816addd28:   00002b2774970655 00002b277453a000
0x00002b2816addd38:   0000000000000000 00002b27d40b00e0
0x00002b2816addd48:   00002b2774970198 00002b2778002470
0x00002b2816addd58:   00002b27d40b00e0 00002b277574e26d
0x00002b2816addd68:   00002b2774cea5c0 00002b2816adddd0
0x00002b2816addd78:   00002b2778002470 00002b2816addda0
0x00002b2816addd88:   00002b277574e26d 0000000000000006
0x00002b2816addd98:   0000000000000078 00002b2816adde60
0x00002b2816addda8:   00002b27757189f8 00002b277c005310
0x00002b2816adddb8:   00002b27d41387f8 00002b2816addf9f
0x00002b2816adddc8:   00002b27d41387f8 00002b2775d2fa50
0x00002b2816adddd8:   0000005000000000 000000000000002e
0x00002b2816addde8:   0000000000000000 0000000000000000
0x00002b2816adddf8:   00002b27d40affe0 000000000000002e
0x00002b2816adde08:   0000000000000100 0000000000000000
0x00002b2816adde18:   0000000000000000 0000000000000000
0x00002b2816adde28:   00002b27d40afeb0 000000000000002e
0x00002b2816adde38:   00002b27d41387f8 0000000000000000
0x00002b2816adde48:   00002b2820007800 00002b2816addf9f
0x00002b2816adde58:   0000000000000002 00002b2816addec0
0x00002b2816adde68:   00002b277571912e 00002b2820007800
0x00002b2816adde78:   00002b277c01d0bf 00002b27d40affb0
0x00002b2816adde88:   0000000000000000 00002b2816ade098
0x00002b2816adde98:   00002b28200156f0 00002b27d41387f8
0x00002b2816addea8:   00002b2820007800 00002b27d40afea0
0x00002b2816addeb8:   00002b27d41387f8 00002b2816addf20
0x00002b2816addec8:   00002b277571968e 00002b2816addf9f
0x00002b2816added8:   00002b27d40afeb0 00002b27d40b0288
0x00002b2816addee8:   00000000000003d8 00002b2816ade098
0x00002b2816addef8:   00002b2816addf9f 00002b27d41387f8
0x00002b2816addf08:   00002b2820007800 00000000b1e99c80
0x00002b2816addf18:   00002b2820007800 00002b2816addf80 

Instructions: (pc=0x00002b2774bf2c4c)
0x00002b2774bf2c2c:   1f 44 00 00 48 8b 05 b1 a2 33 00 48 8b 00 48 85
0x00002b2774bf2c3c:   c0 0f 85 bf 00 00 00 48 85 ff 0f 84 b4 00 00 00
0x00002b2774bf2c4c:   48 8b 47 f8 48 8d 4f f0 a8 02 75 28 a8 04 48 8d
0x00002b2774bf2c5c:   3d ff aa 33 00 74 0c 48 89 c8 48 25 00 00 00 fc 

Register to memory mapping:

RAX=0x0000000000000000 is an unknown value
RBX=0x00002b277ff37010 is an unknown value
RCX=0x00002b2816adf700 is pointing into the stack for thread: 0x00002b2820007800
RDX=0x0000000000000001 is an unknown value
RSP=0x00002b2816addd28 is pointing into the stack for thread: 0x00002b2820007800
RBP=0x00002b2774970130: <offset 0x1130> in /lib64/libdl.so.2 at 0x00002b277496f000
RSI=0x0000000000000001 is an unknown value
RDI=0xffffffffffffffff is an unknown value
R8 =0x0000000000000020 is an unknown value
R9 =0x0101010101010101 is an unknown value
R10=0x0000000000000022 is an unknown value
R11=0x00002b2774bfbf1e: <offset 0x88f1e> in /lib64/libc.so.6 at 0x00002b2774b73000
R12=0x00002b2816addd50 is pointing into the stack for thread: 0x00002b2820007800
R13=0x00002b2775d2e860: <offset 0xdfa860> in /exports/eddie3_homes_local/pgrabows/usr/bin/knime_2.11.3/jre/lib/amd64/server/libjvm.so at 0x00002b2774f34000
R14=0x00002b27d40b00e0 is an unknown value
R15=0x00002b2816adddd0 is pointing into the stack for thread: 0x00002b2820007800


Stack: [0x00002b28169df000,0x00002b2816ae0000],  sp=0x00002b2816addd28,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0x7fc4c]  cfree+0x1c

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  sun.management.MemoryImpl.getMemoryPools0()[Ljava/lang/management/MemoryPoolMXBean;+0
j  sun.management.MemoryImpl.getMemoryPools()[Ljava/lang/management/MemoryPoolMXBean;+6
j  sun.management.ManagementFactoryHelper.getMemoryPoolMXBeans()Ljava/util/List;+0
j  java.lang.management.ManagementFactory.getMemoryPoolMXBeans()Ljava/util/List;+0
j  org.knime.core.data.util.memory.MemoryWarningSystem.findTenuredGenPool()Ljava/lang/management/MemoryPoolMXBean;+28
j  org.knime.core.data.util.memory.MemoryWarningSystem.<init>()V+26
j  org.knime.core.data.util.memory.MemoryWarningSystem.getInstance()Lorg/knime/core/data/util/memory/MemoryWarningSystem;+10
j  org.knime.core.data.util.memory.MemoryObjectTracker.<init>()V+23
j  org.knime.core.data.util.memory.MemoryObjectTracker.getInstance()Lorg/knime/core/data/util/memory/MemoryObjectTracker;+10
j  org.knime.core.data.container.Buffer.registerMemoryReleasable()V+21
j  org.knime.core.data.container.Buffer.addRow(Lorg/knime/core/data/DataRow;ZZ)V+104
j  org.knime.core.data.container.DataContainer.addRowToTableWrite(Lorg/knime/core/data/DataRow;)V+344
j  org.knime.core.data.container.DataContainer.access$4(Lorg/knime/core/data/container/DataContainer;Lorg/knime/core/data/DataRow;)V+2
j  org.knime.core.data.container.DataContainer$ASyncWriteCallable.callWithContext()Ljava/lang/Void;+101
j  org.knime.core.data.container.DataContainer$ASyncWriteCallable.call()Ljava/lang/Void;+8
j  org.knime.core.data.container.DataContainer$ASyncWriteCallable.call()Ljava/lang/Object;+1
j  java.util.concurrent.FutureTask.run()V+42
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
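
The topmost Java frame is ManagementFactory.getMemoryPoolMXBeans(), so one thing worth trying is a bare-bones probe that makes only that call, submitted through qsub in exactly the same way. A minimal sketch (the class and script names are my own):

    #!/bin/bash
    # probe_job.sh -- hypothetical probe: does the memory-pool MXBean call alone crash under qsub?
    #$ -cwd
    #$ -j y

    cat > MemoryPoolProbe.java <<'EOF'
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class MemoryPoolProbe {
        public static void main(String[] args) {
            // the same call that sits at the top of the crashing stack
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                System.out.println(pool.getName());
            }
        }
    }
    EOF

    javac MemoryPoolProbe.java
    java MemoryPoolProbe

If that probe also crashes, the problem is in the JVM or the environment it runs under rather than in KNIME itself.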

EDIT: adding the whole log: DOWNLOAD LOG FILE (DROPBOX)

EDIT2: adding ulimit and PATH data

The ulimits are different on the two nodes. On the slave (compute) node:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256023
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) 1048576
file locks                      (-x) unlimited

While on the master node:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256023
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 524288
cpu time               (seconds, -t) 600
max user processes              (-u) 200
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

There are also differences between them in $LD_LIBRARY_PATH: the master node has an additional entry, /exports/applications//gridengine/2011.11p1_155/lib/linux-x64.
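
A small probe job along these lines (the script name is my own) makes it easy to capture these values from inside a real qsub job, so they can be diffed against a qlogin session:

    #!/bin/bash
    # env_probe.sh -- hypothetical diagnostic job: dump what the batch job actually sees
    #$ -cwd
    #$ -j y

    echo "== ulimits inside the job =="
    ulimit -a
    echo "== Java picked up from PATH =="
    which java
    java -version 2>&1
    echo "== LD_LIBRARY_PATH =="
    echo "$LD_LIBRARY_PATH"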

FINAL EDIT, FOUND ANSWER:

The answer was to ask the cluster for more RAM: I requested a minimum of 8 GB by passing "-l h_vmem=8G" to qsub. It is odd, because the same workflow runs fine on my old laptop with 4 GB of RAM yet produces this nasty error on the cluster. It is also possible that this is related to our local cluster configuration.
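
In other words, the only change is the memory request at submission time, e.g. with the same hypothetical job script as above:

    # same submission as before, only with a larger virtual-memory request
    qsub -l h_vmem=8G run_knime.sh

My guess is that with only 4 GB the JVM runs into the address-space limit and a native allocation fails mid-flight, which would show up as the SIGSEGV in cfree rather than a clean OutOfMemoryError.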

  • Possible JVM bug. Could you upload the full log? – Jayan Feb 11 '16 at 16:08
  • I uploaded the log file. I changed my Java to 1.7.0_60 on the master node but the problem persists. – Piotr Grabowski Feb 12 '16 at 12:56
  • Could you run the Java program manually on the machine where the crash occurs? It looks like some dependency is not correct. Please add machine details (OS, patch level) to the question. Since you get the exception with newer Java, you could file a defect with Oracle. – Jayan Feb 12 '16 at 16:31
  • The crash in the log and the one in the original question have different causes. I took some liberty to clean up. Please review and provide the correct info. All the best. – Jayan Feb 12 '16 at 16:33
  • I corrected my post and added the proper log error. I tried running the software on a node by logging into it through qlogin. The run was successful. It seems the problem only arises when the job is sent to the node via qsub. – Piotr Grabowski Feb 15 '16 at 12:31
  • What is the difference between the two environments in that case? Typically environment variables such as PATH and LD_LIBRARY_PATH can have an effect. Not sure if you are setting different ulimits. – Jayan Feb 16 '16 at 14:02

0 Answers