16

My xinetd daemon suddenly stopped working after a kernel upgrade (from 2.6.24 to 2.6.33). I've run an strace and found this:

[...]
close(3)                                = 0
munmap(0x7f1a93b43000, 4096)            = 0
getrlimit(RLIMIT_NOFILE, {rlim_cur=8*1024, rlim_max=16*1024}) = 0
setrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=1024}) = 0
close(3)                                = 4294967287
exit_group(1)                           = ?

So basically, it looks like the close system call returned something different than 0 or -1

I did several tests and it appears that it happens only with 64bit executables:

$ file closetest32
closetest32: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
$ strace closetest32
execve("./closetest32", ["closetest32"], [/* 286 vars */]) = 0
[ Process PID=4731 runs in 32 bit mode. ]
open("/proc/mounts", O_RDONLY)          = 3
close(3)                                = 0
close(3)                                = -1 EBADF (Bad file descriptor)
_exit(0)                                = ?


$ file closetest64
closetest64: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), statically linked, not stripped
$ strace closetest64
execve("./closetest64", ["closetest64"], [/* 286 vars */]) = 0
open("/proc/mounts", O_RDONLY)          = 3
close(3)                                = 0
close(3)                                = 4294967287
_exit(0)                                = ?

I'm running the following kernel:

Linux foobar01 2.6.33.9-rt31.64.el5rt #1 SMP PREEMPT RT Wed May 4 10:34:12 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

The worst part is that I cannot reproduce the bug on another machine with the same kernel.

Any ideas ?

EDIT: as requested: here's the code used for closetest32 and closetest64

closetest32.asm:

.section .data

filename:
    .ascii "/proc/mounts"

.section .text
.globl _start
_start:
    xorl %edi, %edi
    movl $5, %eax # open() i386 system call
    leal filename, %ebx # %ebx ---> filename
    movl $0, %esi # O_RDONLY flag into esi
    int $0x80

    xorl %edi, %edi
    movl $6, %eax # close() i386 system call
    movl $3, %ebx # fd 3
    int $0x80

    xorl %edi, %edi
    movl $6, %eax # close() i386 system call
    movl $3, %ebx # fd 3
    int $0x80

    ## terminate program via _exit () system call
    movl $1, %eax # %eax =  _exit() i386 system call
    xorl %ebx, %ebx # %ebx = 0 normal program return code
    int $0x80

compiled as:

as test32.asm -o test32.o --32
ld -m elf_i386 test32.o -o closetest32

closetest64.asm:

.section .data

filename:
    .ascii "/proc/mounts"

.section .text
.globl _start
_start:
    xorq %rdi, %rdi
    movq $2, %rax # open() system call
    leaq filename, %rdi # %rdi ---> filename
    movq $0, %rsi # O_RDONLY flag into rsi
    syscall

    xorq %rdi, %rdi
    movq $3, %rax # close() system call
    movq $3, %rdi # fd 3
    syscall

    xorq %rdi, %rdi
    movq $3, %rax # close() system call
    movq $3, %rdi # fd 3
    syscall

    ## terminate program via _exit () system call
    movq $60, %rax # %rax = _exit() system call
    xorq %rdi, %rdi # %rdi = 0 normal program return code
    syscall

compilation:

as test64.asm -o test64.o
ld test64.o -o closetest64
sevenup
  • 170
  • 8
  • 6
    Not sure if it is related, but possibly coincidentally, the returned value 0xFFFFFFF7 = -9. And EBADF = +9. Curious. – Mark Wilkins Apr 26 '12 at 15:00
  • Could you show the source code of closetest64/32? – JeremyP Apr 26 '12 at 15:18
  • What version is the strace binary and glibc on both machines? I just made a simple test and strace 4.5.20 prints -1 EBADF for both 32/64. libc - 2.13 – strkol Apr 26 '12 at 15:25
  • @Mark Wilkins: indeed! And I guess it's not a coincidence. If you look at the source code of the close() system call (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=fs/open.c;h=5720854156dbd61e28598da83a529294401eb119;hb=HEAD), you'll see a return -EBADF on line 1099 which is possibly what happens. – sevenup Apr 26 '12 at 15:30
  • Probably related: there is similar [bug](http://old.nabble.com/-PATCH--dup2:-Fix-return-value-with-oldfd-%3D%3D-newfd-and-invalid-fd-td23488867.html) for `dup2` – ks1322 Apr 26 '12 at 15:43
  • @strkol strace 4.5.18 and libc 2.1.2 on both of them – sevenup Apr 26 '12 at 16:20
  • @ks1322 you may be right, it looks like the same signed/unsigned issue. What bothers me now is the fact that the return value is different on two different machines running the same kernel, with the same strace and libc versions :) – sevenup Apr 26 '12 at 16:26
  • Best guess is that this is a bug in strace -- its incorrectly extracting and printing the return value from the syscall. See if you can upgrade strace to something that supports your kernel version. – Chris Dodd Apr 26 '12 at 16:43
  • 2
    It's not an strace bug, it's a kernel bug. This is a 64-bit machine, so 1<<32-9 should **never** appear. The problem is that the kernel is internally using `unsigned` instead of `int` for the return value of some of these functions, then returning `-EBADF` which gets reduced modulo 2^32 rather than modulo 2^64. – R.. GitHub STOP HELPING ICE Apr 26 '12 at 16:51
  • @R.., but close()'s return value is signed int, so it should never exceed 2147483647, that's why I think it's strace bug. – strkol Apr 26 '12 at 16:57
  • 11
    No, `close` (the syscall) can return `-err` on failure, where `err` is any valid error code. The problem is that the *generic* code in the libc syscall wrappers that handles syscall error returns has to treat the return value as a `long` (since it could be a pointer or long for some syscalls) when making the comparison to see if it's a small negative value that would indicate an error. But the kernel returned `(long)(unsigned)-9` which is very different from `(long)-9`. or `(unsigned long)-9` (either of which would have worked). – R.. GitHub STOP HELPING ICE Apr 26 '12 at 17:17
  • You can also verify return value in gdb on both machines and compare results with strace output. – ks1322 Apr 26 '12 at 18:36
  • At the risk of appearing stupid, is it significant that `close` is being called twice on descriptor 3 ? The first succeeds, the second of course fails. Maybe is that what closetest is testing but the original strace for xinetd also shows this. – William Morris May 06 '12 at 13:09
  • As you can see in the first trace, xinetd is calling setrlimit to specify the maximum file descriptor number that can be opened him. When the program starts, it closes all the unneeded file descriptor maybe just to ensure that its environment is safe and clean. However, as you can see [here](http://www.xinetd.org/cgi-bin/cvsweb.cgi/xinetd/xinetd/init.c?rev=1.10&content-type=text/x-cvsweb-markup) in the setup_file_descriptors function, xinetd exits in case of a wrong return value. tl;dr: even with a wrong fd value, close should not fail. It should return a negative value with the proper errno. – sevenup May 06 '12 at 16:42
  • I think this issue got answered in the first few comments. Can't it be sufficient to test for `< 0` instead of `== -1` in the return code from `close()`? Perhaps you should post the answer that satisfies you, @sevenup, or a new question. The discussion is starting to drift. – Randall Cook May 08 '12 at 06:23
  • The thing is I cannot patch all the programs that run in my system. For example, a python script as simple as: import os fd = os.open('/proc/mounts', 0) ret = os.close(fd) ret = os.close(fd) will fail with a 64 bits python interpreter (raise a SystemError) but will work fine in 32 bits (OSError). Anyway, you're right, it's a kernel bug and all I need to do is rollback to a previous version or to a more recent one. – sevenup May 08 '12 at 09:24

1 Answers1

1

As expected, a rollback to a previous kernel version solved the problem. I'm not really a kernel specialist but as far as I understand, the answer given by @R.. makes sense:

This is a 64-bit machine, so 1<<32-9 should never appear. The problem is that the kernel is internally using unsigned instead of int for the return value of some of these functions, then returning -EBADF which gets reduced modulo 2^32 rather than modulo 2^64

The problem is that the generic code in the libc syscall wrappers that handles syscall error returns has to treat the return value as a long (since it could be a pointer or long for some syscalls) when making the comparison to see if it's a small negative value that would indicate an error. But the kernel returned (long)(unsigned)-9 which is very different from (long)-9. or (unsigned long)-9 (either of which would have worked).

Community
  • 1
  • 1
sevenup
  • 170
  • 8