
I am working on a relatively large code base where I am seeing a file descriptor leak: after I run certain programs, processes start complaining that they cannot open files.

This normally takes about 6 days to happen, but I can reproduce the problem in 3-4 hours by reducing the value in /proc/sys/fs/file-max to 9000.

There are many processes running at any moment. I have been able to pinpoint a couple of processes that could be causing the leak. However, I don't see any file descriptor leak either through lsof or through /proc/<pid>/fd.

If I kill the processes that I suspect of leaking (they communicate with each other), the leak goes away and the FDs are released.

cat /proc/sys/fs/file-nr in a while(1) loop shows the leak. However, I don't see any leak in any process.
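The polling loop mentioned above can be sketched roughly as follows. This is only an illustration: the helper names `allocated_handles` and `watch_file_nr` are made up for this example, and it assumes the first field of /proc/sys/fs/file-nr is the number of allocated file handles.

```shell
#!/bin/bash
# Sketch: report how the system-wide count of allocated file handles grows.
# allocated_handles and watch_file_nr are hypothetical helper names.

# Print the first field of a file-nr style line ("allocated unused max").
allocated_handles() {
    awk '{ print $1 }' "${1:-/proc/sys/fs/file-nr}"
}

# Take $2 samples, $1 seconds apart, printing the change since the first one.
# $3 optionally names the file to read (defaults to /proc/sys/fs/file-nr).
watch_file_nr() {
    local interval=$1 samples=$2 src=${3:-/proc/sys/fs/file-nr}
    local base now i=0
    base=$(allocated_handles "$src")
    while [ "$i" -lt "$samples" ]; do
        now=$(allocated_handles "$src")
        echo "$(date '+%H:%M:%S') allocated=$now growth=$((now - base))"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, `watch_file_nr 2 30` would print one line every 2 seconds for a minute; a growth value that keeps climbing while per-process counts stay flat matches the symptom described here.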

Here is a script I wrote to detect that the leak is happening:

#!/bin/bash
# Log per-process descriptor counts and system-wide totals to pid_monitor.txt.

if [ "$#" -ne 2 ]; then
    name=$(basename "$0")
    echo "Usage : $name <threshold for number of fds> <check_interval>"
    exit 1
fi

fd_threshold=$1
check_interval=$2
nowdate=$(date)
echo "=================================================================================================================================" >> pid_monitor.txt
echo "****************************************MONITORING STARTS AT $nowdate***************************************************" >> pid_monitor.txt

while true
do
    total_num_desc=0
    # NR > 1 skips the header line printed by ps.
    for x in $(ps -ef | awk 'NR > 1 { print $2 }')
    do
        # Plain ls (not ls -l) avoids counting the "total" line as a descriptor.
        num_fd=$(ls "/proc/$x/fd" 2>/dev/null | wc -l)
        # cmdline is NUL-separated; turn the separators into spaces.
        pname=$(cat "/proc/$x/cmdline" 2>/dev/null | tr '\0' ' ')
        total_num_desc=$((total_num_desc + num_fd))
        if [ "$num_fd" -gt "$fd_threshold" ]; then
            echo "Process name $pname($x) and number of open descriptors = $num_fd" >> pid_monitor.txt
        fi
    done
    total_nr_desc=$(cat /proc/sys/fs/file-nr)
    lsof_desc=$(lsof | wc -l)
    nowdate=$(date)
    echo "$nowdate : Total number of open file descriptors = $total_num_desc lsof desc = $lsof_desc file-nr = $total_nr_desc" >> pid_monitor.txt
    sleep "$check_interval"
done

./monitor.fd.sh 500 2 & tail -f pid_monitor.txt

As I mentioned earlier, I don't see any leak in /proc/<pid>/fd for any process, but the leak is definitely happening and the system is running out of file descriptors.

I suspect something in the kernel is leaking. Linux kernel version 2.6.23.

My questions are as follows:

  1. Will 'ls /proc/<pid>/fd' list the descriptors opened by any library linked into the process with that pid? If not, how do I determine whether there is a leak in a library I am linking to?

  2. How do I confirm whether the leak is in userspace or in the kernel?

  3. If the leak is in the kernel, what tools can I use to debug it?

  4. Any other tips you can give me?

Thanks for going through the question patiently.

Would really appreciate any help.

user1342468
  • 1. Yes, it will show all descriptors, including those from the linked libraries. 2. It's very unlikely to have an fd leak in the kernel. 3. See 2. 4. It's very unclear what the problem is; could you provide more details? Which syscall fails, and with what error? – strkol Apr 22 '12 at 10:59

2 Answers


Found the solution to the problem.

A shared memory attach was happening in a function that was called every 30 seconds, and the segment was never detached, hence the descriptor leak. I guess /proc/<pid>/fd doesn't show a shared memory attach as a descriptor, which is why my script was not able to catch the leak.
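One way to look for this kind of leak from the shell is sketched below. It assumes that an attached System V segment shows up in /proc/<pid>/maps as a `/SYSV...` entry rather than under /proc/<pid>/fd; the helper names `count_sysv_maps` and `report_sysv_maps` are made up for this example.

```shell
#!/bin/bash
# Sketch: count SysV shared memory mappings per process.
# count_sysv_maps and report_sysv_maps are hypothetical helper names.

# Count the lines in a maps file that refer to a SysV segment.
# Prints 0 if the file is missing or unreadable.
count_sysv_maps() {
    local n
    n=$(grep -c '/SYSV' "$1" 2>/dev/null) || n=${n:-0}
    echo "${n:-0}"
}

# Print "pid count" for every process with at least $1 such mappings.
report_sysv_maps() {
    local threshold=$1 dir pid n
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        n=$(count_sysv_maps "$dir/maps")
        if [ "$n" -ge "$threshold" ]; then
            echo "$pid $n"
        fi
    done
}
```

Running something like `report_sysv_maps 10` periodically should make a process whose count keeps growing stand out. The nattch column of `ipcs -m` gives a similar picture per segment instead of per process.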

Ghansham
user1342468

Which processes start complaining? And what is the error you see? What is the output of your monitoring script?

To open a file you need two things: a file descriptor and a struct file (also called a file description). The file descriptor is what userspace uses; inside the kernel it is used to look up the struct file. It's not clear to me which of the two you are leaking.
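To see the two side by side, something like the sketch below may help. It sums the descriptors visible under /proc/<pid>/fd and prints them next to the allocated count from file-nr. The helper names and the optional root argument are made up for this example, and the two numbers are not expected to match exactly: descriptors shared through fork or dup point at one struct file, and the kernel can hold a struct file that no descriptor refers to.

```shell
#!/bin/bash
# Sketch: compare descriptors visible in /proc with the kernel's file count.
# sum_open_fds and compare_fds_to_file_nr are hypothetical helper names.

# Sum the entries under <root>/<pid>/fd for every numeric pid directory.
sum_open_fds() {
    local root=${1:-/proc} total=0 dir n
    for dir in "$root"/[0-9]*; do
        [ -d "$dir/fd" ] || continue
        n=$(ls "$dir/fd" 2>/dev/null | wc -l)
        total=$((total + n))
    done
    echo "$total"
}

# Print the visible descriptor total next to the allocated struct file count.
compare_fds_to_file_nr() {
    local root=${1:-/proc} fds allocated
    fds=$(sum_open_fds "$root")
    allocated=$(awk '{ print $1 }' "$root/sys/fs/file-nr")
    echo "visible_fds=$fds allocated_files=$allocated"
}
```

If allocated_files keeps rising between runs of `compare_fds_to_file_nr` while visible_fds stays roughly flat, that suggests struct files are leaking rather than descriptors.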

mpe