
I have a program that has been running fine for months. Two days before it is due to be shown to the brass, it now hangs right at startup, while exploring a directory structure. Here's a backtrace:

#0  futex_wait (private=0, expected=2, futex_word=0x55555556c204) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait_private (futex=futex@entry=0x55555556c204) at lowlevellock.c:35
#2  0x00007ffff7e30a03 in __GI___readdir64 (dirp=0x55555556c200) at ../sysdeps/unix/sysv/linux/readdir64.c:37
#3  0x000055555555be55 in N2_GetRunNumbers (RootDirName=0x7fffffffca10 "LiveData", Direct=1, RunNoList=0x7fffffffc938) at N2readData.c:471

And here's the corresponding code (line 471 is the readdir call):

struct dirent *dir;
DIR *d = opendir(RootDirName);
if (d) 
    while ((dir = readdir(d)) != NULL) { ...}

Reading the documentation on readdir, nowhere does it state that it can be blocking.

Now the hitch is that the directory I'm reading is mounted via sshfs, though I have full R/W access to it. So what could be the reason?

dargaud
  • This `readdir` evidently provides some thread safety; there is indeed a lock within the `DIR` stream (the backtrace shows the futex word at `dirp + 4`): https://github.com/bminor/glibc/blob/11ba44f3a7a5a280b942639a13c77d2364177419/sysdeps/unix/sysv/linux/readdir64.c#L37. But it should only be contended if this same `DIR *` is shared between multiple threads. Could that be the case in your application? If not, I'd suspect generic memory corruption that may have gotten the lock into an impossible state. An MRE would help. – Nate Eldredge Mar 24 '22 at 13:08
  • On the prod server there might very well be thread issues, but I'm currently testing on my devel machine, where there shouldn't be any. I should add that there are currently 26000 files in the dir (`ls` takes about 2s). And what's an MRE besides a military Meal Ready to Eat? – dargaud Mar 24 '22 at 13:27
  • I just checked where `readdir` stops, and it seems to always be at the 20031st iteration, after a normal file. – dargaud Mar 24 '22 at 13:35
  • Memory corruption was indeed the cause. I was copying file names into a buffer that wasn't big enough... Thanks – dargaud Mar 24 '22 at 13:48
  • Glad you found it! By the way, MRE = minimal reproducible example. – Nate Eldredge Mar 24 '22 at 18:40
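For anyone landing here with the same symptom: the actual copying code was never posted, so the following is only a minimal sketch of a bounded copy of `d_name` that cannot write past its destination. The names `copy_name`, `list_names` and `NAME_BUF_SIZE` are hypothetical and are not taken from the original program.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define NAME_BUF_SIZE 64   /* hypothetical capacity, chosen for illustration */

/* Copy src into dst (capacity dst_size) without writing past the end.
 * Returns 0 on success, -1 if the name did not fit; in that case dst holds
 * a truncated, still NUL-terminated prefix (when dst_size > 0). */
static int copy_name(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    int n = snprintf(dst, dst_size, "%s", src);
    return (n < 0 || (size_t)n >= dst_size) ? -1 : 0;
}

/* Walk a directory and store up to max_names entry names in names[].
 * Returns the number of names stored, or -1 if the directory cannot be
 * opened. Names that do not fit are skipped and counted in *skipped
 * instead of overflowing the buffer. */
static int list_names(const char *root, char names[][NAME_BUF_SIZE],
                      int max_names, int *skipped)
{
    DIR *d = opendir(root);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    *skipped = 0;
    while (count < max_names && (entry = readdir(d)) != NULL) {
        if (copy_name(names[count], NAME_BUF_SIZE, entry->d_name) == 0)
            count++;
        else
            (*skipped)++;
    }
    closedir(d);
    return count;
}
```

The point of the sketch is that both the per-name length and the number of stored names are checked, so an unusually long file name or a directory with more entries than expected gets reported rather than overwriting neighbouring heap memory such as the `DIR` stream and its lock.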
