
Context

I'm running some calculations on network data using a specific library that I did not write myself. I have both small datasets (hundreds of rows) and large datasets (up to 10k rows).

The small datasets run fine, but the large ones take a lot of time not only in the calculations (which is expected) but also in saving the results to file, which seems odd since I'm only writing a small pandas DataFrame to CSV.

Finally, while doing the same operation in a Jupyter notebook, I encountered the error:

Unexpected error while saving file: Too many open files

I initially attributed this to Jupyter, but it led me to inspect the open files with `lsof`.

My question:

I checked the number of open files by typing the following in Bash:

lsof 2>/dev/null | grep name.surname | cut -f 1 -d ' ' | sort | uniq -c

(I had to grep my user since I'm on a shared server)

I get something like this:

     34 bash
      9 cut
     13 grep
    103 jupyter-l
     30 lsof
  12144 python3
      4 (sd-pam
     10 sort
      4 sshd
     60 systemd
      9 uniq
    103 ZMQbg/19
    103 ZMQbg/20
    103 ZMQbg/25
    412 ZMQbg/9

I see that python3 has a very large number next to it: is that normal, and could it be related to the slow saving?

Note: this happens for both the small and the large datasets, for the whole time the script is running.
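
For reference, here is a sketch of how one could check whether that count comes from a single process or from several, and how close a given process is to its limit (assuming a Linux /proc filesystem; `<PID>` is a placeholder for one of the python3 process IDs, and `name.surname` is my user as above):

    # Count lsof entries per python3 PID rather than per command name
    lsof -u name.surname 2>/dev/null | awk '$1 == "python3" {print $2}' | sort | uniq -c

    # For one PID: actual file descriptors held vs. the process's own limit
    ls /proc/<PID>/fd | wc -l
    grep 'open files' /proc/<PID>/limits

    # lsof also lists memory-mapped files (e.g. shared libraries), not only real FDs;
    # this shows the breakdown of its FD column (cwd, txt, mem, numbered descriptors, ...)
    lsof -p <PID> 2>/dev/null | awk '{print $4}' | sort | uniq -c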

gibbone
  • I would construct a separate directory with the smallest number of files possible and see if that result correlates with the above. Also, it's not clear whether a dataset (small or large) is a group of individual files or a small number of files that are parsed by your process (guessing the former). Good luck. – shellter Nov 20 '19 at 16:11
  • The datasets are just 15 files; most of them are small and only 4 are big. I just replaced "edges" with "rows" to make the question clearer. – gibbone Nov 20 '19 at 16:53
  • What is your problem: security concerns, or do you just want it to run? If you want it to run, you can raise the open-files limit on Linux with `ulimit -n unlimited` before running the script. – geckos Nov 20 '19 at 17:02
  • you can do `strace -e open python3 ...` – geckos Nov 20 '19 at 17:04
  • @geckos my issue is to understand whether it's OK for a Python script to open so many files and whether that may slow down the script (especially when saving to file). I'll check with `strace` to see if I find anything interesting. – gibbone Nov 20 '19 at 17:14
  • I don't see a problem with it opening a lot of files. If I were suspicious of malicious software I would check strace for opened files: it shows every `open()` syscall, so you can see all the files it opened and all the files it tried to open. Otherwise I would just increase the max open files for this process. You can do this per session, or per user (if you need something that persists across sessions) by editing limits.conf. – geckos Nov 20 '19 at 17:19
  • I wouldn't worry about performance at first. – geckos Nov 20 '19 at 17:19
  • You are reporting by process name. However, there is a possibility that the main Python process is forking additional processes, which inherit all the open FDs, so the 12144 could come from multiple processes. I suggest including the PID in the report: `lsof | grep ... | awk '{print $1, $2}' | sort | uniq -c`. Also, Python may have a large number of '.so' libraries open (and not regular files). – dash-o Nov 20 '19 at 17:35
  • If you get this sorted, please post an answer with what you have discovered. Good luck! – shellter Nov 20 '19 at 17:56
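
A minimal sketch of the two suggestions from the comments above (raising the open-files limit, and reporting/tracing per process); the value 4096 is an arbitrary example, `name.surname` is the user as above, and `my_script.py` is a placeholder for the actual script:

    # Raise the soft limit for the current shell session before launching the script
    # (a non-root user cannot go above the hard limit)
    ulimit -n 4096

    # Make it persistent per user via /etc/security/limits.conf (needs root; example values)
    # name.surname  soft  nofile  4096
    # name.surname  hard  nofile  8192

    # Per-PID report, so forked workers that inherit the FDs show up separately
    lsof 2>/dev/null | grep name.surname | awk '{print $1, $2}' | sort | uniq -c

    # Trace open()/openat() syscalls (including forked children) to see what the script really opens
    strace -f -e trace=open,openat python3 my_script.py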
