import pyarrow as pa
import pyarrow.dataset as dataset
f = 'my_partitioned_big_dataset'
ds = dataset.dataset(f, format='parquet', partitioning='hive')
s = ds.scanner()
pa.dataset.write_dataset(s.head(827981), 'here', format="arrow", partitioning=ds.partitioning)  # is ok
pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)  # fails
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-54-9160d6de8c45> in <module>
----> 1 pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)
...
OSError: [Errno 24] Failed to open local file '...'. Detail: [errno 24] Too many open files

I'm on Linux (Ubuntu). My ulimit seems OK?

$ ulimit -Hn
524288
$ ulimit -Sn
1024
$ cat /proc/sys/fs/file-max
9223372036854775807

ulimit -Ha
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128085
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128085
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128085
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128085
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any ideas on how to work around this? I have a feeling my ulimit is already set quite high, but maybe I could adjust it further. Or does pyarrow have some feature to release open files on the fly?

mathtick

1 Answer


There is no way to control this in the current code. This feature (max_open_files) was recently added to the C++ library, and ARROW-13703 tracks adding it to the Python library. I'm not certain whether it will make the cutoff for 6.0 (which should be releasing quite soon).
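Once ARROW-13703 lands, this will presumably be exposed as a keyword on write_dataset. A rough sketch of what that could look like, assuming the Python binding keeps the C++ option name max_open_files (treat the keyword as hypothetical until it actually ships):

import pyarrow.dataset as dataset

ds = dataset.dataset('my_partitioned_big_dataset', format='parquet', partitioning='hive')
dataset.write_dataset(
    ds.scanner(),
    'here',
    format='arrow',
    partitioning=ds.partitioning,
    max_open_files=512,  # hypothetical keyword: cap on concurrently open output files
)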

In the meantime, your soft limit for open files ((-n) 1024) is the default and is fairly conservative. You should be able to raise it by a few thousand quite safely. See this question for more discussion.
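If editing shell limits is awkward, the soft limit can also be raised from inside the Python process itself using the standard library resource module (Linux/macOS only; the 8192 below is just an illustrative value):

import resource

# e.g. (1024, 524288), matching the ulimit output above
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# raise the soft limit (up to the hard limit, no root needed), keep the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))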

Pace
  • Interesting that the soft limit is showing so low. I added some more ulimit checks above; I think mine are set quite high, but maybe I need to chase down the soft limit there. – mathtick Oct 10 '21 at 08:51
  • So increasing the ulimit allows it to go further, but I still seem to hit the limit at some point. I am guessing max_open_files is a critical patch, or is there some other way of writing files? – mathtick Oct 10 '21 at 09:11
  • The number of files created is going to be based on your partitioning scheme, so you could use a coarser partitioning scheme. Alternatively, you can split your table into multiple tables and then write each one to the same destination. If you do this you will need to use a uuid in the `basename_template` to ensure you don't overwrite data (see https://stackoverflow.com/questions/69184289/pyarrow-overwrites-dataset-when-using-s3-filesystem/69185178#69185178 for an example). – Pace Oct 10 '21 at 19:19
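A rough sketch of the chunked-write approach described in the last comment, using the source dataset's record batches as the chunks (the chunking strategy and paths are placeholders; basename_template is an existing write_dataset parameter):

import uuid
import pyarrow.dataset as dataset

ds = dataset.dataset('my_partitioned_big_dataset', format='parquet', partitioning='hive')
for batch in ds.to_batches():
    dataset.write_dataset(
        batch,
        'here',
        format='arrow',
        partitioning=ds.partitioning,
        # '{i}' is filled in by write_dataset; the uuid keeps each chunk's
        # files from overwriting those of earlier chunks
        basename_template='part-' + uuid.uuid4().hex + '-{i}.arrow',
    )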