-2

I am on Windows and want to run my multi-threaded Python app, which saves data to .csv files asynchronously. As reported here, here and here, I am getting the following error at some point:

<type 'exceptions.IOError'> 
Traceback (most recent call last):
  File "results_path", line 422, in function
    df_results.to_csv(results_file)
IOError: [Errno 24] Too many open files

This proposes a fix that uses a with-statement for every file I/O operation:

with open(results_path, 'a') as results_file:
    df_results.to_csv(results_file)

However, I am still getting the IOError described above (in a nutshell, none of the SO questions solved my issue). The with-statement therefore apparently does not properly close the .csv file after the append operation.

First, I now increase the maximum number of open files. Admittedly, this just delays the crash:

import win32file
max_open_files = 2048     # Windows-specific threshold for max. open file count
win32file._setmaxstdio(max_open_files)

Second, my temporary workaround is (A) to check the number of open .csv files periodically, and (B) to forcefully restart the whole script if the open file count gets anywhere near the threshold allowed on Windows:

from psutil import Process
import os, sys

proc = Process()
open_file_count = 0                                         # Count of open .csv files
for open_file in proc.open_files():                         # Iterate over this process's open files
    if ".csv" in str(open_file):                            # Is it a .csv file?
        open_file_count += 1                                # Count it

if open_file_count > (max_open_files / 2):                  # Threshold, see above
    os.execl(sys.executable, sys.executable, *sys.argv)     # Force restart

This approach is ugly and inefficient in many ways (it loops through all open files in every iteration/thread). At the very least, it needs to work without forcefully restarting the whole script.

Q1: How do I properly close .csv files with Python on Windows?

Q2: If closing fails after an I/O operation, how can I forcefully close all open .csv files at once?

sudonym
  • It would be good to see a minimal script that actually reproduces your error. –  Mar 08 '18 at 03:00
  • You claim the with-statement doesn't seem to close the files properly. Is there a chance you actually simply have too many files open at once? Without an example script, it's hard to tell what's going on. But with async threads, and if each thread takes a while, could you have several hundred threads going at once, and thus several hundred files open at once? –  Mar 08 '18 at 03:03
  • Yes, but not anywhere near 2048 – sudonym Mar 08 '18 at 03:29
  • We don't know; we just have to take your word for it. Two of the answers below suggest similar reasoning: there are just too many threads keeping files open. –  Mar 08 '18 at 03:31
  • Possibly something happens in `to_csv` that duplicates and leaks a file descriptor. But this is pointless speculation without a minimal, complete, and verifiable example. – Eryk Sun Mar 08 '18 at 03:32

3 Answers

0

Those answers are correct. The with statement is the correct and Pythonic way to open and automatically close files. It works and is well tested. I suspect, however, that it's the multiprocessing or threading that's throwing a spanner in the works.

In particular, how many of your threads or processes are writing to your CSV? If more than one, then I'm confident that's the issue. Instead, have a single writer, and pass what needs to be written to that writing thread or process via a multiprocessing.Queue or a regular (thread-safe) queue. In effect, a funnel: every process that wants to add data to the CSV puts the data into the queue, and the writing process takes each item off the queue and writes it to the file.
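
For illustration, here is a minimal sketch of that funnel pattern using a plain queue.Queue and a single writer thread; the file name, the sentinel, and the shape of the data are hypothetical:

import queue
import threading
import pandas as pd

results_queue = queue.Queue()              # Thread-safe funnel for results
STOP = object()                            # Sentinel telling the writer to finish

def writer(results_path):
    # Single thread that owns the CSV file; no other thread ever opens it
    with open(results_path, 'a', newline='') as results_file:
        while True:
            item = results_queue.get()
            if item is STOP:
                break
            item.to_csv(results_file, header=False, index=False)

def worker(i):
    df_results = pd.DataFrame({"value": [i]})   # Whatever the worker computes
    results_queue.put(df_results)               # Hand it to the writer instead of writing

writer_thread = threading.Thread(target=writer, args=("results.csv",))
writer_thread.start()

workers = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in workers:
    t.start()
for t in workers:
    t.join()

results_queue.put(STOP)                    # No more rows are coming
writer_thread.join()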

Given the lack of a working example in the question, I'll simply leave a pointer to Python's documentation on multiprocess communication.

hunteke
0

Use ThreadPoolExecutor from concurrent.futures (https://docs.python.org/3/library/concurrent.futures.html) so that the number of threads running at any one time, and hence the number of files they hold open, stays below the maximum number of file descriptors.

The with statement is the best way to handle closing files, even when exceptions happen, so that you don't forget.
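
A minimal sketch of that idea; the max_workers value, the file names, and the DataFrame contents here are made up:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def save_result(task_id):
    # Each task holds at most one file open, and the with-statement closes it again
    df_results = pd.DataFrame({"task": [task_id]})
    with open("results_%d.csv" % task_id, "a", newline="") as results_file:
        df_results.to_csv(results_file, index=False)

# At most 32 tasks run at any one time, so at most 32 .csv files are open,
# far below the 2048 handle limit mentioned in the question
with ThreadPoolExecutor(max_workers=32) as executor:
    list(executor.map(save_result, range(1000)))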

Totoro
-2

Just open and close the file normally, without "with":

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: fw = open("test2.txt","a")

In [4]: df.to_csv(fw)

In [5]: fw.close()

In [6]: !ls
test2.txt
Windyground