
I'm running a script that replaces German umlauts in file names. There are over 1700 files I need to do this for, but after the script runs for a while I get an error saying there are too many open files. Anyone have any ideas how to fix this? Feedback is greatly appreciated!

Code:

# -*- coding: utf-8 -*-

''' Script replaces all umlauts in filenames within a root directory and its subdirectories with the English
    equivalent (i.e. ä replaced with ae, Ä replaced with Ae).'''

import os
import itertools
import logging

##workspace = u'G:\\Dvkoord\\GIS\\TEMP\\Tle\\Scripts\\Umlaut'
workspace = u'G:\\Gis\\DATEN'
log = 'Umlauts.log'
logPath = r"G:\Dvkoord\GIS\TEMP\Tle\Scripts\Umlaut\Umlauts.log"
logMessageFormat = '%(asctime)s - %(levelname)s - %(message)s'


def GetFilepaths(directory):
    """Function returns a list of file paths in a directory tree using os.walk.  Parameter: directory
    """
    file_paths = []
    for root, directories, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)
##    file_paths = list(set(file_paths))
    return file_paths

def uniq(items):
    """Return a copy of items with duplicates removed, preserving order."""
    output = []
    for x in items:
        if x not in output:
            output.append(x)
    return output

def Logging(logFile, logLevel, destination, textFormat, comment):
    """Function writes a log file.  Parameters: logFile (name the log file w/extension),
        logLevel (DEBUG, INFO, etc.), destination (path under which the log file will be
        saved including name and extension), textFormat (how the log text will be formatted)
        and comment.
    """
    # logging
    logger = logging.getLogger(__name__)
    # set log level
    logger.setLevel(logLevel)
    # create a file handler for the log -- unless a separate path is specified, it will output to the directory where this script is stored
    logging.FileHandler(logFile)
    handler = logging.FileHandler(destination)
    handler.setLevel(logLevel)
    # create a logging format
    formatter = logging.Formatter(textFormat)
    handler.setFormatter(formatter)
    # add the handlers to the logger
    logger.addHandler(handler)
    logger.info(comment)


def main():
    # dictionary of umlaut unicode representations (keys) and their replacements (values)
    umlautDictionary = {
                        u'Ä': 'Ae',
                        u'Ö': 'Oe',
                        u'Ü': 'Ue',
                        u'ä': 'ae',
                        u'ö': 'oe',
                        u'ü': 'ue',
                        u'ß': 'ss'
                        }
    dataTypes = [".CPG",
                 ".dbf",
                 ".prj",
                 ".sbn",
                 ".sbx",
                 ".shp",
                 ".shx",
                 ".shp.xml",
                 ".lyr"]
    # get file paths in root directory and subfolders
    filePathsList = GetFilepaths(workspace)
    # put all filepaths with an umlaut in filePathsUmlaut list
    filePathsUmlaut = []
    for fileName in filePathsList:
##        print fileName
        for umlaut in umlautDictionary:
            if umlaut in os.path.basename(fileName):
                for dataType in dataTypes:
                    if dataType in fileName:
##                        print fileName
                        filePathsUmlaut.append(fileName)
    # remove duplicate paths from filePathsUmlaut
    uniquesUmlauts = uniq(filePathsUmlaut)

    # create a dictionary for umlaut translation
    umap = {
            ord(key):unicode(val)
            for key, val in umlautDictionary.items()
            }
    # use translate and umap dictionary to replace umlauts in file name and put them in the newFilePaths list
    # without changing any of the umlauts in folder names or upper directories
    newFilePaths = []
    for fileName in uniquesUmlauts:
        pardir = os.path.dirname(fileName)
        baseName = os.path.basename(fileName)
        newBaseFileName = baseName.translate(umap)
        newPath = os.path.join(pardir, newBaseFileName)
        newFilePaths.append(newPath)
    newFilePaths = uniq(newFilePaths)

    # create a dictionary with the old umlaut path as key and new non-umlaut path as value
    dictionaryOldNew = dict(itertools.izip(uniquesUmlauts, newFilePaths))
    # rename old file (key) as new file (value)
    for files in uniquesUmlauts:
        for key, value in dictionaryOldNew.iteritems():

            if key == files:
                comment = '%s was renamed to %s.' % (files, value)
                print comment
                if os.path.exists(value):
                    os.remove(value)
                os.rename(files, value)
                Logging(log, logging.INFO, logPath, logMessageFormat, comment)


if __name__ == '__main__':
    main()
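For reference, under Python 3 the `umap` construction above can be written with `str.maketrans`, which accepts the replacement dictionary directly, so the `ord()`/`unicode()` comprehension is unnecessary; a minimal sketch:

```python
# Python 3 sketch of the translation step: str.maketrans builds the
# ordinal-to-replacement table that str.translate() expects.
umlautDictionary = {
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'ß': 'ss',
}
umap = str.maketrans(umlautDictionary)

print('Gewässer.shp'.translate(umap))  # -> Gewaesser.shp
```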
Crazy Otto

1 Answer


I think the problem is your `Logging` function. Every time you log, you create a brand-new `FileHandler` and add it to the logger's set of handlers, and you do this once per renamed file, so you rapidly hit the limit on open file descriptors. Configure your logger once, then use it many times; don't configure it every time you use it.

Note that the exception might not be raised in Logging; deleting a file on Windows involves opening it for delete, so you could max out open files with loggers, then fail when you try to delete a file.
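A minimal sketch of that one-time setup, using the question's format string (the log path is shortened to a relative one here; the script uses an absolute `logPath`):

```python
import logging

logMessageFormat = '%(asctime)s - %(levelname)s - %(message)s'

# Configure the logger ONCE, at module level -- one handler, one open file.
logger = logging.getLogger('umlauts')
logger.setLevel(logging.INFO)
handler = logging.FileHandler('Umlauts.log')
handler.setFormatter(logging.Formatter(logMessageFormat))
logger.addHandler(handler)

# Inside the rename loop, just log -- no new handlers are created:
logger.info('example.shp was renamed to example_ae.shp')
```

With this in place, the `Logging` function (and its per-call `FileHandler`) can be deleted entirely and each call replaced by a plain `logger.info(comment)`.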

ShadowRanger
  • ah -- that makes sense, thanks. i'll give it a spin without the logging and see what happens. – Crazy Otto Apr 07 '16 at 11:16
  • so what would be the best way (based on my code) to configure the logger once, then use it many times, as opposed to what I'm doing now? – Crazy Otto Apr 07 '16 at 11:37
  • @CrazyOtto: Just create and configure `logger` at the top level as a global, not in a function, and replace calls to `Logging` with just `logger.info(comment)`. – ShadowRanger Apr 07 '16 at 13:01
  • 2
    Deleting a file requires opening a kernel File handle. A Windows process can open over 16 million kernel handles. The limit here is just on the number of CRT lowio file descriptors and stdio `FILE` streams. The former varies from 2048 file descriptors in VS 2008 (Python 2.7) up to 8192 in VS 2015 (Python 3.5). The number of stdio `FILE` streams only affects Python 2.x. It's initially limited to 512 and can be increased up to 2048 by calling [`_setmaxstdio`](https://msdn.microsoft.com/en-us/library/6e3b887c%28v=vs.90%29). – Eryk Sun Apr 07 '16 at 18:19
  • @eryksun: Good to know, thanks. I didn't know it was a purely `stdio` issue (but knowing that, I knew that Python 3 bypasses `stdio` and uses OS I/O functions directly, so it makes sense the problem would be Py2 only). – ShadowRanger Apr 07 '16 at 19:45
  • Python 3 doesn't use Windows I/O functions. It uses the C runtime [lowio](https://msdn.microsoft.com/en-us/library/40bbyw78) functions (e.g. `_wopen`, `_read`, `_write`) that provide basic POSIX compatibility by wrapping Windows API functions (e.g. `CreateFile`, `ReadFile`, `WriteFile`) and mapping Windows handles to POSIX file descriptors. There's a hard limit of 8192 open file descriptors in the VS 2015 CRT. An open Python issue exists that proposes modifying the raw I/O layer (i.e. `io.FileIO`) to directly use the Windows API, but it hasn't gained much traction. – Eryk Sun Apr 07 '16 at 20:13