
I have a Python script that needs to process a large number of files. To get around Linux's relatively small limit on the number of arguments that can be passed to a command, I am using find -print0 with xargs -0.

I know another option would be to use Python's glob module, but that won't help when I have a more advanced find command, looking for modification times, etc.

When running my script on a large number of files, Python only accepts a subset of the arguments, a limitation I first thought was in argparse, but appears to be in sys.argv. I can't find any documentation on this. Is it a bug?

Here's a sample Python script illustrating the point:

import argparse
import sys
import os

parser = argparse.ArgumentParser()
parser.add_argument('input_files', nargs='+')
args = parser.parse_args(sys.argv[1:])

print 'pid:', os.getpid(), 'argv files', len(sys.argv[1:]), 'argparse files:', len(args.input_files)

I have a lot of files to run this on:

$ find ~/ -name "*" -print0 | xargs -0 ls > filelist
$ wc -l filelist
748709 filelist

But it appears xargs or Python is chunking my big list of files and processing it with several different Python runs:

$ find ~/ -name "*" -print0 | xargs -0 python test.py
pid: 4216 argv files 1819 number of files: 1819
pid: 4217 argv files 1845 number of files: 1845
pid: 4218 argv files 1845 number of files: 1845
pid: 4219 argv files 1845 number of files: 1845
pid: 4220 argv files 1845 number of files: 1845
pid: 4221 argv files 1845 number of files: 1845
...

Why are multiple processes being created to process the list? Why is it being chunked at all? I don't think there are newlines in the file names, and shouldn't -print0 and -0 take care of that issue anyway? If there were newlines, I'd expect sed -n '1810,1830p' filelist to show some weirdness for the above example. What gives?

I almost forgot:

$ python -V
Python 2.7.2+
Jake Biesinger
  • Weird problem. As another option you could of course just parse `filelist` inside your script. – Benjamin Bannier Feb 01 '12 at 20:44
  • possible duplicate of [python sys.argv limitations?](http://stackoverflow.com/questions/5533704/python-sys-argv-limitations) – jcollado Feb 01 '12 at 20:49
  • That's what xargs does. It still needs to call Python via the shell, so it would have the same limitation on the arguments. Why not make your Python program take the `~/` and `-name *` parameters directly? – John La Rooy Feb 01 '12 at 20:49
  • I had thought xargs somehow magically got around the limited argument space problem. Turns out it just forks off separate processes with smaller chunks. It also turns out this behavior makes no difference in every application I've used xargs for, save this one... – Jake Biesinger Feb 02 '12 at 01:12

5 Answers


xargs will chunk your arguments by default. Have a look at the --max-args and --max-chars options of xargs. Its man page also explains the limits (under --max-chars).
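
For a rough sense of the ceiling xargs is working within, you can query the kernel's limit from Python itself; a minimal sketch, assuming a POSIX system where `os.sysconf` exposes `SC_ARG_MAX`:

import os

# SC_ARG_MAX is the OS cap on the combined size of argv plus the environment
# for a single exec call; xargs keeps each chunk below (roughly) this value.
print('ARG_MAX: %d bytes' % os.sysconf('SC_ARG_MAX'))

On many Linux systems this prints a value around 2 MB, which is why a list of 748,709 file names has to be split across many invocations.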

Lars Kotthoff
  • Thanks. I hadn't seen this before. Any idea why the above command `find ~/ -name "*" -print0 | xargs -0 ls > filelist` actually works? It seems `ls` will be called several times, all writing (and not appending!) to the same file. Perhaps the file is opened only once and it's xargs' output we're capturing? – Jake Biesinger Feb 01 '12 at 23:02
  • The shell takes care of the redirection. `ls` actually outputs to `stdout`. Think of it as everything on the line inside parentheses and the redirection outside. – Lars Kotthoff Feb 02 '12 at 08:47

Python does not seem to place a limit on the number of arguments, but the operating system does.

Have a look here for a more comprehensive discussion.
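
To see that the limit really is imposed by the OS rather than by Python, you can hand an oversized argument list to a trivial command and watch the exec call itself fail; a minimal sketch (the sizes and the `/bin/true` target are only for illustration):

import subprocess

# Build an argument list far larger than a typical ARG_MAX (~2 MB on Linux).
huge_args = ['/bin/true'] + ['x' * 1000] * 10000   # roughly 10 MB of arguments

try:
    subprocess.check_call(huge_args)
except OSError as exc:
    # The kernel rejects the exec with E2BIG: "Argument list too long".
    print('exec failed: %s' % exc)

There is nothing to configure on the Python side; the failure happens in execve before the new process even starts.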

Till Hoffmann

Everything that you want from find is available from os.walk.

Don't use find and the shell for any of this.

Use os.walk and write all your rules and filters in Python.

"looking for modification times" means that you'll be using os.stat or some similar library function.

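A minimal sketch of that approach, assuming the goal is to collect files under a directory tree that were modified in the last day (the starting directory and the 24-hour cutoff are just placeholders):

import os
import time

def recent_files(top, max_age_seconds=24 * 60 * 60):
    # Yield paths under `top` whose modification time is within the cutoff,
    # the same kind of filter that find's -mtime would apply.
    cutoff = time.time() - max_age_seconds
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime >= cutoff:
                    yield path
            except OSError:
                # File vanished or is unreadable; skip it, as find would.
                continue

for path in recent_files(os.path.expanduser('~')):
    print(path)

Because the paths never travel through a command line, there is no argument-length limit to worry about.
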
S.Lott
  • I agree in principle that doing all this from within python is the way to go, using os.walk, glob.glob, and os.stat. What I didn't know was that xargs still obeys the OS limit and just makes multiple calls to the command with the remaining arguments. – Jake Biesinger Feb 01 '12 at 23:00
  • "agree in principle" means "disagree". Yet, you provide no reasons. Here are the reasons why your decision is ill-advised. All Python is faster, simpler and more flexible. Simpler because it's one language: Python. Faster because it all runs in a single process (without swapping). If you want more speed, use `multiprocessing`. Finally, it's more flexible because you're not constrained by the weird limitations of `find`. There is no downside to simplifying your application. – S.Lott Feb 01 '12 at 23:21
  • The application **normally** doesn't need to process thousands of files. The work is usually done on a few dozen files (chromosomes); in a particular benchmark I'm testing, I have thousands of files. Perhaps my OP didn't make that clear. – Jake Biesinger Feb 02 '12 at 01:06
  • @JakeBiesinger: Perhaps my comment didn't make it clear that `find` remains a bad idea if you're processing one file or millions. Rather than repeat the reasons, I'll conclude by saying that there's no downside to replacing a `find`-based shell script with Python. No downside. Numerous advantages. – S.Lott Feb 02 '12 at 01:32
  • Of course there are downsides: you're complicating your script with extra logic. Instead of doing whatever operation it was doing on files, it now also has to deal with how to walk the file system. `find` is incredibly powerful and already does this exceedingly well. – CervEd Aug 06 '21 at 10:44

xargs will pass as much as it can, but there's still a limit. For instance,

find ~/ -name "*" -print0 | xargs -0 wc -l | grep total

will give you multiple lines of output.

You probably want to have your script either take a file containing a list of filenames, or accept filenames on its stdin.
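
A minimal sketch of the file-list variant, assuming a newline-delimited list passed via a hypothetical --file-list option (falling back to stdin):

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--file-list', type=argparse.FileType('r'), default=sys.stdin,
                    help='file with one path per line (defaults to stdin)')
args = parser.parse_args()

# Read the paths from the list file rather than from argv, so the command
# line stays tiny no matter how many files there are.
input_files = [line.rstrip('\n') for line in args.file_list if line.strip()]
print('number of files: %d' % len(input_files))

It could then be run as python test.py --file-list filelist, or fed the list on stdin through a pipe.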

retracile

The problem is that xargs is limited by the number of characters it can put on a single command line (maximum 2091281).

A quick test showed this works out to somewhere between 5,000 and 55,000 files per invocation, depending on the length of the paths.

The solution for handling more is to accept the file paths on standard input instead.

find ... -print0 | script.py

#!/usr/bin/env python3

import sys

# find -print0 terminates each path with a NUL byte; strip the trailing NUL
# before splitting so there is no spurious empty entry at the end.
files = sys.stdin.read().rstrip('\0').split('\0')
...

CervEd