
I am looking for a very rapid way to check whether a folder contains more than 2 files.

I worry that `len(os.listdir('/path/')) > 2` may become very slow if there are a lot of files in /path/, especially since this function will be called frequently by multiple processes at a time.

bluppfisk
  • time it, try it. I don't think it will be slow... – Roman Pavelka Jul 04 '23 at 14:35
  • Are you sure your folder contains only files? Because your current code will also count subdirectories. – Viper Jul 04 '23 at 14:39
  • Note that `len` of a list is O(1) - the list already "knows" its length, it doesn't have to be traversed to count the items: https://stackoverflow.com/questions/1115313/cost-of-len-function – slothrop Jul 04 '23 at 14:39
  • @Viper: yes only files! – bluppfisk Jul 04 '23 at 14:39
  • @slothrop, I know, it's not `len()` that will slow things down, it's the populating of `os.listdir()` that I think will slow things down. – bluppfisk Jul 04 '23 at 14:41
  • I believe `pathlib.Path('/path/').glob('*')` is a generator, so you can loop over the results and exit the loop when you get the third file, without waiting for all the other files to be returned (see the sketch below). – John Gordon Jul 04 '23 at 14:51
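
A minimal sketch of that early-exit idea from the comments (the function name and the `/tmp` example path are just for illustration):

from itertools import islice
from pathlib import Path

def has_more_than_two_files(path):
    # lazily yield regular files only; glob('*') is evaluated on demand
    files = (p for p in Path(path).glob('*') if p.is_file())
    # islice stops pulling from the generator after three hits
    return len(list(islice(files, 3))) == 3

print(has_more_than_two_files('/tmp'))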

6 Answers


There is indeed another function, introduced by PEP 471: os.scandir(path)

Since it returns an iterator, no list is created, and even the worst-case scenario (a huge directory) stays lightweight.

Its higher-level interface os.walk(path) is built on top of scandir() and lets you walk a directory tree lazily, though note that it still collects each directory's entries in full before yielding them.
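
For completeness, a quick sketch of the os.walk() route (using the OP's placeholder path); because it builds the full file list for the top directory, the scandir() example below remains preferable for an early exit:

import os

# os.walk() yields (dirpath, dirnames, filenames) tuples, one per directory;
# the first tuple describes the top-level directory itself
_, _, filenames = next(os.walk('/path/'))
enough_files = len(filenames) > 2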

Here is a code example for your specific case:

import os

MINIMUM_SIZE = 3  # "more than 2 files" means at least 3

file_count = 0
for entry in os.scandir('.'):
    if entry.is_file():
        file_count += 1
    if file_count == MINIMUM_SIZE:
        break  # stop scanning as soon as we have seen enough files

enough_files = (file_count == MINIMUM_SIZE)
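
Since the check will be called frequently, it may be convenient as a reusable function; a small sketch (the name `has_at_least` is just for illustration):

import os

def has_at_least(path, n):
    # return True as soon as the directory holds at least n files
    count = 0
    for entry in os.scandir(path):
        if entry.is_file():
            count += 1
            if count == n:
                return True  # early exit: no need to scan further
    return False

enough_files = has_at_least('/path/', 3)  # "more than 2 files"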
PoneyUHC

To get the fastest solution, you probably need something hacky.

My guess was:


from pathlib import Path

def iterdir_approach(path):
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False

We create a generator and try to pull three files from it, catching the StopIteration that is raised if there are fewer.

To profile the approaches, we create a bunch of directories with a bunch of files in them:

import itertools
import os
import random
import shutil
import tempfile
import timeit

import matplotlib.pyplot as plt
from pathlib import Path


def create_temp_directory(num_directories):
    temp_dir = tempfile.mkdtemp()
    for i in range(num_directories):
        dir_path = os.path.join(temp_dir, f"subdir_{i}")
        os.makedirs(dir_path)
        for j in range(random.randint(0,i)):
            file_path = os.path.join(dir_path, f"file_{j}.txt")
            with open(file_path, 'w') as file:
                file.write("Sample content")
    return temp_dir

We define the various approaches (the other ones are copied from the answers to this question):


def iterdir_approach(path):
    #@swozny
    iter_of_files = (x for x in Path(path).iterdir() if x.is_file())
    try:
        next(iter_of_files)
        next(iter_of_files)
        next(iter_of_files)
        return True
    except StopIteration:
        return False

def len_os_dir_approach(path):
    #@bluppfisk
    return len(os.listdir(path)) > 2


def check_files_os_scandir_approach(path):
    #@PoneyUHC
    MINIMUM_SIZE = 3
    file_count = 0
    for entry in os.scandir(path):
        if entry.is_file():
            file_count += 1
        if file_count == MINIMUM_SIZE:
            return True
    return False


def path_resolve_approach(path):
    #@matleg
    directory_path = Path(path).resolve()
    nb_files = 0
    enough_files = False
    for file_path in directory_path.glob("*"):
        if file_path.is_file():
            nb_files += 1
        if nb_files > 2:
            return True
    return False

def dilettant_approach(path):
    #@dilettant
    gen = os.scandir(path)  # OP states only files in folder /path/
    enough = 3  # "more than 2 files" means at least 3

    has_enough = len(list(itertools.islice(gen, enough))) >= enough

    return has_enough


def adrian_ang_approach(path):
    #@adrian_ang
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False

Then we profile the code using timeit.timeit and plot the execution times for various numbers of directories:


num_directories_list = [10, 50, 100, 200, 500, 1000]
approach1_times = []
approach2_times = []
approach3_times = []
approach4_times = []
approach5_times = []
approach6_times = []

for num_directories in num_directories_list:
    temp_dir = create_temp_directory(num_directories)
    subdir_paths = [str(p) for p in Path(temp_dir).iterdir()]

    approach1_time = timeit.timeit(lambda: [iterdir_approach(path) for path in subdir_paths], number=5)
    approach2_time = timeit.timeit(lambda: [check_files_os_scandir_approach(path) for path in subdir_paths], number=5)
    approach3_time = timeit.timeit(lambda: [path_resolve_approach(path) for path in subdir_paths], number=5)
    approach4_time = timeit.timeit(lambda: [len_os_dir_approach(path) for path in subdir_paths], number=5)
    approach5_time = timeit.timeit(lambda: [dilettant_approach(path) for path in subdir_paths], number=5)
    approach6_time = timeit.timeit(lambda: [adrian_ang_approach(path) for path in subdir_paths], number=5)

    approach1_times.append(approach1_time)
    approach2_times.append(approach2_time)
    approach3_times.append(approach3_time)
    approach4_times.append(approach4_time)
    approach5_times.append(approach5_time)
    approach6_times.append(approach6_time)

    shutil.rmtree(temp_dir)

Visualization of the results:


plt.plot(num_directories_list, approach1_times, label='iterdir_approach')
plt.plot(num_directories_list, approach2_times, label='check_files_os_scandir_approach')
plt.plot(num_directories_list, approach3_times, label='path_resolve_approach')
plt.plot(num_directories_list, approach4_times, label='len_os_dir_approach')
plt.plot(num_directories_list, approach5_times, label='dilettant_approach')
plt.plot(num_directories_list, approach6_times, label='adrian_ang_approach')


plt.xlabel('Number of Directories')
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Comparison')
plt.legend()
plt.show()

[Plot: execution time vs. number of directories for all six approaches]

Closeup of the best 3 solutions: [Plot: closeup of the three fastest approaches]

Sebastian Wozny
  • Nice quick job! :D – PoneyUHC Jul 04 '23 at 15:24
  • I think in `dilettant_approach`, `enough` should be 3 for comparability with the other methods? – slothrop Jul 04 '23 at 15:51
  • Fixed, thanks! I thought it was fine with `>` but that wasn't enough – Sebastian Wozny Jul 04 '23 at 15:55
  • Hm, I find the dependency on the number of directories irritating, as the question was about a single folder containing only files (as clarified in a comment). I expect a more or less constant execution time (as observed in my sloppy benchmarks), not proportionality to the number of files/dirs. – Dilettant Jul 04 '23 at 16:12
  • The code generates a directory full of directories that only contain files to check. The number of files in each directory is randomized, but we vary the number of directories to observe scaling behaviour. – Sebastian Wozny Jul 04 '23 at 16:13
  • @SebastianWozny Sure, I did read that in your driver/setup code, but the question was about a simpler use case, where O(1) (well, maybe O(m) with m = "enough") would be the complexity. And these queries would be run from different processes. Anyway, I think the approaches are all fast enough (as far as that can be read from the original requirements). – Dilettant Jul 04 '23 at 16:16
  • I see what you mean, maybe I went a little too fast. Ultimately the comparison is still valid, as it gives a somewhat even spread of how many files there are in each directory? – Sebastian Wozny Jul 04 '23 at 16:18
  • I added another approach using a C extension, which is even faster. There's probably room for optimizing that C code too. – bluppfisk Jul 04 '23 at 17:41
  • Very cool! I find it super cool that we go to these lengths to answer some questions :D – Sebastian Wozny Jul 04 '23 at 20:40

For anyone wanting to try the C approach, here's a module you can import from Python. Note that it counts every non-hidden directory entry without checking whether it is a regular file, which is fine here since the OP's folder contains only files.

#define PY_SSIZE_T_CLEAN
#include <stdio.h>
#include <dirent.h>
#include <stdlib.h>
#include <Python.h>

static PyObject *
method_dircnt(PyObject *self, PyObject *args)
{
    DIR *dir;
    const char *dirname;
    long min_count, count = 0;
    struct dirent *ent;

    if (!PyArg_ParseTuple(args, "sl", &dirname, &min_count))
    {
        return NULL;
    }

    dir = opendir(dirname);
    if (dir == NULL) {
        /* propagate errno (e.g. ENOENT) as a Python OSError */
        return PyErr_SetFromErrnoWithFilename(PyExc_OSError, dirname);
    }

    while ((ent = readdir(dir))) {
        if (ent->d_name[0] != '.') {  /* skip ".", ".." and hidden entries */
            ++count;
            if (count >= min_count) {
                closedir(dir);
                Py_RETURN_FALSE;
            }
        }
    }

    closedir(dir);

    Py_RETURN_TRUE;
}

static char dircnt_docs[] = "dircnt(dir, min_count): returns False once dir contains at least min_count non-hidden entries, True otherwise.\n";

static PyMethodDef dircnt_methods[] = {
    {"dircnt", (PyCFunction)method_dircnt, METH_VARARGS, dircnt_docs},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef dircnt_module_def = 
{
    PyModuleDef_HEAD_INIT,
    "dircnt",
    "Check if there are more than N files in dir",
    -1,
    dircnt_methods
};

PyMODINIT_FUNC PyInit_dircnt(void){
    return PyModule_Create(&dircnt_module_def);
}

build:

gcc -I /usr/include/python3.11 dircnt.c -v -shared -fPIC -o dircnt.so (or wherever your headers from the python-dev package are)
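
Alternatively, a minimal setuptools build script should also work (a sketch, assuming the source file is saved as dircnt.c):

# setup.py -- build in place with: python3 setup.py build_ext --inplace
from setuptools import Extension, setup

setup(
    name="dircnt",
    ext_modules=[Extension("dircnt", sources=["dircnt.c"])],
)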

usage:

from dircnt import dircnt
dircnt(path, min_count)
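
Note the inverted return value: dircnt returns False once the directory holds at least min_count non-hidden entries. To answer the original question ("more than 2 files"), negate it, e.g.:

from dircnt import dircnt

# dircnt('/path/', 3) is False once /path/ holds at least 3 entries,
# so the negation means "contains more than 2 files"
has_more_than_two = not dircnt('/path/', 3)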

It is a fair bit faster, especially for higher min_count values:

[Benchmark plot with min_count = 2]

[Benchmark plot with min_count = 200]

bluppfisk

If you want something more explicit using pathlib, you can try:

from pathlib import Path

directory_path = Path('/path/').resolve()
nb_files = 0
enough_files = False
for file_path in directory_path.glob("*"):
    if file_path.is_file():
        nb_files += 1
    if nb_files > 2:  # "more than 2 files" means at least 3
        enough_files = True
        break
print(enough_files)

matleg

As the OP knows there are only files within /path/, one optimization is to not test the file attributes of each entry.

This version profits from that prior knowledge / constraint:

import itertools
import os

gen = os.scandir('/path/')  # OP states only files in folder /path/
enough = 3  # "more than 2 files" means at least 3

# Build an iterator that only returns the first enough elements
# measure the length of the resulting list (at most enough elements)
# and apply the criterion to get the boolean result
has_enough = len(list(itertools.islice(gen, enough))) >= enough

print(has_enough)

Placing this in a shell script and using hyperfine to measure some rough performance (folder with 500+ files):

❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
  Time (mean ± σ):      77.6 ms ±   0.6 ms    [User: 29.9 ms, System: 31.8 ms]
  Range (min … max):    76.3 ms …  79.4 ms    36 runs

... and, as it should not really matter, the same system on a folder with more than 100k files:

❯ ls -l |wc -l
  100204

~
❯ hyperfine ./ssssss.sh
Benchmark #1: ./ssssss.sh
  Time (mean ± σ):      79.6 ms ±   1.1 ms    [User: 31.9 ms, System: 33.5 ms]
  Range (min … max):    76.8 ms …  82.1 ms    35 runs
Dilettant

You can use the os.scandir function instead. For example, to check whether a folder contains more than 2 files, it stops iterating over directory entries as soon as it has counted 3 files, returning True:

import os

def has_more_than_two_files(path):
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
    return False
Adrian Ang
  • what makes the context manager faster? – bluppfisk Jul 05 '23 at 06:23
  • From the [Python Documentation of os.scandir](https://docs.python.org/3/library/os.html#os.scandir), it returns an iterator of `os.DirEntry` objects rather than a list of file names, so no need to load the entire list of file names into memory – Adrian Ang Jul 05 '23 at 11:44
  • yes but why the context manager? – bluppfisk Jul 05 '23 at 16:44
  • Apologies, I misunderstood: the context manager (`with` statement) releases the resources when the iteration is halted on the third count, without needing to complete the iteration – Adrian Ang Jul 05 '23 at 23:50
  • hm I understand what a context manager does; but why would it be faster. I mean why don't you just `for entry in os.scandir(path)`? – bluppfisk Jul 06 '23 at 01:00
  • Both ways work to achieve the goal. But I think using the context manager ensures that the resources are released for other processes when iteration is halted, i.e. no assigned-but-unconsumed resources are wasted. Imagine there being something like an `f.close()` to free the unused handle; see the sketch below. This seems to be borne out in the time measurement charts – Adrian Ang Jul 06 '23 at 02:49
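
For what it's worth, the `with` block is about deterministic cleanup rather than raw speed: the scandir iterator keeps a directory handle open, and the context manager closes it as soon as the block is left. A rough sketch of the equivalent explicit form (same logic as the answer above):

import os

def has_more_than_two_files(path):
    entries = os.scandir(path)
    try:
        count = 0
        for entry in entries:
            if entry.is_file():
                count += 1
                if count > 2:
                    return True
        return False
    finally:
        entries.close()  # what the `with` statement does implicitly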