
Is there a performant way to sample files from a file system until you hit a target sample size in Python?

For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.

Currently, for small-ish flat directories of ~100k files or so, I can do something like this:

import os
import random
sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)

However, this doesn't scale up well, so I thought I could put the random check inside the loop instead. This sort of works, but it has a problem: if the number of files in the directory is close to sample_size, it may not pick up the full target sample_size. I would then need to keep track of which files were already included in the sample and keep looping until the sample bucket is full.

import os
import random
sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break

Any ideas on how to randomly sample a large selection of files from a large directory structure?

lifebythedrop
  • Why don't you randomly walk dirs and subdirs until you have the sample size you want? E.g. call listdir and choose an entry at random: if it is another dir, repeat listdir and choose at random again; if it is a file, store the path and add the size to the total. – E.Serra Dec 11 '18 at 16:10
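
For illustration, the random-walk idea from the comment above could look roughly like the sketch below. The function name `random_walk_sample` and the `max_attempts` safety cap are hypothetical, not from the thread. Note that this approach does not produce a uniform sample: files in shallow or sparsely populated directories are more likely to be picked than files buried among many siblings.

```python
import os
import random


def random_walk_sample(root, sample_size, max_attempts=100_000):
    """Collect up to sample_size distinct file paths by repeated random descents.

    max_attempts is an illustrative cap so the loop ends even when the tree
    holds fewer than sample_size files.
    """
    chosen = set()
    for _ in range(max_attempts):
        if len(chosen) >= sample_size:
            break
        current = root
        while True:
            with os.scandir(current) as it:
                entries = list(it)
            if not entries:
                break  # empty directory: abandon this descent
            entry = random.choice(entries)
            if entry.is_dir(follow_symlinks=False):
                current = entry.path  # descend one level and pick again
            elif entry.is_file(follow_symlinks=False):
                chosen.add(entry.path)  # the set silently ignores repeats
                break
            else:
                break  # symlink or special file: abandon this descent
    return sorted(chosen)
```

Each descent lists only the directories it passes through, so it never enumerates the whole tree, which is the appeal when a biased sample is acceptable.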

1 Answer


Use iterators/generators so that you don't keep all the files in memory, and use reservoir sampling to pick the samples from what is basically a stream of file names.

Code

from pathlib import Path
import random

pathlist = Path("C:/Users/XXX/Documents").glob('**/*.py')
nof_samples = 10

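rc = []  # the reservoir: holds at most nof_samples paths
for k, path in enumerate(pathlist):
    if k < nof_samples:
        # fill the reservoir with the first nof_samples paths
        rc.append(str(path))  # str() because path is a Path object, not a string
    else:
        # keep the new path with probability nof_samples / (k + 1)
        i = random.randint(0, k)  # randint is inclusive on both ends
        if i < nof_samples:
            rc[i] = str(path)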

print(len(rc))
print(rc)
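
The question asks about an arbitrarily nested tree and uses `os.scandir`, so here is a minimal sketch of the same reservoir loop fed by a recursive `os.scandir` walk instead of `Path.glob`. The helper names `iter_files` and `reservoir_sample` are my own, and skipping symlinks is an assumption made to avoid cycles; adjust as needed.

```python
import os
import random


def iter_files(root):
    """Yield the path of every regular file under root, one at a time."""
    stack = [root]  # directories still to be scanned
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # Assumption: symlinks are not followed, to avoid cycles.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path


def reservoir_sample(items, k):
    """Return k items chosen uniformly from an iterable of unknown length.

    If the iterable yields fewer than k items, all of them are returned.
    """
    reservoir = []
    for n, item in enumerate(items):
        if n < k:
            reservoir.append(item)  # fill phase
        else:
            j = random.randint(0, n)  # inclusive bounds
            if j < k:
                reservoir[j] = item  # replace a random slot
    return reservoir
```

Usage would be something like `sample = reservoir_sample(iter_files(path), 20_000)`. Memory stays proportional to the sample size plus the stack of pending directories, but every file is still visited once, so the run time is bounded by how fast the file system can be listed.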
Severin Pappadeux