Is there a performant way to sample files from a file system until you hit a target sample size in Python?
For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.
Currently, for small-ish flat directories of ~100k files, I can do something like this:
import os
import random

sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)
However, this doesn't scale up well, so I thought maybe I could put the random check inside the loop. This sort of works, but it has a problem: if the number of files in the directory is close to sample_size, it may not pick up the full target sample_size, and I would need to keep track of which files were already included in the sample and then keep looping until I fill up the sample bucket (a rough sketch of that bookkeeping variant follows the snippet below).
import os
import random

sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    # Randomly skip roughly half of the entries.
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break
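To make that concrete, here is a rough sketch of the bookkeeping variant I'd rather avoid: re-scan the directory, track which paths are already in the sample, and stop once the bucket is full. The path value and the 0.5 probability are just placeholders, and it assumes the directory actually holds at least sample_size files.

import os
import random

path = "."            # placeholder: directory to sample from
sample_size = 20_000  # assumes the directory holds at least this many files

sample = set()
while len(sample) < sample_size:
    # Re-scan and skip anything already picked; this is the extra
    # bookkeeping I'd like to avoid.
    for direntry in os.scandir(path):
        if len(sample) >= sample_size:
            break
        if direntry.path in sample:
            continue
        if random.random() < 0.5:  # coin flip, like the randint check above
            sample.add(direntry.path)

for file_path in sample:
    print(file_path)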
Any ideas on how to randomly sample a large number of files from a large, nested directory structure?
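One direction I've been considering is reservoir sampling over os.walk, so the full file listing never has to be held in memory, but I'm not sure whether that's the right fit for 10 million files. A rough sketch of what I have in mind (again, path is a placeholder):

import os
import random

path = "."            # placeholder: root of the nested folder structure
sample_size = 20_000

# Reservoir sampling (Algorithm R): walk the tree once, keep a fixed-size
# reservoir, and replace existing picks with decreasing probability so
# every file is equally likely to end up in the sample.
reservoir = []
seen = 0
for dirpath, _dirnames, filenames in os.walk(path):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        seen += 1
        if len(reservoir) < sample_size:
            reservoir.append(full_path)
        else:
            j = random.randrange(seen)  # uniform index in [0, seen)
            if j < sample_size:
                reservoir[j] = full_path

for file_path in reservoir:
    print(file_path)

Would something along those lines scale sensibly, or is there a better pattern for this?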