Pulling random files out of a folder for sampling

Question

I need a way to pull 10% of the files in a folder, at random, for sampling after every "run." Luckily, my current files are numbered sequentially. So my current method is to list the file names, parse the numerical portion, take the max and min values, count the number of files and multiply by 0.1, then use random.sample to get a "random [10%] sample." I also write these names to a .txt file, then use shutil.copy to copy the actual files.

Obviously, this does not work if I have an outlier, i.e. if I have a file 345.txt among other files from 513.txt - 678.txt. I was wondering if there was a direct way to simply pull a number of files from a folder, randomly? I have looked it up and cannot find a better method.

Thanks.

Ignore the numbering in the file name...Simply load a list of all your files, and use random indexes into the list — Grantly, Mar 14 '18 at 14:58
@Grantly Or just pull random values out of the list without even worrying about the index. — abarnert, Mar 14 '18 at 15:23
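
What the two comments above describe can be sketched with the standard library alone: list the file names and hand them straight to random.sample, so gaps or outliers in the numbering no longer matter. The function name sample_file_names and its fraction parameter are hypothetical, chosen only for this illustration.

```python
import os
import random


def sample_file_names(folder, fraction=0.1):
    """Return a random subset of the file names in `folder` (hypothetical helper)."""
    # Keep regular files only; the numbers in the names are never parsed.
    names = [n for n in os.listdir(folder)
             if os.path.isfile(os.path.join(folder, n))]
    # random.sample draws without replacement, so no name appears twice.
    return random.sample(names, int(len(names) * fraction))
```

With 345.txt sitting next to 513.txt through 678.txt, every file has the same chance of being picked, because only the list of names is sampled.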

score 11 · Answer 1 · answered Mar 14 '18 at 15:25

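Using numpy.random.choice(array, N, replace=False) you can select N distinct items at random from an array. Without replace=False the draw is made with replacement, so the same file can come up more than once.

import numpy as np
import os

# list all files in dir
files = [f for f in os.listdir('.') if os.path.isfile(f)]

# select 0.1 of the files randomly, without repeats
random_files = np.random.choice(files, int(len(files)*.1), replace=False)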

score 2 · Accepted Answer · answered May 10 '18 at 19:30

I was unable to get the other methods to work easily with my code, but I came up with this.

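import os
import shutil
from random import sample

output_folder = 'C:/path/to/folder'
# files and subdir are assumed to be defined earlier in the surrounding code
for to_copy in sample(files, int(len(files) * .1)):  # sample never picks a file twice
    shutil.copy(os.path.join(subdir, to_copy), output_folder)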
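
A self-contained version of the same idea, as a rough sketch: it assumes the files live directly in one source folder, and it draws with random.sample so that each file is copied at most once. The name copy_random_sample and its parameters are made up for this example.

```python
import os
import random
import shutil


def copy_random_sample(source_folder, output_folder, fraction=0.1):
    """Copy a random fraction of the files in source_folder to output_folder."""
    files = [f for f in os.listdir(source_folder)
             if os.path.isfile(os.path.join(source_folder, f))]
    chosen = random.sample(files, int(len(files) * fraction))
    os.makedirs(output_folder, exist_ok=True)
    for name in chosen:
        # shutil.copy copies the file; the original stays where it is.
        shutil.copy(os.path.join(source_folder, name), output_folder)
    return chosen
```

Returning the chosen names makes it easy to write them to a .txt file afterwards, as the question describes.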

score 1 · Answer 3 · answered Mar 14 '18 at 15:05

This will give you a random tenth of the file names in the folder, with mypath being the path to the folder.

from os import listdir
from os.path import isfile, join
from random import shuffle
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
shuffle(onlyfiles)  # shuffles in place and returns None
small_list = onlyfiles[:len(onlyfiles) // 10]

This should work

Samuel Muiruri

Shuffling the whole list in-place will be less efficient than sampling out of it when you only ever want 10% of the values, but this is so simple and obvious to understand that it wins easily unless performance matters, and I doubt the performance cost will be even measurable in the overall application. – abarnert Mar 14 '18 at 15:26
If performance of anything is an issue, it’ll be calling isfile on each file; it might be worth switching from listdir to scandir to avoid all those stat calls. – abarnert Mar 14 '18 at 15:27
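
The scandir suggestion in the comment above could look roughly like this. It is only a sketch: the function name sample_with_scandir and the fraction parameter are assumptions made for the example, not part of any answer.

```python
import os
import random


def sample_with_scandir(folder, fraction=0.1):
    """Pick a random fraction of the regular files in `folder` using os.scandir."""
    with os.scandir(folder) as entries:
        # DirEntry.is_file() can often answer from information the directory
        # listing already returned, avoiding a separate stat call per entry.
        names = [entry.name for entry in entries if entry.is_file()]
    return random.sample(names, int(len(names) * fraction))
```

For a folder with a few hundred files the difference is unlikely to be noticeable; it matters mostly for very large directories or slow file systems.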

Alex Bodnya · Answer 4 · 2018-03-14T15:18:13.817

You can use the following strategy:

1. Use files = os.listdir(path) to get the names of all entries in the directory as a list.
2. Count them with count = len(files).
3. Using count, get a random index with random_position = random.randrange(0, count). Starting the range at 1 would never select the first file.
4. Repeat step 3, skipping positions you have already drawn, and save the values in a list until you have enough positions (count // 10 in your case).
5. After that, get the required file names with files[random_position].

Use a loop to iterate over the saved positions.
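
The steps above can be sketched as follows. The function name pick_by_random_positions is hypothetical, and the sketch assumes fraction lies between 0 and 1 so that the loop can always finish.

```python
import os
import random


def pick_by_random_positions(path, fraction=0.1):
    """Pick file names by drawing distinct random list positions."""
    files = os.listdir(path)              # names of all entries in the directory
    count = len(files)
    wanted = int(count * fraction)        # how many positions to collect
    positions = []
    while len(positions) < wanted:
        random_position = random.randrange(count)  # index from 0 to count - 1
        if random_position not in positions:       # skip repeated draws
            positions.append(random_position)
    return [files[p] for p in positions]
```

random.sample(range(count), wanted) yields the same kind of position list in a single call, which is why the other answers sample the list directly.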

Hope this helps!

score 0 · Answer 5 · answered Feb 15 '20 at 17:18

Based on Karl's solution (which did not work for me under Win 10, Python 3.x), I came up with this:

import numpy as np
import os

# List all files in dir
files = os.listdir("C:/Users/.../Myfiles")

# Select 0.5 of the files randomly, without picking any file twice
random_files = np.random.choice(files, int(len(files)*.5), replace=False)

# Get the remaining files
other_files = [x for x in files if x not in random_files]

# Do something with the files
for x in random_files:
    print(x)
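
The same split into a sampled group and a remaining group can also be done without numpy. This is a minimal sketch that assumes the names are unique, as file names within one folder are; split_names is a hypothetical helper name.

```python
import random


def split_names(names, fraction=0.5):
    """Split names into a random sample and the remainder (hypothetical helper)."""
    sampled = random.sample(names, int(len(names) * fraction))
    sampled_set = set(sampled)  # set lookup keeps the filter below fast
    remaining = [n for n in names if n not in sampled_set]
    return sampled, remaining
```

Calling split_names(os.listdir(folder)) would give both lists at once, with every name landing in exactly one of them.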
