
I have a function that gets all the files in a directory recursively using fs.readdirSync. It worked well on the small directory I used as a test, but now that I am running it on a directory that is over 100GB, it is taking a very long time to complete. Any ideas on how I can speed this up, or is there a better way of doing this? I will eventually have to run this over directories with terabytes of data.

const fs = require('fs');

// Recursive function to get files
function getFiles(dir, files = []) {
    // Get an array of all files and directories in the passed directory using fs.readdirSync
    const fileList = fs.readdirSync(dir);
    // Create the full path of the file/directory by concatenating the passed directory and file/directory name
    for (const file of fileList) {
        const name = `${dir}/${file}`;
        // Check if the current file/directory is a directory using fs.statSync
        if (fs.statSync(name).isDirectory()) {
            // If it is a directory, recursively call the getFiles function with the directory path and the files array
            getFiles(name, files);
        } else {
            // If it is a file, push the full path to the files array
            files.push(name);
        }
    }
    return files;
}
Omi in a hellcat
  • You shouldn't have a single directory with gazillions of files in the first place - filesystems tend to be bad at that. Also, with your current implementation doing that may well run out of memory storing all the filenames. – AKX Jul 06 '23 at 18:44
  • Also, you might want to just use `fs/promises`'s asynchronous `readdir` instead. – AKX Jul 06 '23 at 18:45
  • 2
    Also-also: what does `time find -type f that-big-100gb-directory > /dev/null` say? Is that slow, too? – AKX Jul 06 '23 at 18:50
  • When you say `is over 100GB large` and `directories with Terabytes of data`, are you talking about the size of the files or the number of files? – t.niese Jul 06 '23 at 20:04
  • Sorry if I wasn't clear, when I refer to a directory I am just referring to a network drive that I am scanning through. – Omi in a hellcat Jul 06 '23 at 21:42
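
For illustration, the asynchronous `fs/promises` variant suggested in the comments could look roughly like this (a sketch only, with a hypothetical getFilesAsync name; as the answer below found, going async was not actually faster here):

import { readdir } from 'fs/promises';

// Sketch of the async readdir approach mentioned in the comments (hypothetical helper).
// withFileTypes returns Dirent objects, so no separate stat call is needed per entry.
async function getFilesAsync(dir, files = []) {
    const entries = await readdir(dir, { withFileTypes: true });
    for (const entry of entries) {
        const name = `${dir}/${entry.name}`;
        if (entry.isDirectory()) {
            // Recurse into subdirectories
            await getFilesAsync(name, files);
        } else {
            // Collect full paths of plain files
            files.push(name);
        }
    }
    return files;
}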

1 Answer


Unfortunately, going async is slower here, so we need to optimize your synchronous code instead. Passing the `{ withFileTypes: true }` option to `readdirSync` makes it about 2x faster, because each returned entry already reports whether it is a directory, so the extra `fs.statSync` call per entry can be dropped.

I also tried Node v20's `{ recursive: true }` option, but it was slower than even your original solution, and it didn't work together with `withFileTypes` in my tests.
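
For reference, that option could be used roughly like this (a sketch with a hypothetical getFilesRecursiveOption name, assuming Node 20+; note that the result also contains directory entries, which still have to be filtered out):

import fs from 'fs';

// Sketch of Node 20's recursive readdir option (hypothetical helper, assumes Node >= 20).
// The returned paths are relative to dir and include directories as well as files,
// so a stat-based filter is still needed to keep only the files.
function getFilesRecursiveOption(dir) {
    return fs.readdirSync(dir, { recursive: true })
        .map(name => `${dir}/${name}`)
        .filter(name => !fs.statSync(name).isDirectory());
}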

Maybe a better SSD with a high read speed would help, although directory entries are read from the file system's metadata rather than from the file contents, so I'm not sure how much the hardware affects this.

import fs from 'fs';

const DIR = '/bytex';

function getFiles(dir, files = []) {
    // Get an array of all files and directories in the passed directory using fs.readdirSync
    const fileList = fs.readdirSync(dir);
    // Create the full path of the file/directory by concatenating the passed directory and file/directory name
    for (const file of fileList) {
        const name = `${dir}/${file}`;
        // Check if the current file/directory is a directory using fs.statSync
        if (fs.statSync(name).isDirectory()) {
            // If it is a directory, recursively call the getFiles function with the directory path and the files array
            getFiles(name, files);
        } else {
            // If it is a file, push the full path to the files array
            files.push(name);
        }
    }
    return files;
}

function getFiles2(dir, files = []) {
    // withFileTypes returns Dirent objects, so the per-entry fs.statSync call is no longer needed
    const fileList = fs.readdirSync(dir, { withFileTypes: true });
    // Recurse into subdirectories, push full paths of plain files
    fileList.forEach(file => file.isDirectory()
        ? getFiles2(`${dir}/${file.name}`, files)
        : files.push(`${dir}/${file.name}`));
    return files;
}

let start = performance.now();
let files = getFiles(DIR);
console.log(performance.now() - start);
console.log(files.length);

start = performance.now();
files = getFiles2(DIR);
console.log(performance.now() - start);
console.log(files.length);

The output (elapsed time in milliseconds, then file count, first for getFiles and then for getFiles2):

171.66947209835052
64508
68.24071204662323
64508
Alexander Nenashev
  • For reference, I ran your getFiles2 function over a network drive totaling 292GB and it took 3.1 hours to complete. Much faster than my original solution. On average twice as fast. – Omi in a hellcat Jul 07 '23 at 16:45
  • 1
    @Omiinahellcat nice to hear. as i predicted. but your bottleneck is the network. if you could execute the code remotely with the drive as a local one it could be seconds not hours – Alexander Nenashev Jul 07 '23 at 16:53