4

I have an issue with a project of mine, which aims to scan one or more directories for MP3 files and store their metadata and paths in MongoDB. The main computer running the code is a Windows 10 64-bit machine with 8 GB of RAM and an AMD Ryzen 3.5 GHz CPU (4 cores). Windows resides on an SSD, while the music is on a 1 TB HDD.
The Node.js app can be launched manually from the command line or through NPM, starting from here. I'm using a recursive function to scan all the directories, and we're talking about 20 thousand files, more or less.
I've solved the EMFILE: too many files open issue with graceful-fs, but now I've run into a new one: JavaScript heap out of memory.
Below is the complete output I receive:

C:\Users\User\Documents\GitHub\mp3manager>npm run scan

> experiments@1.0.0 scan C:\Users\User\Documents\GitHub\mp3manager
> cross-env NODE_ENV=production NODE_OPTIONS='--max-old-space-size=4096' node scripts/cli/mm scan D:\Musica

Scanning 1 resources in production mode
Trying to connect to  mongodb://localhost:27017/music_manager
Connected to mongo...

<--- Last few GCs --->

[16744:0000024DD9FA9F40]   141399 ms: Mark-sweep 63.2 (70.7) -> 63.2 (71.2) MB, 47.8 / 0.1 ms  (average mu = 0.165, current mu = 0.225) low memory notification GC in old space requested
[16744:0000024DD9FA9F40]   141438 ms: Mark-sweep 63.2 (71.2) -> 63.2 (71.2) MB, 38.9 / 0.1 ms  (average mu = 0.100, current mu = 0.001) low memory notification GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x02aaa229e6e9 <JSObject>
    0: builtin exit frame: new ArrayBuffer(aka ArrayBuffer)(this=0x027bb3502801 <the_hole>,0x0202be202569 <Number 8.19095e+06>,0x027bb3502801 <the_hole>)

    1: ConstructFrame [pc: 000002AF8F50D385]
    2: createUnsafeArrayBuffer(aka createUnsafeArrayBuffer) [00000080419526C9] [buffer.js:~115] [pc=000002AF8F8440B1](this=0x027bb35026f1 <undefined>,size=0x0202be202569 <Number 8.19095e+06>)
    3:...

FATAL ERROR: Committing semi space failed. Allocation failed - JavaScript heap out of memory
 1: 00007FF6E36FF04A
 2: 00007FF6E36DA0C6
 3: 00007FF6E36DAA30
 4: 00007FF6E39620EE
 5: 00007FF6E396201F
 6: 00007FF6E3E82BC4
 7: 00007FF6E3E79C5C
 8: 00007FF6E3E7829C
 9: 00007FF6E3E77765
10: 00007FF6E3989A91
11: 00007FF6E35F0E52
12: 00007FF6E3C7500F
13: 00007FF6E3BE55B4
14: 00007FF6E3BE5A5B
15: 00007FF6E3BE587B
16: 000002AF8F55C721
npm ERR! code ELIFECYCLE
npm ERR! errno 134

I've tried to use NODE_OPTIONS='--max-old-space-size=4096', but I'm not even sure Node is honoring this option on Windows. I've also tried p-limit to cap the number of promises actually running, but honestly, I'm a bit out of new ideas now and I'm starting to think about using another language to see if it copes better with these kinds of issues. Any advice would be appreciated. Have a nice day.

EDIT: I tried to substitute the processDir function with the one posted by @Terry, but the result is the same.

Update 2019-08-19: to avoid the heap issues, I removed the recursion and used a queue to hold the directories:


const path = require('path');
const mm = require('music-metadata');
const _ = require('underscore');
const fs = require('graceful-fs');
const readline = require('readline');

const audioType = require('audio-type');
// const util = require('util');
const { promisify } = require('util');
const logger = require('../logger');
const { mp3hash } = require('../../../src/libs/utils');
const MusicFile = require('../../../src/models/db/mongo/music_files');

const getStats = promisify(fs.stat);
const readdir = promisify(fs.readdir);
const readFile = promisify(fs.readFile);
// https://github.com/winstonjs/winston#profiling

class MusicScanner {
    constructor(options) {
        const { paths, keepInMemory } = options;

        this.paths = paths;
        this.keepInMemory = keepInMemory === true;
        this.processResult = {
            totFiles: 0,
            totBytes: 0,
            dirQueue: [],
        };
    }

    async processFile(resource) {
        const buf = await readFile(resource);
        const fileRes = audioType(buf);
        if (fileRes === 'mp3') {
            this.processResult.totFiles += 1;

            // process the metadata
            this.processResult.totBytes += buf.length; // was `fileSize`, which is undefined here
        }
    }

    async processDirectory() {
        while(this.processResult.dirQueue.length > 0) {
            const dir = this.processResult.dirQueue.shift();
            const dirents = await readdir(dir, { withFileTypes: true });
            const filesPromises = [];

            for (const dirent of dirents) {
                const resource = path.resolve(dir, dirent.name);
                if (dirent.isDirectory()) {
                    this.processResult.dirQueue.push(resource);
                } else if (dirent.isFile()) {
                    filesPromises.push(this.processFile(resource));
                }
            }

            await Promise.all(filesPromises);
        }
    }


    async scan() {
        const promises = [];

        const start = Date.now();

        for (const thePath of this.paths) {
            this.processResult.dirQueue.push(thePath);
            promises.push(this.processDirectory());
        }

        const paths = await Promise.all(promises);
        this.processResult.paths = paths;
        return this.processResult;
    }
}

module.exports = MusicScanner;

The problem here is that the process takes 54 minutes to read 21K files, and I'm not sure how I could speed it up in this case. Any hints on that?

Chris
  • Do you have some large files in the directory? Your code reads every kind of file into the heap, and videos are easily gigabytes. A stack trace after switching to the scanDir @Terry provided should show the size of the buffer being allocated, which may help differentiate. With the original version it could blow up on the total directory contents; with his traversal it should survive unless an individual file is sufficiently large. – lossleader Aug 17 '19 at 14:59

2 Answers

2

I'm not sure how helpful this will be, but I created a test script to see if I got the same results as you; I'm also running Windows 10.

It might be useful for you to run this script and see if you hit any issues. I'm able to list all the files in /program files/ (~91k files) or even /windows (~265k files) without blowing up. Maybe it's another operation, rather than simply listing the files, that's causing the problem.

The script will return a list of all the files in the path, which is pretty much what you need. Once you have it, the list can simply be iterated linearly and the details added to your MongoDB instance.

const fs = require('fs');
const path = require('path');
const { promisify } = require('util');
const getStats = promisify(fs.stat);
const readdir = promisify(fs.readdir);

async function scanDir(dir, fileList) {

    let files = await readdir(dir);
    for(let file of files) {
        let filePath = path.join(dir, file);
        fileList.push(filePath);
        try {
            let stats = await getStats(filePath);
            if (stats.isDirectory()) {
                await scanDir(filePath, fileList);
            }
        } catch (err) {
            // Drop on the floor.. 
        }
    }

    return fileList;   
}

function logStats(fileList) {
    console.log("Scanned file count: ", fileList.length);
    console.log(`Heap total: ${parseInt(process.memoryUsage().heapTotal/1024)} KB, used: ${parseInt(process.memoryUsage().heapUsed/1024)} KB`);
}

async function testScan() {
    let fileList = [];
    let handle = setInterval(logStats, 5000, fileList);
    let startTime = new Date().getTime();
    await scanDir('/program files/', fileList);
    clearInterval(handle);
    console.log(`File count: ${fileList.length}, elapsed: ${(new Date().getTime() - startTime)/1000} seconds`);
}

testScan();
Terry Lennox
  • Even trying your code produces the same result for me. The strange thing is that Visual Studio Code also crashes, even when I launch the app through the command line, using the CLI app... – Chris Aug 12 '19 at 12:41
  • VS Code **sometimes** crashes; maybe it's a side effect of the memory issues – Chris Aug 12 '19 at 12:50
  • Ok, cool. Thank you for trying out the code. I wonder what the difference is that's causing the problem. I am using Node.js v10.15.1. I've also updated my answer to log memory stats, which might be useful. I do notice memory usage increasing the more files are scanned. – Terry Lennox Aug 12 '19 at 13:21
  • Another point: is it possible you're running into unbounded recursion here? Is there something within the test directory structure that's causing the process to continue without stopping? 20,000 files seems a very small number to cause an out-of-memory error. I'm seeing memory usage of perhaps 257 MB scanning my entire C drive. A lot more than optimal, but not leading to an out-of-memory error. – Terry Lennox Aug 12 '19 at 15:59
  • Scanning my drive (as noted above) used ~257 MB of memory for around 1.7 million files (as a point of comparison) – Terry Lennox Aug 12 '19 at 16:07
0

I consider this issue solved (at least on Linux; I still have to try on Windows), after following these steps (using an Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz with 8 GB of RAM here):

  • Removal of the recursive function in favor of a queue strategy: I store the directories' paths in an array, and I store the promises for processing files in a temporary array, awaiting them once its length exceeds 100;
  • Use of mediainfo and eyeD3 instead of music-metadata: despite music-metadata being a great module, I noticed it was consuming 140% of my CPU and 30% of my RAM. The combined use of mediainfo and eyeD3 (the latter just to extract the cover image) improved performance a lot. No more heap issues.
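The batching described in the first bullet can be sketched roughly like this (the function name and the threshold of 100 illustrate the approach; they are not taken verbatim from the project):

```javascript
const BATCH_SIZE = 100;

// Collect file-processing promises and await them in groups of at
// most BATCH_SIZE, so only a bounded number of reads is in flight
// (and retained on the heap) at any moment.
async function processInBatches(filePaths, processFile) {
    let pending = [];
    for (const filePath of filePaths) {
        pending.push(processFile(filePath));
        if (pending.length >= BATCH_SIZE) {
            await Promise.all(pending);
            pending = []; // drop references so buffers can be GC'd
        }
    }
    await Promise.all(pending); // flush the final partial batch
}
```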

Now storing 20329 files in Mongo takes less than 4 minutes, while if I also store the cover art images it takes around 16 minutes (due to the extra file reading and eyeD3 execution).

Complete source code here.

Chris