
I have a directory that contains small XML files (each file is 170–200 bytes), and I want to read the content of every file and merge them into a single XML file, displayed in a tree.

OLD

FileUtils.File + NetUtil.asyncFetch + NetUtil.readInputStreamToString

Time to read 3000 XML files 1112.3642930000005 msec

NEW

OS.File.DirectoryIterator + OS.File.read

Time to read 3000 XML files 5330.708094999999 msec

I noticed an enormous difference in the reading time per single file: OLD takes 0.08–0.12 msec, while NEW takes 0.5–6.0 msec (6.0 is not a typo; I saw some time peaks that high, compared to the OLD).

I know that the OLD one is backed by C++, but at https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/OSFile.jsm it says:

OS.File is a new API designed for efficient, off-main thread, manipulation of files by privileged JavaScript code.

I don't see the efficiency of the NEW API. Is there something wrong in my code?

N.B.: dbgPerf is a performance-debugging helper that collects times and comments in an array of objects and performs all calculations only when I call its end function at the very end; it does not affect the measurements.

Code using nsIFile :

this._readDir2 = function (pathToTarget, callbackEndLoad) {

    var _content = '';
    dbgPerf.add("2 start read dir");

    var fuDir = new FileUtils.File(pathToTarget);
    var entries = fuDir.directoryEntries;
    var files = [];
    while (entries.hasMoreElements()) {

        var entry = entries.getNext();
        entry = entry.QueryInterface(OX.LIB.Ci.nsIFile);

        if (entry.isFile()) {

            var channel = NetUtil.newChannel(entry);
            files.push(channel);
            dbgPerf.add("ADD file" + entry.path);
        } else {
            dbgPerf.add("NOT a file" + entry.path);
        }
    }

    var totalFiles = files.length;
    var totalFetched = 0;

    for (var a = 0; a < files.length; a++) {

        var entry = files[a];

        dbgPerf.add("start asynch file " + entry.name);
        NetUtil.asyncFetch(entry, function (inputStream, status) {

            totalFetched++;

            if (!Components.isSuccessCode(status)) {
                dbgPerf.add('asyncFetch failed for reason ' + status);
                return;
            } else {

                _content += NetUtil.readInputStreamToString(inputStream, inputStream.available());
                dbgPerf.add("process end file " + entry.name);
            }

            if (totalFetched == files.length) {

                var parser = new DOMParser();

                _content = _content.replace(/<root>/g, '');
                _content = _content.replace(/<\/root>/g, '');
                _content = _content.replace(/<catalog>/g, '');
                _content = _content.replace(/<\/catalog>/g, '');
                _content = _content.replace(/<\?xml[\s\S]*?\?>/g, '');

                var xmlDoc = parser.parseFromString('<?xml version="1.0" encoding="utf-8"?><root>' + _content + '</root>', "text/xml");
                //dbgPerf.add("2 end parsing XML file " + arrFileData);

                var response = {};
                response.total = totalFiles;
                response.xml = xmlDoc;

                callbackEndLoad(response);
            }
        });
    }

    dbgPerf.add("2 AFTER REQUEST ALL FILE");
};

CODE USING OS.File :

this._readDir = function (pathToTarget, callbackEndLoad) {

    dbgPerf.add("1 start read dir");

    var xmlDoc;
    var arrFileData = '';

    var iterator = new OS.File.DirectoryIterator(pathToTarget);

    var files = [];
    iterator.forEach(function onEntry(entry) {
        if (!entry.isDir) {
            files.push(entry.path);
        }
    });

    var totalFetched = 0;

    files.forEach(function (fpath) {

        Task.spawn(function () {

            arrFileData += OS.File.read(fpath, {
                encoding: "utf-8"
            });

            totalFetched++;

            if (totalFetched == files.length) {

                var parser = new DOMParser();

                arrFileData = arrFileData.replace(/<root>/g, '');
                arrFileData = arrFileData.replace(/<\/root>/g, '');
                arrFileData = arrFileData.replace(/<catalog>/g, '');
                arrFileData = arrFileData.replace(/<\/catalog>/g, '');
                arrFileData = arrFileData.replace(/<\?xml[\s\S]*?\?>/g, '');

                xmlDoc = parser.parseFromString('<?xml version="1.0" encoding="utf-8"?><root>' + arrFileData + '</root>', "text/xml");
                dbgPerf.add("1 end parsing XML file " + arrFileData);

                var response = {};
                response.xml = xmlDoc;

                callbackEndLoad(response);
            }
        });
    });
};
  • If you only need to read (not write), another option would be XHR, which should be faster and more efficient, but I have not compared their performance. – erosman May 15 '15 at 03:03
  • IMHO it probably takes more resources to open/close an XHR than to read a file stream when you already have a pointer to it. I'll try XMLHttpRequest and post the results; I hope someone from the Mozilla devs can answer this question, just to know whether OS.File performance is better than nsIFile or whether I must do something different in the code to obtain better results. – Francesco Danti May 15 '15 at 07:15
  • Your code blocks are the wrong way around. – Luckyrat May 15 '15 at 08:18
  • Very very awesome research thanks for sharing this!! XHR to read is an interesting thing to test too! – Noitidart May 16 '15 at 07:35
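
A rough sketch of the XHR idea from the first comment (the helper name is hypothetical; this assumes privileged code, where `OS.Path.toFileURI` can turn a native path into a `file://` URI):

```
// Hypothetical sketch: read one local XML file with XMLHttpRequest
// instead of an input stream. Assumes privileged (chrome) code.
function readFileViaXHR(filePath, onDone) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", OS.Path.toFileURI(filePath), true);
    xhr.overrideMimeType("text/plain"); // read raw text; parse the merged XML later
    xhr.onload = function () { onDone(xhr.responseText); };
    xhr.onerror = function () { onDone(null); };
    xhr.send();
}
```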

3 Answers


I'm the author of OS.File.

We ran some benchmarks of nsIFile vs. OS.File back in the day. If you were to rewrite either nsIFile to work on a background thread (which is not possible, by design of XPConnect) or OS.File to work on the main thread (which we made impossible, to avoid blocking the UX), in most cases that I recall you would find that OS.File is faster.

As mentioned, OS.File is designed specifically to not perform any work on the main thread. That's because I/O tasks have unpredictable duration – in extreme and unpredictable cases, the simple act of closing a file can block the thread for several seconds, which is unacceptable on the main thread.

A consequence of this is that what you are benchmarking is actually the following:

  1. Serialize the request and send it to the OS.File thread;
  2. Perform the actual I/O;
  3. Serialize the response and send it to the main thread;
  4. Wait until the next tick of the main thread (which is when the main thread actually receives the response);
  5. Deserialize the response;
  6. Trigger the then callback and wait until the next tick of the main thread (by definition of Promise).

The I/O efficiency is in step 2: OS.File is often much smarter than nsIFile, so it performs less I/O. That's better for battery, better for being a good citizen and playing nice with other processes, and better by comparison with other I/O performed on the same thread. The responsiveness comes from the fact that we perform as little work as possible on the main thread. But if your code executes on the main thread, the total throughput is often going to be much lower than with nsIFile, due to steps 1, 3, 4, 5 and 6.
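
Rough arithmetic on the timings in the question illustrates this: (5330.7 − 1112.4) msec / 3000 files ≈ 1.4 msec of extra latency per file, which is consistent with each read paying the serialize/dispatch/deserialize round trip above rather than with the I/O itself being slower.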

I hope this answers your question.

PS: Your snippets are wrong. For one thing, they are inverted. Also, you forgot a yield in the call to OS.File.read.
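
For readers: without the yield, OS.File.read returns a promise object, and concatenating it to a string stores something like "[object Promise]" instead of the file's content. A minimal sketch of the correction, based on the second snippet in the question:

```
// Sketch of the corrected read inside Task.spawn: OS.File.read returns a
// promise, so its result must be yielded before it can be used as a string.
Task.spawn(function () {
    var data = yield OS.File.read(fpath, { encoding: "utf-8" });
    arrFileData += data;
    // ... rest of the callback unchanged ...
});
```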

Yoric
  • Thanks for the answer, I really appreciate your input. I understand perfectly from what you said what OS.File does. I will try benchmarking again with the correction you suggested. Is there a way to accomplish the task of reading 3000 to 10000 small files in a directory? Do you think it is better to have an external agent/service, like a database server, that does this specific work? – Francesco Danti May 16 '15 at 19:53
  • I'm not sure I understand the question. You can read many small files in a directory, either using the main thread API for OS.File, or by spawning your own worker and using the worker API for OS.File. The former is easier to do, so you might want to do it for testing, but the latter will probably perform better, i.e. your I/O will not accidentally starve other clients of OS.File. I'm not sure what you mean by external agent/service. – Yoric May 17 '15 at 19:25
  • @FrancescoDanti please see Yoric's follow-up. I don't think you got notified of his comment. – Noitidart May 20 '15 at 04:14
  • Sorry for the delay; I saw the answer but I'm doing further testing, and at the end I will share the results. By external agent/service I mean an external program, such as a server, that does all the processing workload and gives back the results, so it speeds up this specific workload (not designed to be done by JavaScript) in async mode, like an XHR, so it won't break the UI. BUT I'm still trying to remain in browser/JavaScript and avoid all external "help". – Francesco Danti May 20 '15 at 11:46
  • In that case, I suggest you spawn a [ChromeWorker](https://developer.mozilla.org/en-US/docs/Web/API/ChromeWorker) and use OS.File from that worker. – Yoric May 21 '15 at 15:48

OS.File is efficient because it is non-blocking. Sure, this makes benchmarks suffer, but the user will enjoy an uninterrupted experience and even an increase in perceived speed.

paa

What you've demonstrated is a way in which the new OS.File approach is much slower than the old approach, but that doesn't necessarily conflict with the statement that the new method is more efficient.

The fact that the I/O runs on a different thread means that other parts of the application can still do useful work while the I/O thread is waiting for the (often incredibly slow) storage to supply the data. That directly results in visible improvements such as increased UI smoothness and therefore in nearly all cases, users will benefit from this new approach.

However, the cost for these types of increased efficiency is that your code no longer gets immediate access to the file it has requested so the total time you have to wait for the data to be supplied to your code is going to be higher.

It might be worth testing a third approach where you run your code in a worker – this will get you access to a synchronous file API and therefore might allow you to regain some of the speed you saw with the old nsIFile approach, while retaining the benefit of not blocking the main thread.

https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/OSFile.jsm/OS.File_for_workers
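
A minimal sketch of that third approach (file names are hypothetical, and the worker-side iteration details may differ slightly from the exact API documented at the link above):

```
// main thread: spawn a ChromeWorker and hand it the directory path
var worker = new ChromeWorker("reader_worker.js"); // hypothetical script name
worker.onmessage = function (e) {
    // e.data.content is the concatenated text of all files
    callbackEndLoad(e.data.content);
};
worker.postMessage({ path: pathToTarget });
```

```
// reader_worker.js – inside a worker, the OS.File API is synchronous
importScripts("resource://gre/modules/osfile.jsm");

self.onmessage = function (e) {
    var content = "";
    var iterator = new OS.File.DirectoryIterator(e.data.path);
    try {
        var entry;
        while (true) {
            try {
                entry = iterator.next();   // stops when the directory is exhausted
            } catch (ex) {
                break;
            }
            if (!entry.isDir) {
                content += OS.File.read(entry.path, { encoding: "utf-8" });
            }
        }
    } finally {
        iterator.close();
    }
    self.postMessage({ content: content });
};
```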

Luckyrat