
Goal

We are standing up a low-volume site where users (browser clients) will select image files (284 KB per file) and then ask a Node Express server to bundle them into a ZIP for download to the web client.

Issues & Design Constraints

  • The resultant ZIP might be on the order of 50 MB to 5 GB. Therefore we would like to give the user a running progress bar while the ZIP is being constructed. (We assume the browser will give running updates as to the progress of the actual download.)
  • We expect a low volume of requests (1-2 at a time). However, we do not want to completely tie up our 4-core server processor, so we want to minimize synchronous calls that tie up the Express server.
  • Given the size of the ZIP, we cannot expect the ZIP to be assembled entirely in memory.
  • Are there any other issues we should worry about?
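As a sketch of the progress-bar side of this, one way to push build progress to the browser is Server-Sent Events. Everything below (the handler name, the event names, the shape of the progress source) is our own illustration, not a fixed part of the design:

```javascript
// Sketch: push ZIP-build progress to the browser over Server-Sent Events.
// progressEmitter is whatever emits 'progress' (a 0-100 number) and 'done'
// while the archive is being built; those event names are assumptions.
function sseHandler(res, progressEmitter) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  const onProgress = (pct) => res.write(`data: ${pct}\n\n`);
  progressEmitter.on('progress', onProgress);
  progressEmitter.once('done', () => {
    progressEmitter.removeListener('progress', onProgress);
    res.end();
  });
}
```

On the client, an `EventSource` pointed at this endpoint would receive each `data:` line and drive the progress bar.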

Question

We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 284 KB files had been added to the ZIP.

So which of the following packages are Node/Express-friendly, given the design constraints/goals listed above?

What I am seeing above is that most packages first collect the files, then finalize the archive in memory, and then pipe it to the HTTP response (probably not good for 5 GB of data, or am I missing something?). Some seem to be able to use disk, but the question is whether one gets update events as each file is added.

Others seem to be fully async, and I don't see how you would get a running progress value as each file is added to the ZIP package.

Dr.YSG
  • I would think the simplest design would be to run a child process that puts the resulting ZIP file on disk in a temp file and manages its own memory consumption. Then, when that's done, you can stream the temp file from disk as the download. All you need is an executable that offers some sort of progress on the construction of the ZIP file to stdout. Since it's running in another process, you don't have to worry about how it does its job processor-wise, because it won't tie up node.js in any way. Compressing is, by its nature, somewhat CPU-intensive, so you can't really avoid that. – jfriend00 May 23 '17 at 16:01
  • @jfriend00 all I have available on the target machine is 7zip, and when I tried the -bb logging switch, it just prints out the files after the archive is made. Maybe there is a switch I am missing? https://sevenzip.osdn.jp/chm/cmdline/switches/index.htm – Dr.YSG May 26 '17 at 18:48

2 Answers

Of the packages listed above, most were not appropriate:

  • JSZIP is mainly for the browser.
  • EasyZip is a node wrapper for JSZIP, but it does not provide progress notifications during creation.
  • Express-Zip is an in-memory, Express-friendly solution (it pipes straight to the response), but it probably would not handle the size of ZIP we are talking about.
  • ZIP-Stream is the underlying utility underneath Archiver. Archiver adds the queuing services, so one should just use Archiver.
  • YAZL might work, but the interface is more complex for progress tracking than Archiver's.

We chose Archiver, since it had most of the features we wanted:

  • Express friendly
  • Low memory footprint
  • As fast as 7zip for the particular image archives we create (we don't need to compress; the files are large; etc.). You might see a ~25% performance hit for other types of archives.
  • It does not let you append to existing archives (a feature we wanted), but adm-zip might fill that gap.

As for the 7zip solution: we tend not to like reading the entrails of a standard-output stream from a spawned child process.

  • It is messy to find strings in the streams.
  • It causes context switches to read the stream.
  • You have a brittle solution trying to deal with whatever the output stream emits (e.g. in the case of 7zip it sometimes jumps the counter by 30%, sometimes by 1%), as well as other sources of brittleness.
Dr.YSG
  • One question that really depends upon the performance profile for your server is that it appears node-archiver is running in-process. If you're doing a lot of this or serving a lot of users at once, you may really benefit from moving the archive handling out of process (as in running 7zip in another process). And, at load, that would provide far more scalability benefits than the extra cost of piping stdout from one process to another. Dealing with parsing the output stream is a purely solvable problem (it's just a matter of fully understanding it and writing code to handle it). – jfriend00 May 30 '17 at 20:39
  • So, that's certainly fine to prefer the progress interface that node-archiver gives you, but be careful about whether it meets your scalability needs or not. The issues you mention with 7zip are completely solvable. More code to write, but solvable if you want the advantages of throwing more CPUs at the problem or offloading the zip handling from your express process to free it up to be more responsive handling other web requests. – jfriend00 May 30 '17 at 20:41
  • The very first line in my post reads: "We are standing up a low-volume site," so no, scaling to lots of simultaneous users is not part of the architecture or problem space. – Dr.YSG May 30 '17 at 21:04
  • And then you say "However, we do not want to completely tie up our 4 core server processor" so I wouldn't say your requirement is really very clear. If you're happy with what you've got, that's fine. My comments apply to others (who might have slightly different desires) who come along and read this. – jfriend00 May 30 '17 at 21:21
  • It's an I/O-bound problem. I did not want a package that tried to compress already-compressed images and turn this into a CPU-bound problem. One can only give a small piece of the problem in SO before folks' eyes glaze over. – Dr.YSG May 30 '17 at 21:26
  • OK, understood. – jfriend00 May 30 '17 at 21:51

We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 284 KB files had been added to the ZIP.

That appears to be a false assumption.

A command line like this will show progress for each file added to the archive on stdout as each new file is added:

7z a -bsp1 -bb3 test.7z *

So, you can launch that from node.js using the child_process module and capture the stdout progress as it happens. You will need to use spawn, not exec, so you get the stdout data live rather than all at once at the end.

Running this as a child process will keep your nodejs process free to serve other requests and will allow the child process to manage its own memory, independent of nodejs.

The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.

jfriend00
  • Thank you @jfriend00. I got time to do my own R&D into this. There are pros and cons to both methods. The NPM packages that use disk for staging the data (and can handle 64-bit ZIP containers) use the node zlib module https://nodejs.org/api/zlib.html , which might save heavy process context switching compared with a 7zip spawn approach. I am going to test out both 7zip and archiver (which has nice progress monitoring) and report back. – Dr.YSG May 29 '17 at 16:18
  • BTW, the -bb3 (or just -bb1) flag does not help; as stated, it does not show progress. Here is what I am using for 7zip: 7z a -tzip -mx1 -mmt=on -r -bsp1 – Dr.YSG May 29 '17 at 16:20
  • Oh yes, the reason I did not think that 7zip could do incremental progress was: https://superuser.com/questions/702122/how-to-show-extraction-progress-of-7zip-inside-cmd – Dr.YSG May 29 '17 at 16:21
  • @Dr.YSG - In my command window, it shows each file that is being added to the archive as it is added. I consider that a coarse level of progress. And, it's way more info than I was seeing without that flag. If you're looking for something else, then please be a LOT more specific in your question. I thought I was helping and offering you info you didn't already know. I guess I'm just wasting my time. – jfriend00 May 29 '17 at 16:27
  • You are helping, and that is why I said thank you earlier (see top comment). What I was commenting was that the -bb3 flag by itself is not useful; in fact you can remove it. It is the -bsp1 flag that actually gives progress (the -bb flag shows results at the end of the compression). – Dr.YSG May 29 '17 at 19:47
  • @jfriend00, since you are a meta member, perhaps you know: is there a way to give you the bounty credit, but for me to provide my own answer that better maps to my problem and is a more reasonable approach for most folks? (i.e. I found that node-archiver works better and is at least as fast as 7zip for my data sets.) – Dr.YSG May 29 '17 at 19:54
  • @Dr.YSG - I have not actually offered a bounty myself (I mostly answer questions, not ask them), but [this post in the Help Center](https://stackoverflow.com/help/bounty) says: "The bounty period lasts 7 days. Bounties must have a minimum duration of at least 1 day. After the bounty ends, there is a grace period of 24 hours to manually award the bounty. Simply click the bounty award icon next to each answer to permanently award your bounty to the answerer. (You cannot award a bounty to your own answer.)". So, it sounds like you can award a bounty to one answer and then accept your own answer. – jfriend00 May 29 '17 at 23:30
  • @Dr.YSG - And, as described [here on meta](https://meta.stackexchange.com/questions/116567/can-i-award-a-bounty-to-myself-if-i-provide-the-best-answer), you can accept your own answer, but award the bounty to a different answer that helped you. – jfriend00 May 29 '17 at 23:33