
I have to parse a very big CSV file in Node.js and save its rows in a database (an async operation) that allows at most 500 entries per insert. Due to memory limits I have to stream the CSV file, and I want to use PapaParse to parse it (as that worked best in my case).

As PapaParse uses a callback-style approach to parse Node.js streams, I didn't see an easy way to combine highland (for batching and data transforms) with PapaParse. So I tried to use a PassThrough stream to write the parsed rows to, and to read that stream with highland for batching:

const csv = require('papaparse');
const fs = require('fs');
const highland = require('highland');
const { PassThrough } = require('stream');

// Readable stream for the (big) CSV file
const fileStream = fs.createReadStream('file.csv');
const passThroughStream = new PassThrough({ objectMode: true });

csv.parse(fileStream, {
  step: function(row) {
    // Write data to stream
    passThroughStream.write(row.data[0]);
  },
  complete: function() {
    // End the stream so the highland consumer knows no more data is coming
    passThroughStream.end();
  },
});

highland(passThroughStream)
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });

Obviously that doesn't work as is and doesn't really do anything. Is something like that even possible, or is there a better way to parse very big CSV files and save the rows in a database (in batches of up to 500)?

Edit: Using the csv package (https://www.npmjs.com/package/csv) it would be possible like so (same for fast-csv):

highland(fileStream.pipe(csv.parse()))
  .map((data) => {
    // data transform
  })
  .batch(500)
  .map((data) => {
    // Save up to 500 entries in database (async call)
  });

But unfortunately, neither npm package parses my CSV files correctly in all cases.
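
Whichever parser ends up feeding the stream, I assume the highland side would also need to wrap the async insert and actually consume the stream. Here is a rough sketch of what I mean (`db.insertBatch()` is just a made-up placeholder for the real database call and is assumed to return a promise):

highland(passThroughStream)
  .map((data) => {
    // data transform
    return data;
  })
  .batch(500)
  .flatMap((batch) => {
    // Wrap the promise in a highland stream so the inserts run one batch at a time
    return highland(db.insertBatch(batch));
  })
  .done(() => {
    // All rows have been parsed and saved
    console.log('done');
  });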

mrksbnch
  • You'd need to control the flow with the `parser.pause()` and `parser.resume()` methods that PapaParse seems to provide (see the sketch after these comments). I may write a module for this. – Michał Karpacki Feb 06 '18 at 19:52
  • @MichałCzapracki Hm, that'll probably take longer if I have to wait after every 500 rows for the database save to finish (the CSV files are very big). – mrksbnch Feb 06 '18 at 20:02
  • Well, it will anyway if you use Highland - because that's how it handles asynchronous work - but consider whether your database will actually handle the inserts correctly. – Michał Karpacki Feb 06 '18 at 20:10
  • Check this one out: [scramjet](https://www.npmjs.com/package/scramjet) - it will work on the inserts in parallel. But remember, if you need to keep the order, you may have a race condition on the DB side - in sequences and autoindexes. – Michał Karpacki Feb 06 '18 at 20:12
  • @MichałCzapracki Thanks, I've looked at scramjet and it looks good (similar to highland to me), but I'm not exactly sure how it will work with PapaParse either (as I unfortunately can't use "fast-csv" or "csv" in my case). I see a small example with CSVParse, but not much about it in the docs beyond that. – mrksbnch Feb 06 '18 at 23:32
  • And I forgot to mention: the order is not really an issue in my case. – mrksbnch Feb 06 '18 at 23:34
  • It took me a couple of lines to make papaparse work with scramjet, so I changed my internal parser from csv-parse to papa - because there are no dependencies underneath. I'll write a proper answer. – Michał Karpacki Feb 07 '18 at 09:14
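
For reference, a minimal sketch of the flow control described in the comments above: pause PapaParse whenever the PassThrough stream signals back-pressure and resume it once the stream drains. The file name is only a placeholder, and per the PapaParse docs the parser handle is passed as the second argument to `step`; the highland pipeline from the question would then read from `passThroughStream`.

const csv = require('papaparse');
const fs = require('fs');
const { PassThrough } = require('stream');

const fileStream = fs.createReadStream('file.csv');
const passThroughStream = new PassThrough({ objectMode: true });

csv.parse(fileStream, {
  step: function(row, parser) {
    // write() returns false once the PassThrough's internal buffer is full
    if (!passThroughStream.write(row.data[0])) {
      parser.pause();
      passThroughStream.once('drain', () => parser.resume());
    }
  },
  complete: function() {
    // Signal the end of the data to downstream consumers
    passThroughStream.end();
  },
});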

1 Answer


After a quick look at papaparse, I decided to implement a CSV parser in scramjet.

const scramjet = require('scramjet');

fileStream.pipe(new scramjet.StringStream('utf-8'))
    .csvParse(options)
    .batch(500)
    .map(items => db.insertArray('some_table', items));

I hope that works for you. :)

Michał Karpacki