
I have an API connected to an AWS Lambda function which does the following:

  1. Gets JSON data from S3 (around 60,000 records).
  2. Uses the json2csv library to parse the JSON data into a CSV string.
  3. Puts the resulting CSV string back into an S3 bucket.

Point 2 above is taking too long to parse the JSON data into a CSV string. The library I am using for it is json2csv: https://www.npmjs.com/package/json2csv

Following is my code:

const { Parser } = require('json2csv');

// Get data in JSON format in object: records (array of JSON objects)

let headers = [
    {
      label: "Id",
      value: "id"
    },
    {
      label: "Person Type",
      value: "type"
    },
    {
      label: "Person Name",
      value: "name"
    }
];

let json2csvParser = new Parser({ fields: headers });

console.log("Parsing started");
let dataInCsv = json2csvParser.parse(records);
console.log("Parsing completed");

// PutObject of dataInCsv in s3

It is taking around 20 seconds to parse 60K records. Is there anything I can do to improve the performance here? Any other library? I used to think in-memory operations were pretty fast. Why is this parsing so slow? Any help please.

  • Could be memory allocation, since the synchronous .parse loads everything into RAM. Try the async API from json2csv: https://www.npmjs.com/package/json2csv. It might be faster. – Kaelan Mikowicz Apr 03 '20 at 00:50
  • But shouldn't it be faster if everything is happening inside RAM? It's just a few MBs of data and RAM is more than a GB. – hatellla Apr 03 '20 at 01:57
  • I was unclear. The .parse method loads all the rows into RAM before it begins parsing. The async/stream method may offer you slightly better performance, since growing an array for 60k records may take some time. – Kaelan Mikowicz Apr 03 '20 at 03:27
  • I would suggest that you run your workflow locally in Node.js and compare the performance. You can also run Node.js with the profiler (--prof) to figure out why it is slow. – Kaelan Mikowicz Apr 03 '20 at 03:28
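
Building on the async/stream suggestion in the comments above, here is a minimal sketch of the streaming AsyncParser approach applied to an in-memory array. It assumes json2csv v4/v5 and reuses the headers and records variables from the question; the objectMode option is my assumption based on the package's stream documentation:

const { AsyncParser } = require('json2csv');

// objectMode lets us push plain JS objects instead of JSON text
const asyncParser = new AsyncParser({ fields: headers }, { objectMode: true });

let csv = '';
asyncParser.processor
  .on('data', chunk => (csv += chunk.toString()))
  .on('end', () => {
    // csv is complete here; PutObject of csv in s3
  })
  .on('error', err => console.error(err));

// Feed records one at a time instead of materialising the whole CSV in a single call
records.forEach(record => asyncParser.input.push(record));
asyncParser.input.push(null); // signal that no more data is expected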

1 Answer


If you are reading from and writing to files, you can use this async solution taken from the json2csv package docs.

const { createReadStream, createWriteStream } = require('fs');
const { Transform } = require('json2csv');

const fields = ['field1', 'field2', 'field3'];
const opts = { fields };
const transformOpts = { highWaterMark: 16384, encoding: 'utf-8' };

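// inputPath and outputPath are placeholder file paths from the docs example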
const input = createReadStream(inputPath, { encoding: 'utf8' });
const output = createWriteStream(outputPath, { encoding: 'utf8' });
const json2csv = new Transform(opts, transformOpts);

const processor = input.pipe(json2csv).pipe(output);

You can replace createReadStream and createWriteStream with whatever S3 read and write streams you need in Lambda.
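
For the S3 side specifically, here is a minimal sketch of that wiring, assuming the AWS SDK for JavaScript v2 and hypothetical bucket and key names:

const AWS = require('aws-sdk');
const { PassThrough } = require('stream');
const { Transform } = require('json2csv');

const s3 = new AWS.S3();

const fields = ['field1', 'field2', 'field3'];
const json2csv = new Transform({ fields }, { highWaterMark: 16384, encoding: 'utf-8' });

// Stream the source JSON object out of S3 instead of buffering it all at once
const input = s3.getObject({ Bucket: 'my-bucket', Key: 'records.json' }).createReadStream();

// Hand s3.upload a PassThrough stream as the destination for the CSV output
const output = new PassThrough();
const upload = s3.upload({ Bucket: 'my-bucket', Key: 'records.csv', Body: output }).promise();

input.pipe(json2csv).pipe(output);

upload
  .then(() => console.log('CSV uploaded'))
  .catch(err => console.error(err));

Because s3.upload accepts a stream Body, the full CSV string never has to be held in memory at once.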

Kaelan Mikowicz