
The code below fetches CSV file data from AWS S3. After fetching the data I need to manipulate the response and return the same data from the Node.js backend to the frontend. The problem is that the data is more than 200k records, which is not feasible for Node to keep in memory and return to the frontend.

const AWS = require('aws-sdk');
const xlsx = require('xlsx');

AWS.config.update({
  accessKeyId: "xxxxxxxxxxxxxxxx",
  secretAccessKey: "xxxxxxxxxxxxxxxxxxxxxxxx",
  region: "--------"
});

const s3 = new AWS.S3();
const params = {
  Bucket: 'bucket',
  Key: 'userFIle/test.csv',
  Range: "bytes=7777-9999"
};

// buffers the whole (ranged) object into memory before parsing
// (this runs inside an async function)
const datae = await s3.getObject(params).promise();
const str = datae.Body.toString();

// parse the CSV text with xlsx and convert every sheet to JSON
const workBook = xlsx.read(str, { type: 'binary' });
const jsonData = workBook.SheetNames.reduce((initial, name) => {
  const sheet = workBook.Sheets[name];
  initial[name] = xlsx.utils.sheet_to_json(sheet);
  return initial;
}, {});
console.log(jsonData, "==fffffff==", jsonData.Sheet1.length);
  • Hi, welcome to StackOverflow. :) I'm looking at your code and I'd very much like to answer the question, but I'd need you to explain a couple of things: 1) Why do you use a ranged request? You'd need the data from the beginning (the header). 2) Is this a standard CSV file? I'd show you an easier way to parse it then. – Michał Karpacki Aug 30 '19 at 16:13
  • Hi, and thanks for the welcome. Actually I wanted to use pagination while receiving data from AWS S3, because the CSV file uploaded to S3 can contain more than 200k records and I don't want to pass all the records to Node at the same time, as it will stop Node.js from working. Yes, the uploaded file is a standard CSV file. – Chandandeep Singh Sep 02 '19 at 12:28
  • Hmm... in that case, if you're not attached to the modules you use here, I could propose a good solution with [`scramjet`](https://www.scramjet.org/). – Michał Karpacki Sep 02 '19 at 15:27
  • Thanks @MichałKarpacki for the help. – Chandandeep Singh Sep 03 '19 at 09:31

1 Answer


The AWS S3 SDK can be used with streams, and since CSV is a format well suited to streaming, the file can be parsed and transformed while it is still being downloaded.
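For reference, a minimal sketch of that idea with just the SDK and Node's built-in `readline` (no extra modules) could look like the following; the bucket/key are the placeholders from the question, the column names are made up, and the naive `split(',')` stands in for a real CSV parser:

const AWS = require('aws-sdk');
const readline = require('readline');

const s3 = new AWS.S3();

// createReadStream() exposes the GetObject response as a Node readable stream,
// so the file is never buffered whole in memory
const s3Stream = s3
  .getObject({ Bucket: 'bucket', Key: 'userFIle/test.csv' })
  .createReadStream();

// readline emits one CSV line at a time as it arrives from S3
const rl = readline.createInterface({ input: s3Stream, crlfDelay: Infinity });

rl.on('line', (line) => {
  // naive split; quoted fields containing commas need a proper CSV parser
  const [id, price, name] = line.split(',');
  console.log({ id, price, name });
});

s3Stream.on('error', (err) => console.error('S3 stream error:', err));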

I'd recommend (as the author) using scramjet; based on this answer your code could look like this:

const { StringStream } = require("scramjet");

// s3 and params are the S3 client and request parameters defined in the question
StringStream
  // a scramjet stream can be created from any stream source or a generator function
  .from(() => s3.headObject(params)   // check the object exists before streaming it
    .promise()
    .then(() => s3.getObject(params).createReadStream())
  )
  // then you can run a flow of commands here
  // First we need to parse the items as they are being downloaded from S3.
  // If you have a header in the first line, you can pass {header: true} here,
  // see: https://www.papaparse.com/docs#config for more options
  .CSVParse()
  // here you can map each row to your own structure
  .map(row => ({
    id: row[0],
    price: row[1],
    name: row[2]
  }))
  // you can also use async functions or promises for every line
  .each(async function(item) { await doSomethingWithItem(item); })
  // this will print every line of your CSV while it's being downloaded
  .each(console.log)
  .run()
  .catch(error => {
    if (error.statusCode === 404) {
      // catching NoSuchKey
    }
  });

Please see the docs here: www.scramjet.org.
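Since the original goal was to return the data to the frontend without holding 200k records in memory, here is a hedged sketch (not part of the original answer) of how the same pipeline could be wired into an assumed Express handler, streaming the parsed rows out as newline-delimited JSON; `s3` and `params` are assumed to be set up as in the question:

const { StringStream } = require("scramjet");

// hypothetical Express route; s3 and params as in the question
app.get("/csv", async (req, res) => {
  res.setHeader("Content-Type", "application/x-ndjson");

  try {
    await StringStream
      .from(() => s3.getObject(params).createReadStream())
      .CSVParse({ header: true })
      // write each row to the response as soon as it is parsed,
      // so Node never keeps the full 200k records in memory
      .each(row => { res.write(JSON.stringify(row) + "\n"); })
      .run();
    res.end();
  } catch (error) {
    // headers are already sent by now, so just terminate the response
    res.end();
  }
});

The frontend can then consume the response incrementally (for example with a streaming fetch reader) instead of waiting for one large JSON payload.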

Michał Karpacki