Javascript - Read parquet data (with snappy compression) from AWS s3 bucket

Question

In nodeJS, I am trying to read a parquet file (compression='snappy') but not successful.

I used https://github.com/ironSource/parquetjs npm module to open local file and read it but reader.cursor() throws cryptic error 'not yet implemented'. It does not matter which compression (plain, rle, or snappy) was used to create input file, it throws same error.

Here is my code:

const readParquet = async (fileKey) => {

  const filePath = 'parquet-test-file.plain'; // 'snappy';

  console.log('----- reading file : ', filePath);
  let reader = await parquet.ParquetReader.openFile(filePath);
  console.log('---- ParquetReader initialized....');

  // create a new cursor
  let cursor = reader.getCursor();

  // read all records from the file and print them
  if (cursor) {
    console.log('---- cursor initialized....');

    let record = await cursor.next() ; // this line throws exception
    while (record) {
      console.log(record);
      record = await cursor.next();
    }
  }

  await reader.close();
  console.log('----- done with reading parquet file....');

  return;
};

Call to read:

let dt = readParquet(fileKeys.dataFileKey);
dt
  .then((value) => console.log('--------SUCCESS', value))
  .catch((error) => {
    console.log('-------FAILURE ', error); // Random error
    console.log(error.stack);
  })

More info: 1. I have generated my parquet files in python using pyarrow.parquet 2. I used 'SNAPPY' compression while writing file 3. I can read these files in python without any issue 4. My schema is not fixed (unknown) each time I write parquet file. I do not create schema while writing. 5. error.stack prints undefined in console 6. console.log('-------FAILURE ', error); prints "not yet implemented"

I would like to know if someone has encountered similar problem and has ideas/solution to share. BTW my parquet files are stored on AWS S3 location (unlike in this test code). I still have to find solution to read parquet file from S3 bucket.

Any help, suggestions, code example will be highly appreciated.

score 1 · Answer 1 · answered Mar 08 '19 at 00:55

Use var AWS = require('aws-sdk'); to get data from S3.

Then use node-parquet to read parquet file into variable.

import np = require('node-parquet');

// Read from a file:
var reader = new np.ParquetReader(`file.parquet`);
var parquet_info = reader.info();
var parquet_rows = reader.rows();
reader.close();
parquet_rows = parquet_rows + "\n";

entitycs · Answer 2 · 2021-01-28T15:21:20.050

There is a fork of https://github.com/ironSource/parquetjs here: https://github.com/ZJONSSON/parquetjs which is a "lite" version of the ironSource project. You can install it using npm install parquetjs-lite.

The ZJONSSON project comes with a function ParquetReader.openS3, which accepts an s3 client (from version 2 of the AWS SDK) and params ({Bucket: 'x', Key: 'y'}). You might want to try and see if that works for you.

If you are using version 3 of the AWS SDK / S3 client, I have a compatible fork here: https://github.com/entitycs/parquetjs (see tag feature/openS3v3).

Example usage from the project's README.md:

const parquet = require("parquetjs-lite");

const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};
// v2 example
const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});
let reader = await parquet.ParquetReader.openS3(client,params);

//v3 example
const {S3Client, HeadObjectCommand, GetObjectCommand} = require('@aws-sdk/client-s3');
const client = new S3Client({region:"us-east-1"});
let reader = await parquet.ParquetReader.openS3(
  {S3Client:client, HeadObjectCommand, GetObjectCommand},
  params
);

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

Javascript - Read parquet data (with snappy compression) from AWS s3 bucket

2 Answers2

Linked