Read data in node from AWS S3 ORC format database

Question

My client have database in AWS s3 ORC format. Now i have to query to fetch data form s3. to do that i used Amazon Athena but the problem is that it's cost price to use athena. so just wanted to ask is there any other way to query ORC format in node js. below is the code of Athena i used.

"use strict";

const AthenaExpress = require("athena-express"),
    aws = require("aws-sdk"),
    awsCredentials = {
        region: "<region>"
    };

aws.config.update(awsCredentials);

const athenaExpressConfig = {
    aws,
    s3: "s3://<bucketName>/<folder>/<folder>",
    getStats: true
};

const athenaExpress = new AthenaExpress(athenaExpressConfig);
// console.log('athenaExpress');
// console.log(athenaExpress);
//Invoking a query on Amazon Athena
(async () => {
    let query = {
        sql: "SELECT * FROM cityMaster LIMIT 3",
        db: "<dbName>",
        getStats: true
    };

    try {
        let results = await athenaExpress.query(query);
        console.log(results);
    } catch (error) {
        console.log(error);
    }
})();

I have been googling but didn't get any tutorial or blog to query ORC format in nodejs. Can someone please guide me how to access S3 ORC data in node js.

Have you tried to mount the s3 bucket to your local file system in order to read it with this library here then? https://cloud.netapp.com/blog/amazon-s3-as-a-file-system https://github.com/corymickelson/norc — Citrullin, Jul 30 '20 at 22:17

score 0 · Answer 1 · answered Aug 03 '20 at 05:25

I think the main thing here is you requiring a 'query service' for Node.js. It's important to note that S3 is more like a container for data storage, rather than a database itself (thus the need for Amazon Athena, which is a query service). What you'll have to do is to first connect S3 to a query service or database of some sort in order to analyse the ORC data, then connect the resulting tables / queries to Node.js.

Alternatives to analysing of S3 data I've seen people use are Presto and Snowflake (with Snowflake being the less common of the two - perhaps due to costs?).

Article on comparing Presto with AWS Athena
Article on comparing Snowflake with AWS Athena

Based on the material I'm able to find, it sounds like Snowflake is the easier option of the two, with Presto being the harder option should you wish to run the Hive metastore on a Docker container. Do get an understanding of the costs of services (to both query and store data) before starting to use them.

Here's an article on accessing S3 data with Presto (another for querying data).
Here's an article on querying ORC data in S3 with Snowflake before loading.

Once the data is in either Presto or Snowflake, it looks relatively straightforward to connect the data to Node.js (see npm packages for Presto and Snowflake).

Read data in node from AWS S3 ORC format database

1 Answers1