
Is it possible to run simple concurrent SQL queries against partitioned data on S3?

The problem is that it looks like you can only pick 2 options out of 3.

You can run concurrent SQL queries against S3 with S3 Select, but S3 Select doesn't support partitioning and it only works on a single object at a time.
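
For illustration, a minimal sketch of what that looks like with boto3's select_object_content; the bucket, keys and column names below are hypothetical, and concurrency only comes from fanning out one request per object:

import boto3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical bucket, keys and columns -- replace with your own
BUCKET = "com.example.data"
KEYS = ["tab1/part-0000.csv", "tab1/part-0001.csv"]

s3 = boto3.client("s3")

def select_one(key):
    # One S3 Select query against one object; S3 Select has no notion of partitions
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT s.col1, s.col2 FROM S3Object s",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    return b"".join(
        event["Records"]["Payload"] for event in resp["Payload"] if "Records" in event
    )

# Fan out one query per object to emulate concurrent reads over "partitions"
with ThreadPoolExecutor(max_workers=len(KEYS)) as pool:
    results = list(pool.map(select_one, KEYS))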

Athena supports partitioning and SQL queries, but it has a limit of 20 concurrent queries. The limit can be increased, but there is no guarantee and no defined upper bound.
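
For reference, a minimal sketch of submitting one query through boto3's Athena client; the database, table and output location are hypothetical, and submissions beyond the account's concurrency limit get throttled:

import time
import boto3

athena = boto3.client("athena")

# Hypothetical database, table and output location -- replace with your own
resp = athena.start_query_execution(
    QueryString="SELECT col1, col2 FROM tab1 WHERE dt = '2018-11-01'",
    QueryExecutionContext={"Database": "exampledb"},
    ResultConfiguration={"OutputLocation": "s3://com.example.data/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)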

You can configure HBase to run on S3 through EMRFS, but that requires too much configuration, and I suppose the data would have to be written through HBase (i.e. in another format). Is there maybe a simpler solution?

  • If you query sparingly, Athena is a great choice (and the concurrency limit is not a problem). For sustained (e.g. production) workloads, standalone Presto (https://aws.amazon.com/marketplace/pp/B07DKV5659) or Presto on EMR are great options that leave you in control. – Piotr Findeisen Nov 05 '18 at 08:33

1 Answer


You can also use managed services such as AWS Glue or AWS EMR.

Example code that you can run in Glue:

import sys 
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

def load_dict(_database,_table_name):
    # Load the table from the Glue Data Catalog as a DynamicFrame, convert it to a
    # Spark DataFrame and register it as a temp view so it can be queried with Spark SQL
    ds = glueContext.create_dynamic_frame.from_catalog(database = _database, table_name = _table_name, transformation_ctx = "ds_table")
    df = ds.toDF()
    df.createOrReplaceTempView(_table_name)
    return df

df_tab1=load_dict("exampledb","tab1")

# Query the registered view with Spark SQL and write the result to S3 as pipe-delimited CSV
df_sql=spark.sql( "select m.col1, m.col2 from tab1 m")
df_sql.write.mode('overwrite').options(header=True, delimiter = '|').format('csv').save("s3://com.example.data/tab2")

job.commit()

You can also consider using Amazon Redshift Spectrum.

https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/
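
A rough sketch of driving a Spectrum query from Python via the Redshift Data API (boto3 "redshift-data"), assuming a hypothetical cluster, user and an external schema already mapped to the Glue Data Catalog:

import boto3

rsd = boto3.client("redshift-data")

# Hypothetical cluster, database, user and external schema -- replace with your own.
# Spectrum reads the data directly from S3 through the external table.
resp = rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT col1, col2 FROM spectrum_schema.tab1 WHERE dt = '2018-11-01'",
)
statement_id = resp["Id"]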
