
I'm trying to use an Athena view as a data source for my AWS Glue job. The error I get when running the Glue job is about the classification of the view. What can I define it as? Thank you

[Screenshot: the error message appearing]

  • The only workaround I could find so far is using the Filter class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-filter.html In my case I am using the view to filter some data. – Akira Yamamoto Jan 21 '19 at 05:53
  • I found another way of using a SQL query in AWS Glue: https://stackoverflow.com/a/45824735/475876 – Akira Yamamoto Feb 07 '19 at 05:14
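To illustrate the Filter workaround mentioned in the first comment: the filtering logic that the Athena view performs can instead be expressed as a row predicate and applied with Glue's Filter transform. A minimal sketch — the column names and the predicate are hypothetical stand-ins, not from the question, and the awsglue usage is shown in comments since it only runs inside a Glue job:

```python
# Hypothetical predicate standing in for the WHERE clause the Athena view applies.
def keep_row(row):
    return row["status"] == "active" and row["amount"] > 0

# Inside a Glue job, the predicate is applied to a DynamicFrame like this:
# from awsglue.transforms import Filter
# filtered = Filter.apply(frame=my_dynamic_frame, f=keep_row)

print(keep_row({"status": "active", "amount": 10}))  # True
print(keep_row({"status": "closed", "amount": 10}))  # False
```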

1 Answer


You can, by using the Athena JDBC driver. This approach bypasses the Glue Data Catalog, since only Athena (and not Glue, as of 25-Jan-2019) can directly access views.

  1. Download the driver and store the jar to an S3 bucket.
  2. Specify the S3 path to the driver as a dependent jar in your job definition.
  3. Load the data into a dynamic frame using the code below (using an IAM user with permission to run Athena queries).
from awsglue.dynamicframe import DynamicFrame
# ...
athena_view_dataframe = (
    glueContext.read.format("jdbc")
    .option("user", "[IAM user access key]")
    .option("password", "[IAM user secret access key]")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", "my_database.my_athena_view")
    .option("S3OutputLocation", "s3://bucket/temp/folder")  # CSVs/metadata dumped here on load
    .load()
)

athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")
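The connection options above can also be collected into a single dict and passed with `.options(**opts)`, which makes the region and output path easier to parameterize across jobs. A sketch — the helper name and its default region are my own, not part of the answer:

```python
def athena_jdbc_options(access_key, secret_key, table, s3_output, region="us-east-1"):
    """Build the JDBC option dict used above (hypothetical helper)."""
    return {
        "user": access_key,
        "password": secret_key,
        "driver": "com.simba.athena.jdbc.Driver",
        "url": "jdbc:awsathena://athena.{}.amazonaws.com:443".format(region),
        "dbtable": table,
        "S3OutputLocation": s3_output,
    }

# Usage in the job:
# df = glueContext.read.format("jdbc").options(**athena_jdbc_options(...)).load()
```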

The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).

The main side effect of this approach is that each load dumps the query results as CSV files (plus metadata) to the location specified by the S3OutputLocation option.
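If those leftover result files are a concern, they can be cleaned up after the load. A sketch, assuming the key layout Athena uses (a .csv file plus a .csv.metadata companion): a pure helper picks out the result files under the temp prefix, with the actual boto3 deletion left as a comment since it needs AWS credentials:

```python
def temp_result_keys(keys, prefix="temp/folder/"):
    """Select the CSV results and their .metadata companions that Athena
    wrote under the temp prefix (hypothetical key layout)."""
    return [
        k for k in keys
        if k.startswith(prefix) and (k.endswith(".csv") or k.endswith(".csv.metadata"))
    ]

# Deletion with boto3 (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# listing = s3.list_objects_v2(Bucket="bucket", Prefix="temp/folder/")
# for key in temp_result_keys([o["Key"] for o in listing.get("Contents", [])]):
#     s3.delete_object(Bucket="bucket", Key=key)
```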

I don't believe that you can create a Glue Connection to Athena via JDBC because you can't specify an S3 path to the driver location.

Attribution: AWS support totally helped me get this working.

Alejandro C De Baca