
I've been using the metis package to run Athena queries via R. While this is great for small queries, there still does not seem to be a viable solution for queries with very large result sets (tens of thousands of rows, for example). However, when running these same queries in the AWS console, it is fast and straightforward to use the download link to obtain the CSV file of the query result.

This got me thinking: is there a mechanism for sending the query via R but returning/obtaining the S3:// bucket location where the query results live instead of the normal results object?

zach
  • Looks like the kernel of an ideal solution is beginning here: https://gitlab.com/snippets/1733206 – zach Jul 16 '18 at 13:59
  • New hrbrmstr package for using async retrieval: https://github.com/hrbrmstr/roto.athena – zach Jul 26 '18 at 19:27

2 Answers


You could look at the Cloudyr Project. They have a package that handles creating the signature requests for the AWS API. Then you can fire off a query, poll AWS until the query finishes (using the QueryExecutionID), and use aws.s3 to download the result set.
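For instance, here is a minimal sketch of that start/poll/locate pattern, written against the paws SDK (mentioned in the comments below) rather than Cloudyr's wrappers, since the flow is the same; the database and bucket names are placeholders borrowed from the CLI example that follows:

library(paws)

svc <- athena()

# Kick off the query; Athena writes the result CSV under OutputLocation.
exec <- svc$start_query_execution(
  QueryString = "select count(*) from test_null_unquoted",
  QueryExecutionContext = list(Database = "stackoverflow"),
  ResultConfiguration = list(OutputLocation = "s3://SOMEBUCKET/")
)

# Poll until the query leaves the QUEUED/RUNNING states.
repeat {
  state <- svc$get_query_execution(
    QueryExecutionId = exec$QueryExecutionId
  )$QueryExecution$Status$State
  if (!state %in% c("QUEUED", "RUNNING")) break
  Sys.sleep(2)
}

# The s3:// path of the result CSV -- the location the question asks for.
svc$get_query_execution(
  QueryExecutionId = exec$QueryExecutionId
)$QueryExecution$ResultConfiguration$OutputLocation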

You can also use system() to call the AWS CLI: execute a query, wait for the results, and download the result set (an R sketch of this flow follows the one-liner below).

For example: You could run the following commands on the command line to get the results of a query.

$ aws athena start-query-execution --query-string "select count(*) from test_null_unquoted" --query-execution-context Database=stackoverflow --result-configuration OutputLocation=s3://SOMEBUCKET/ --output text
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Once you get the query-execution-id, then you can check on the results.

$ aws athena get-query-execution --query-execution-id=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX --output text
QUERYEXECUTION  select count(*) from test_null_unquoted  XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
QUERYEXECUTIONCONTEXT  stackoverflow
RESULTCONFIGURATION  s3://SOMEBUCKET/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.csv
STATISTICS  104  1403
STATUS  1528809056.658  SUCCEEDED  1528809054.945

Once the query succeeds, you can download the data.

$ aws s3 cp s3://SOMEBUCKET/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.csv .

Edit: You can even turn those commands into a one-liner (Bash example here), but I'm sure you could do the same thing in PowerShell.

$ eid=`aws athena start-query-execution --query-string "select count(*) from test_null_unquoted" --query-execution-context Database=SOMEDATABASE --result-configuration OutputLocation=s3://SOMEBUCKET/ --output text` && until aws athena get-query-execution --query-execution-id=$eid --output text | grep "SUCCEEDED"; do sleep 10; echo "waiting..."; done && aws s3 cp s3://SOMEBUCKET/$eid.csv . && unset eid
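And, as mentioned above, the same flow can be driven from R via system2(). Again only a rough sketch, with the same placeholder names and no error handling:

# Start the query; with --output text the CLI prints just the QueryExecutionId.
eid <- system2("aws", c(
  "athena", "start-query-execution",
  "--query-string", shQuote("select count(*) from test_null_unquoted"),
  "--query-execution-context", "Database=SOMEDATABASE",
  "--result-configuration", "OutputLocation=s3://SOMEBUCKET/",
  "--output", "text"
), stdout = TRUE)

# Poll until get-query-execution reports SUCCEEDED.
repeat {
  status <- system2("aws", c("athena", "get-query-execution",
                             "--query-execution-id", eid,
                             "--output", "text"), stdout = TRUE)
  if (any(grepl("SUCCEEDED", status))) break
  Sys.sleep(10)
}

# Download the result CSV from the query's output location.
system2("aws", c("s3", "cp", sprintf("s3://SOMEBUCKET/%s.csv", eid), "."))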

Zerodf
  • Thanks @Zerodf, this is great. My ideal solution would be within the context of the JDBC driver, so let me give it another day or two to see if such an answer pops up; otherwise I will accept this answer. – zach Jun 13 '18 at 18:10
  • This concept is used in the packages [`RAthena`](https://github.com/DyfanJones/RAthena) and [`noctua`](https://github.com/DyfanJones/noctua). The packages use Python's Boto3 and R's paws SDK to create a DBI interface to Athena. If you want the Athena Query ID to be surfaced within these packages, please raise an issue ticket on the linked GitHubs. – Dyfan Jones Nov 28 '19 at 19:13

As mentioned in my comment above, you could investigate the RAthena and noctua packages.

These packages connect to AWS Athena using AWS SDKs as their drivers, which means they also download the data from S3 in a similar manner to the method mentioned by @Zerodf. They both use data.table to load the data into R, so they are fairly quick. You can also get to the Query Execution ID if it is required for some reason.

Here is an example of how to use the packages:

RAthena

Create a connection to AWS Athena; for more information on how to connect, please look at dbConnect:

library(DBI)
con <- dbConnect(RAthena::athena())

An example of how to query Athena:

dbGetQuery(con, "select * from sampledb.elb_logs")

How to access the Query ID:

res <- dbSendQuery(con, "select * from sampledb.elb_logs")

sprintf("%s%s.csv", res@connection@info$s3_staging, res@info$QueryExecutionId)
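If you want the raw CSV itself rather than the parsed result, that path can be handed straight back to the AWS CLI; a small sketch, where the path layout is just the sprintf() construction above:

# Sketch: download the result CSV directly, reusing the path built above.
csv_path <- sprintf("%s%s.csv", res@connection@info$s3_staging, res@info$QueryExecutionId)
system2("aws", c("s3", "cp", csv_path, "."))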

noctua

Create a connection to AWS Athena; for more information on how to connect, please look at dbConnect:

library(DBI)
con <- dbConnect(noctua::athena())

An example of how to query Athena:

dbGetQuery(con, "select * from sampledb.elb_logs")

How to access the Query ID:

res <- dbSendQuery(con, "select * from sampledb.elb_logs")

sprintf("%s%s.csv", res@connection@info$s3_staging, res@info$QueryExecutionId)

Sum up

These packages should do what you are looking for; however, as they already download the data from the query output in S3, I don't believe you will need to go to the Query Execution ID to do the same process.

Dyfan Jones