
My Athena database is in the ap-south-1 region, and AWS QuickSight isn't available in that region.

How can I connect QuickSight to Athena in that case?

Srihari Karanth

2 Answers


All you need to do is copy the table definitions from one region to another. There are several ways to do that:

With AWS Console

This approach is the simplest one and doesn't require any additional setup, as everything is based on Athena DDL statements.

  1. Get the table definition with
    SHOW CREATE TABLE `database`.`table`;
    
    This should output something like:
    CREATE EXTERNAL TABLE `database`.`table`(
      `col_1` string, 
      `col_2` bigint, 
      ... 
      `col_n` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://some/location/on/s3'
    TBLPROPERTIES (
      'classification'='parquet',
      ... 
      'compressionType'='gzip')
    
  2. Switch to the desired region.
  3. Create a database where you want to store the table definitions, or use the default one.
  4. Execute the statement produced by SHOW CREATE TABLE. Note that you might need to change the database name according to the previous step.
  5. If your table is partitioned, then you will need to load all partitions. If the data on S3 adheres to the Hive partitioning style, i.e.

    s3://some/location/on/s3
    |
    ├── day=01
    |   ├── hour=00
    |   └── hour=01
    ...
    

    then you can use

    MSCK REPAIR TABLE `database`.`table`
    

    Alternatively, you can load partitions one by one

    ALTER TABLE `database`.`table`
    ADD PARTITION (day='01', hour='00')
    LOCATION 's3://some/location/on/s3/day=01/hour=00';

    ALTER TABLE `database`.`table`
    ADD PARTITION (day='01', hour='01')
    LOCATION 's3://some/location/on/s3/day=01/hour=01';
    
    ...
    

With AWS API

You can use an AWS SDK, e.g. boto3 for Python, which provides an easy-to-use, object-oriented API. Here you have two options:

  1. Use the Athena client. As in the previous approach, you would need to get the table definition statement from the AWS Console, but all other steps can be done in a scripted manner with the start_query_execution method of the Athena client. There are plenty of resources online, e.g. this one; a hedged sketch is also shown after this list.

  2. Use the AWS Glue client. This method is based solely on operations within the AWS Glue Data Catalog, which is used by Athena during query execution. The main idea is to create two Glue clients, one for the source and one for the destination catalog. For example

    import boto3
    KEY_ID = "__KEY_ID__"
    SECRET = "__SECRET__"
    
    glue_source = boto3.client(
        'glue',
        region_name="ap-south-1",
        aws_access_key_id=KEY_ID,
        aws_secret_access_key=SECRET
    )
    
    glue_destination = boto3.client(
        'glue',
        region_name="us-east-1",
        aws_access_key_id=KEY_ID,
        aws_secret_access_key=SECRET
    )
    
    # Or you can do it by creating sessions
    glue_source = boto3.session.Session(profile_name="profile_for_ap_south_1").client("glue")
    glue_destination = boto3.session.Session(profile_name="profile_for_us_east_1").client("glue")
    

    Then you would need to use the get and create type methods (e.g. get_table and create_table). This also requires some parsing of the responses that you get back from the Glue clients.
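
    A minimal sketch of that get/create step, assuming the two Glue clients defined above and hypothetical database/table names (get_table returns read-only metadata that create_table rejects, so only the definition fields are kept):

    # Hypothetical names -- adjust to your own catalog.
    DATABASE = "my_database"
    TABLE = "my_table"

    # Fetch the table definition from the source catalog (ap-south-1).
    table = glue_source.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

    # Keep only the fields accepted by create_table (get_table also returns
    # timestamps, catalog ids and other read-only attributes).
    allowed = {
        "Name", "Description", "Owner", "Retention", "StorageDescriptor",
        "PartitionKeys", "TableType", "Parameters",
    }
    table_input = {k: v for k, v in table.items() if k in allowed}

    # Create the database (if missing) and the table in the destination catalog (us-east-1).
    try:
        glue_destination.create_database(DatabaseInput={"Name": DATABASE})
    except glue_destination.exceptions.AlreadyExistsException:
        pass
    glue_destination.create_table(DatabaseName=DATABASE, TableInput=table_input)

Similarly, for the first option, here is a minimal sketch of submitting the DDL produced by SHOW CREATE TABLE through the Athena client; the result bucket is a hypothetical placeholder:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Paste here the statement produced by SHOW CREATE TABLE in ap-south-1.
    ddl = "CREATE EXTERNAL TABLE ..."

    response = athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
    )

    # Optionally poll until the DDL statement has finished.
    execution = athena.get_query_execution(QueryExecutionId=response["QueryExecutionId"])
    print(execution["QueryExecution"]["Status"]["State"])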

With AWS Glue crawlers

Although you can use AWS Glue crawlers to "rediscover" the data on S3, I wouldn't recommend this approach, since you already know the structure of your data.

Ilya Kisil

@Ilya Kisil's answer is correct, but I would like to add some more details and alternative solutions.

There are two different approaches you can take.

  1. As suggested by Ilya, copy the table definitions from one region (the source region) to another (the destination region). The idea is to reference the data that stays in the other region.

    I found the Glue Crawlers much easier and faster. You need to create a Glue Crawler in the source region and point it at the S3 bucket in the destination region where the data is located. Once you do that, you will see all the tables of the destination region in Athena in the source region! Behind the scenes, what the Glue Crawler does is what Ilya explained in the "With AWS Console" section. So, instead of creating the tables one by one and loading the partitions (if they exist), you can just create one Glue Crawler (a hedged sketch of such a crawler appears after this list).

    Note that it only holds a reference to your destination region tables, so the data itself is not copied. At first glance this seems great: why copy the data if we can just reference it? But when you take a deeper look, you will find that you are probably going to pay more money. When you reference the data, you pay for what each query scans (Athena charges per data scanned, plus cross-region data transfer), and if you consume the data a lot and have TB/PB of it, this can get too expensive. If cost is a consideration for you, I would recommend considering the second solution.

    Also note that although the data is not copied to the source region and is only referenced, behind the scenes, when you execute a query, AWS temporarily stores the results in the source region. So if you need to be GDPR compliant, you might need to be aware of that.

  2. Copy the data from the destination region to the source region and have a process that keeps it synchronized. Then you avoid the cross-region charges on every Athena query and instead pay for the duplicated storage, which is usually cheaper. If possible, you can also copy just what you need, or aggregate the data, so you have less copied storage and less cost.

    A convenient way to do this is to create a Glue Job that is responsible for copying the data from the destination region's S3 bucket to the source region's S3 bucket. You can then add it to a Glue Workflow that runs this job once a day, or on whatever schedule suits you.
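
    A sketch of what such a Glue (PySpark) job script could look like; the bucket names are hypothetical placeholders:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the data from the bucket in the destination region...
    df = glue_context.spark_session.read.parquet("s3://my-remote-region-bucket/some/prefix/")

    # ...optionally filter or aggregate it, then write it to the bucket in the source region.
    df.write.mode("overwrite").parquet("s3://my-source-region-bucket/some/prefix/")

    job.commit()

And going back to the first solution, a minimal sketch of creating such a crawler with boto3; the crawler name, IAM role and bucket are hypothetical placeholders:

    import boto3

    # Glue client in the source region (where Athena/QuickSight will run).
    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_crawler(
        Name="cross-region-crawler",
        Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": "s3://my-remote-region-bucket/some/prefix/"}]},
    )

    # Run it once; it can also be scheduled.
    glue.start_crawler(Name="cross-region-crawler")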

To Summarize:

There are lots of things to consider and I mentioned some of them. Each use case has its own advantages and disadvantages, and you can find the one that is right for you.

(Solution 1) Advantages:

  • Easy. Just a few clicks.
  • Fast.
  • References the data, so there is no duplicated data.

(Solution 1) Disadvantages:

  • Might be way more expensive (depends on the data usage).

(Solution 2) Advantages:

  • Might be much cheaper

(Solution 2) Disadvantages:

  • Slower/longer to set up
  • Need to copy the existing data and then have a process that copies new data
Misha Zaslavsky