0

In our AWS environment, we have two types of SAG (service account group) for data storage. One SAG is for generic storage; the other is for secure data and will hold only PII or restricted data. We are planning to deploy Glue in this environment. In that case, would we have one metastore covering both secure and non-secure data? If we need two metastores, how would this work with Databricks? If there is one metastore, how should the secure data be handled? Please help us with more details on this.

CHEEKATLAPRADEEP

2 Answers

0

In AWS Glue, each AWS account has one persistent metadata store per region (called the Glue Data Catalog). It contains database definitions, table definitions, job definitions, and other control information to manage your AWS Glue environment. You manage permissions to those objects using IAM (e.g., who can make GetTable or GetDatabase API calls against them).

In addition to AWS Glue permissions, you would also need to configure permissions to the data itself (e.g., who can make GetObject API calls to the data stored on S3).

So, to answer your question: yes, you would have a single data catalog. However, depending on your security requirements, you would be able to define resource-based and role-based permissions on both the metadata and the content.
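As a rough illustration of the two permission layers described above (Glue metadata and S3 data), here is a minimal Python sketch that assembles an IAM policy document for one database. The region, account ID, database name, bucket name, and key prefix are hypothetical placeholders, and the action list is only an example of read-only access, not a complete or recommended set.

```python
import json


def glue_read_statement(region, account_id, database):
    """Allow read-only metadata calls on one Glue database and its tables."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetPartitions",
        ],
        # Glue resource-level policies reference the catalog, the database,
        # and the tables inside that database.
        "Resource": [
            f"{prefix}:catalog",
            f"{prefix}:database/{database}",
            f"{prefix}:table/{database}/*",
        ],
    }


def s3_read_statements(bucket, key_prefix):
    """Allow reading objects under one prefix of one bucket."""
    return [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/{key_prefix}*"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
            "Condition": {"StringLike": {"s3:prefix": [f"{key_prefix}*"]}},
        },
    ]


def build_policy(statements):
    """Wrap statements in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical values, for illustration only.
policy = build_policy(
    [
        glue_read_statement("us-east-1", "111122223333", "generic_db"),
        *s3_read_statements("example-generic-bucket", "warehouse/"),
    ]
)
print(json.dumps(policy, indent=2))
```

A role for the secure data would get an analogous policy pointing at the secure database and bucket.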

You can find a detailed overview here - https://aws.amazon.com/blogs/big-data/restrict-access-to-your-aws-glue-data-catalog-with-resource-level-iam-permissions-and-resource-based-policies

shuraosipov
0
  1. If you are using a single region in one AWS account, there will be only one metastore for both secure and generic data, and you will have to control access with fine-grained access policies.
  2. A better approach would be to use either two different regions in a single AWS account or two different AWS accounts, so that access is easily managed across two separate metastores.

To integrate your metastore with Databricks for (1), you will have to create two Glue Catalog instance profiles with resource-level access. One instance profile will have access to the generic databases and tables, while the other will have access to the secure databases and tables.

To integrate your metastores with Databricks for (2), you will simply create two Glue Catalog instance profiles with access to the respective metastore.
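For the two-account variant of (2), a partial cluster definition might look like the Python sketch below. It is only a sketch under assumptions: the two Spark configuration keys are the ones I recall from the Databricks Glue integration documentation and should be verified there, the account IDs, role names, and cluster names are hypothetical, and required fields such as the runtime version and node type are omitted.

```python
def cluster_spec(name, instance_profile_arn, glue_account_id=None):
    """Build a partial cluster definition that uses Glue as the metastore.

    glue_account_id is only needed when the catalog lives in a different
    AWS account from the one the cluster runs in.
    """
    spark_conf = {"spark.databricks.hive.metastore.glueCatalog.enabled": "true"}
    if glue_account_id is not None:
        # Assumed key for pointing at another account's catalog.
        spark_conf["spark.hadoop.hive.metastore.glue.catalogid"] = glue_account_id
    return {
        "cluster_name": name,
        "spark_conf": spark_conf,
        "aws_attributes": {"instance_profile_arn": instance_profile_arn},
    }


# Hypothetical: one cluster per metastore, each with its own instance profile.
generic_cluster = cluster_spec(
    "generic-data",
    "arn:aws:iam::111122223333:instance-profile/glue-generic",
    glue_account_id="111122223333",
)
secure_cluster = cluster_spec(
    "secure-data",
    "arn:aws:iam::444455556666:instance-profile/glue-secure",
    glue_account_id="444455556666",
)
```

Each cluster points at exactly one catalog, which is why this layout needs one cluster per dataset.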

It is recommended to go with the second option, as it will save you a lot of maintenance cost and human error in the long run. More details on Glue Catalog and Databricks integration.

Edit: Based on the discussion in the comments, if both datasets must be accessed inside the same Databricks Runtime, option 2 won't work. Option 1 can be used with two permission sets: the first for generic data only, and the second for both generic and secure data.
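The two permission sets from the edit could be sketched in Python as below. This covers only the Glue metadata side (the matching S3 permissions are left out), and the region, account ID, and database names are hypothetical.

```python
def glue_read_policy(region, account_id, databases):
    """Build a read-only Glue policy covering the given databases."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    resources = [f"{prefix}:catalog"]
    for db in databases:
        resources.append(f"{prefix}:database/{db}")
        resources.append(f"{prefix}:table/{db}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables"],
                "Resource": resources,
            }
        ],
    }


def databases_in(policy):
    """List the database names a policy built above grants access to."""
    names = []
    for stmt in policy["Statement"]:
        for arn in stmt["Resource"]:
            if ":database/" in arn:
                names.append(arn.split(":database/", 1)[1])
    return names


# Hypothetical database names for the two kinds of data.
GENERIC = ["generic_db"]
SECURE = ["secure_db"]

# Permission set 1: generic data only.
generic_only = glue_read_policy("us-east-1", "111122223333", GENERIC)
# Permission set 2: generic and secure data.
generic_and_secure = glue_read_policy("us-east-1", "111122223333", GENERIC + SECURE)

print(databases_in(generic_only))
print(databases_in(generic_and_secure))
```

A cluster that needs both datasets would use the instance profile carrying the second permission set.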

amsh
  • Thanks for the details. "To integrate your metastores with Databricks for (2), you will simply create two Glue Catalog instance profiles with access to the respective metastore". Based on this statement, if we want to access both metastores, can we configure more than one Glue Catalog as the metastore for a Databricks Runtime? – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 16:31
  • @KarthikeyanRasipalayDurairaj, no, we cannot do that. Databricks connects to one metastore at a time; even to switch, we have to restart the Databricks Runtime. – amsh Oct 06 '20 at 16:43
  • Sure. In that case, is it not possible at all to access both the secure and non-secure data storage through the same cluster? Sorry, this may be an invalid question, but I want to understand this better. Thanks in advance. – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 16:49
  • @KarthikeyanRasipalayDurairaj That's right. It won't be possible to access both datasets inside the same cluster. – amsh Oct 06 '20 at 16:50
  • Another question: should Glue be deployed at the data storage SAG level or at the user-level SAG? Which one do you recommend, and why? Please share these details too if you are aware of them. Thanks in advance. – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 17:15
  • If I choose option 1 (with multiple permission levels), what are its disadvantages? – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 17:18
  • @KarthikeyanRasipalayDurairaj if you have complex access patterns, you will have to be diligent when creating IAM roles and policies, or users may access unauthorized datasets. If the access patterns are simple, e.g. two groups, one with access to all data and the other with access to generic data only, then it's manageable. The disadvantage depends on the access patterns. – amsh Oct 06 '20 at 17:22
  • Really great info. Should Glue be deployed at the data storage SAG level or at the user-level SAG? Which one do you recommend, and why? Please share these details too if you are aware of them. Thanks in advance. What about the deployment model for Glue? – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 17:27
  • @KarthikeyanRasipalayDurairaj, it is better to keep the Glue Catalog with the data storage; otherwise you will have to take care of cross-account access policies. Once deployed, you will have to integrate Glue with the Databricks Runtime in either case. – amsh Oct 06 '20 at 17:36
  • Do you have any reference page for this recommendation ? – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 17:40
  • @KarthikeyanRasipalayDurairaj these recommendations are based on practical experience, so I only have reasoning. – amsh Oct 06 '20 at 17:42
  • If we go ahead with option 1 (single region), is it possible to maintain two different Glue data catalogs, one for secure and another for non-secure data, in the same region? What advantages and disadvantages do you see in this deployment model? – Karthikeyan Rasipalay Durairaj Oct 06 '20 at 20:43
  • Each AWS account has one Glue Catalog per region. If you use two separate accounts for these two datasets, in the same region but with different data catalogs, you won't be able to use both of them in a single Databricks Runtime. On the other hand, your access management will be simpler. – amsh Oct 07 '20 at 02:24