Traditional Data Lake vs AWS Lake Formation

Question

I have been setting up data lakes for clients, where we load data from on-prem or other sources into S3 (the data lake). We then create an AWS Glue catalog over the raw data to define schemas.

The next step is to use either EMR or AWS Glue for data cleansing, then load the transformed data into RDS, Redshift, or S3 as the final target.

The jobs can be scheduled using AWS Data Pipeline, Glue jobs, or AWS Lambda event triggers, depending on the use case and the service used.

Analysts and other users are given access to the required data / S3 buckets through IAM, so they can build QuickSight visualizations, query the data with Athena, Drill, etc., or use it for ML applications in SageMaker.

My question is: how is AWS Lake Formation different from the traditional data lake described above?

Is it right to say that AWS Lake Formation makes all of the above services (S3, Glue Catalog, the ETL code generator in Glue, the job scheduler, etc.) available in a single window, with some more advanced security for users and data (record / column level) that can be configured from within the Lake Formation console?

Is there anything else that makes Lake Formation stand out from a traditional cloud-based data lake?

Thanks

Golammott · Answer 1 · 2021-03-05T02:26:06.970

Your understanding is correct. Lake Formation is essentially a permissions model over the Glue Catalog that integrates closely with the other AWS data lake tools: Athena, S3, Glue, EMR, etc. It also offers some additional features such as Blueprints (for syncing data from an RDBMS to S3), Jobs (for ETL), and Crawlers (for data discovery).

Lake Formation makes permission management easier for the "user" IAM roles in your environment by letting you manage them centrally through the Lake Formation UI and API. Instead of updating individual IAM/bucket policies each time a role needs new access, you onboard a single "service" IAM role that has bucket access and then grant database-, table-, or column-level access to the user IAM roles that need it.
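As a rough illustration of that flow, here is a minimal sketch using the boto3 `lakeformation` client's `register_resource` and `grant_permissions` calls. The helper names, ARNs, and the database/table/column names are hypothetical placeholders, not taken from any real setup, and a real deployment may need extra options (for example a catalog ID).

```python
def location_registration(bucket_arn, service_role_arn):
    """Arguments for RegisterResource: the service role Lake Formation
    uses to access the bucket on behalf of users."""
    return {"ResourceArn": bucket_arn, "RoleArn": service_role_arn}


def column_select_grant(user_role_arn, database, table, columns):
    """Arguments for GrantPermissions: column-level SELECT for one user role."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": user_role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def onboard_and_grant(client, bucket_arn, service_role_arn,
                      user_role_arn, database, table, columns):
    """Register the S3 location once, then grant access to a user role."""
    client.register_resource(**location_registration(bucket_arn, service_role_arn))
    client.grant_permissions(
        **column_select_grant(user_role_arn, database, table, columns)
    )


# Example call (all identifiers below are made up for illustration):
# import boto3
# onboard_and_grant(
#     boto3.client("lakeformation"),
#     "arn:aws:s3:::example-data-lake-bucket",
#     "arn:aws:iam::111122223333:role/example-lf-service-role",
#     "arn:aws:iam::111122223333:role/example-analyst",
#     "sales", "orders", ["order_id", "order_date"],
# )
```

The registration step happens once per S3 location; after that, giving another analyst role access is just one more grant, with no bucket policy changes.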

The user roles essentially assume the service role to perform their operations (it may not be a literal role assumption, since the mechanism is an AWS black box). So Lake Formation saves you the hassle of managing permissions for all user IAM roles through a mess of IAM/bucket policies.

It also makes it easier to share data with resources in other AWS accounts, if your setup requires it.
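For the cross-account case, a grant can name another AWS account ID as the principal. The sketch below only builds the arguments for boto3's `grant_permissions`; the function name, account ID, and database/table names are hypothetical, and including the grant option is an assumption about how the consuming account would want to re-grant access to its own roles.

```python
def cross_account_table_grant(consumer_account_id, database, table,
                              permissions=("SELECT",)):
    """Arguments for GrantPermissions that share one table with another account."""
    return {
        # For cross-account sharing the principal is the consumer's account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
        # Assumption: lets the consumer account pass the permission on
        # to its own principals.
        "PermissionsWithGrantOption": list(permissions),
    }


# Example call (identifiers are made up for illustration):
# import boto3
# boto3.client("lakeformation").grant_permissions(
#     **cross_account_table_grant("444455556666", "sales", "orders")
# )
```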

score 4 · Answer 2 · answered Sep 12 '20 at 06:04

AWS Lake Formation is primarily a permission control layer coupled with AWS Glue, so you get a catalog together with permission control. It relieves you of managing IAM permissions directly and instead provides its own fine-grained, grant-based permission control using simple database-style grants.

Lake Formation still has some challenges integrating with certain data services such as EMR (it requires additional IAM policies). Overall, though, using Lake Formation with S3 and Glue ETL provides everything needed to build a data lake.

Lake Formation could still benefit from an improved UI and better data discovery.

You can use Lake Formation to implement a traditional-style data lake, or to make your data lakes more modular and support multiple AWS accounts.
