Questions tagged [data-lake]
161 questions
0
votes
1 answer
AWS Glue ETL Job getting final dataFrame with Join.apply Vs SQL JOIN Query
I am fairly new to AWS and I am currently exploring it. I was hoping to get some insight or suggestions on the best way to implement the job.
I want to get data from multiple MySQL tables:
user_transaction
user_loans
promo_offers
To get the final…

Vivek Raskar
- 711
- 5
- 4
0
votes
1 answer
ADLS to Azure Storage Sync Using AzCopy
Looking for some help to resolve the errors I'm facing. Let me explain the scenario. I'm trying to sync one of the ADLS Gen2 containers to Azure Blob Storage. I have AzCopy 10.4.3 and am using azcopy sync to do this. I'm using the command below:
azcopy…

VidhyaSagar
- 53
- 1
- 4
0
votes
1 answer
Can we restrict specific users to access some databases in Athena and remaining users should see the other databases?
Problem statement:
Can we restrict specific users to access some databases in Athena, while the remaining users see the other databases?
We have a data lake created in an S3 bucket. It is in the development stage.
The same S3 bucket data lake is used by end user…
0
votes
2 answers
Is there any function or way to retrieve table names from Snowflake in the order of referential integrity (FK) dependencies?
I would like to retrieve table names from a given schema, sorted by their foreign key dependencies. For example, if I have the following three tables created in Snowflake:
CREATE TABLE TAB_X
(
COL_A CHAR(18),
COL_B CHAR(18),
…

StewieGGriffin
- 349
- 1
- 4
- 14
0
votes
1 answer
Azure Data Lake Storage Gen2 label appearing as Containers and NOT File System
I just started with Data Lakes in Azure and encountered an issue with the ADLS Gen2 screens in the Azure Portal.
Using the Azure Portal, I created a new Storage account to set up a new Azure Data Lake Gen2 storage by following the online instructions.…

Govindarajan
- 33
- 5
0
votes
1 answer
Recommended ETL solution for big data coming from MySQL?
I have a situation where a third party stores data in daily tables, where if the record count exceeds two million a subsequent table is created, and so on, named [date]_x.
Now, we have a reporting requirement and need to consume this data. Manual…

ElHaix
- 12,846
- 27
- 115
- 203
0
votes
2 answers
Searching through data stored in Azure Data Lake
I have the following use case for building a Data Lake (e.g. in Azure):
My organization deals with companies that go into bankruptcy. Once a company goes bankrupt, it needs to hand over all of its data to us, including structured data (e.g. CSVs)…

RobW
- 128
- 2
- 10
0
votes
1 answer
Design data provisioning strategy for big data system?
I'm designing a data provisioning module in a big data system. Data provisioning is described as:
The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure…

Dong Nguyen Chi
- 21
- 2
0
votes
2 answers
Create a file in DataLake and make it expire after 2 minutes
I am using Data Lake Gen 1 and I would like to create a file and set the expiration time to be 2 minutes after creation.
I am using this method:
public virtual System.Threading.Tasks.Task SetExpiryTimeAsync (string path,…

Angela
- 477
- 1
- 10
- 20
0
votes
1 answer
AWS Glue append to Parquet file
I am currently in the process of designing an AWS-backed Data Lake.
What I have right now:
XML files uploaded to S3
AWS Glue crawler builds the catalogue
AWS ETL job transforms data and saves it in the Parquet format.
Each time the ETL job transforms the…

SirKometa
- 1,857
- 3
- 16
- 26
0
votes
1 answer
Do I track changes to my data in a data lake?
I've recently been discovering the data lake world, and I'm planning on setting up a data lake with ADL. One of the things I'm not sure about is how a data lake is supposed to track changes over time and handle different versions from a source.
I've come across site…

Remco
- 172
- 9
0
votes
1 answer
How should I partition data for AWS Athena (Presto) if I don't know the queries in advance?
I have big data events (TBs) that I need to query, and I am trying to partition them correctly.
I have clients, and each client has many games.
The problem is that there are fields we query for that might be null in some events, therefore they cannot be used as…

Ido Barash
- 4,856
- 11
- 41
- 78
0
votes
1 answer
Terraform: Separate modules vs. one big project
I'm working on a data lake project composed of many services: 1 VPC (+ subnets, security groups, internet gateway, ...), S3 buckets, EMR cluster, Redshift, ElasticSearch, some Lambda functions, API Gateway and RDS.
We can say that some resources are…

user1297406
- 1,241
- 1
- 18
- 36
0
votes
1 answer
Service to Support Data Lake Set Up
I have to test and compare the available solutions to create a Data Lake.
Is there any other service that makes it easy to set up a secure data lake besides AWS Lake Formation?
I know that I can create an account on Azure and Google Cloud…

Maria Luiza
- 27
- 5
0
votes
2 answers
Do data lakes offered by Amazon and Azure support rosbag files?
I have a lot of camera, radar and lidar data stored on a hard disk in rosbag format. I would now like to shift this to a cloud service, preferably preserving the rosbag format.
From what I understand in data lakes you can store…

power.puffed
- 357
- 2
- 10