Questions tagged [data-lake]

161 questions
0
votes
1 answer

AWS Glue ETL Job getting final dataFrame with Join.apply Vs SQL JOIN Query

I am fairly new to AWS and am currently exploring it. I was hoping to get some insight or suggestions on the best way to implement the job. I want to get data from multiple MySQL tables: user_transaction, user_loans, promo_offers. To get the final…
Vivek Raskar
  • 711
  • 5
  • 4
0
votes
1 answer

ADLS to Azure Storage Sync Using AzCopy

Looking for some help to resolve the errors I'm facing. Let me explain the scenario. I'm trying to sync one of the ADLS Gen2 containers to Azure Blob Storage. I have AzCopy 10.4.3, and I'm using azcopy sync to do this. I'm using the command below: azcopy…
VidhyaSagar
  • 53
  • 1
  • 4
0
votes
1 answer

Can we restrict specific users to some databases in Athena while the remaining users see the other databases?

Problem statement: Can we restrict specific users to some databases in Athena while the remaining users see the other databases? We have a data lake created in an S3 bucket. It is in the development stage. The same S3 bucket data lake is used by end user…
0
votes
2 answers

Is there any function or way to retrieve table names from Snowflake in the order of referential integrity (FK) dependencies?

I would like to retrieve table names from a given schema, sorted by their foreign key dependencies. For example, if I have the following three tables created in Snowflake: CREATE TABLE TAB_X ( COL_A CHAR(18), COL_B CHAR(18), …
0
votes
1 answer

Azure Data Lake Storage Gen2 label appearing as Containers and NOT File System

I just started with Data Lakes in Azure and encountered an issue with the ADLS Gen2 screens in the Azure Portal. Using the Azure Portal, I created a new Storage account to set up a new Azure Data Lake Gen2 storage by following the online instructions.…
0
votes
1 answer

Recommended ETL solution for big data coming from MySQL?

I have a situation where a third party stores data in daily tables, where if the record count exceeds two million a subsequent table is created, and so on, named [date]_x. Now, we have a reporting requirement and need to consume this data. Manual…
ElHaix
  • 12,846
  • 27
  • 115
  • 203
0
votes
2 answers

Searching through data stored in Azure Data Lake

I have the following use case for building a Data Lake (e.g. in Azure): My organization deals with companies that go into bankruptcy. Once a company goes bankrupt, it needs to hand over all of its data to us, including structured data (e.g. CSVs)…
RobW
  • 128
  • 2
  • 10
0
votes
1 answer

Design data provisioning strategy for big data system?

I'm designing a data provisioning module in a big data system. Data provisioning is described as: The process of providing the data from the Data Lake to downstream systems is referred to as Data Provisioning; it provides data consumers with secure…
0
votes
2 answers

Create a file in DataLake and make it expire after 2 minutes

I am using Data Lake Gen 1 and I would like to create a file and set the expiration time to be 2 minutes after creation. I am using this method: public virtual System.Threading.Tasks.Task SetExpiryTimeAsync (string path,…
Angela
  • 477
  • 1
  • 10
  • 20
0
votes
1 answer

AWS Glue append to parquet file

I am currently in the process of designing an AWS-backed Data Lake. What I have right now: XML files uploaded to S3; an AWS Glue crawler builds the catalogue; an AWS ETL job transforms the data and saves it in the Parquet format. Each time the ETL job transforms the…
SirKometa
  • 1,857
  • 3
  • 16
  • 26
0
votes
1 answer

Do I track changes to my data in a data lake?

Recently I've been discovering the data lake world, and I'm planning on setting up a data lake with ADL. One of the things I'm not sure about is how a data lake is supposed to track changes over time or handle different versions from a source. I've come across site…
Remco
  • 172
  • 9
0
votes
1 answer

How should I partition data for AWS Athena (Presto) if I don't know the queries in advance?

I have big data events (TBs) that I need to query, and I am trying to partition them correctly. I have clients, and each client has many games. The problem is that there are fields we query for that might be null in some events, therefore they cannot be used as…
Ido Barash
  • 4,856
  • 11
  • 41
  • 78
0
votes
1 answer

Terraform: Separate modules VS one big project

I'm working on a data lake project composed of many services: 1 VPC (+ subnets, security groups, internet gateway, ...), S3 buckets, EMR cluster, Redshift, ElasticSearch, some Lambda functions, API Gateway and RDS. We can say that some resources are…
user1297406
  • 1,241
  • 1
  • 18
  • 36
0
votes
1 answer

Service to Support Data Lake Set Up

I have to test and compare the available solutions to create a Data Lake. Is there any other service that makes it easy to set up a secure data lake besides AWS Lake Formation? I know that I can create an account on Azure and Google Cloud…
0
votes
2 answers

Do data lakes offered by Amazon and Azure support rosbag files?

I have a lot of camera, radar and lidar data stored on a hard disk in rosbag format. I would now like to move this to a cloud service, preferably preserving the rosbag format. From what I understand, in data lakes you can store…
power.puffed
  • 357
  • 2
  • 10