Questions tagged [data-lake]

161 questions
3
votes
2 answers

PowerShell -recursive in Azure Data Lake Store

Does anyone know how to list every file in a directory inside Data Lake Store, including its subdirectories? Apparently the -recursive instruction does not work as it does in a normal environment. I need to run this script in Azure Data Lake Store, (which runs…
Rafa
  • 443
  • 5
  • 14
2
votes
1 answer

Creating a data lake from a DynamoDB table

We have a service where a DynamoDB table (~50 GB) is our feature repository, which we use for real-time, online applications. We want to create a data lake from this table for historical data, model training and analytics insights. We want to guarantee…
2
votes
2 answers

Can Glue Crawler crawl Delta Lake files to create tables in the AWS Glue Data Catalog?

We have an existing infrastructure where we crawl S3 directories with AWS Glue crawlers. These S3 directories are created as part of an AWS data lake and written by a Spark job. Now in order to implement the delta feature, we were doing a…
2
votes
2 answers

Can the raw data layer of a Data Lake contain a Table?

All the Data Lake articles I have read on the web say that the landing area contains raw data in the form of files. But let us say, I am ingesting streaming data from some IoT devices. Can I then put this data directly into a Table (For example a…
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
2
votes
2 answers

Data Lake: fix corrupted files on ingestion vs. ETL

Objective: I'm building a data lake. The general flow looks like NiFi -> Storage -> ETL -> Storage -> Data Warehouse. The general rule for a data lake seems to be no pre-processing at the ingestion stage. All ongoing processing should happen at ETL, so you…
VB_
  • 45,112
  • 42
  • 145
  • 293
2
votes
2 answers

Querying across S3 buckets using Athena

I am trying to understand Data Lakes, and most examples show only simple use cases. What I want to understand is effectively 'join queries'. For example, I have files with product data (uploaded to S3-Product-Data) and a database with product annual…
Steve M
  • 89
  • 1
  • 2
  • 9
2
votes
0 answers

How can I have multiple partitions based on different parameters for my data in a data lake?

We're building a new Data Lake for a huge amount of data from various data sources, storing the data in Parquet format in Amazon S3 buckets. We're currently creating the partitions based on a particular field (e.g., Record-Creation-Time). So we're…
user2869520
  • 193
  • 1
  • 13
2
votes
2 answers

Building Data Lake from scratch

I am trying to build a "Data Lake" from scratch. I understand how a data lake works and what it is for; that is all over the internet. But when it comes to how to build one from scratch, I cannot find a source. I want to understand if: Data…
2
votes
1 answer

How does a Data Lake store data, and in what format?

I heard Data Lakes can store any kind of data: relational, NoSQL, pictures/images, Adobe PDF, Excel. How is the data stored: in a NoSQL format, or in a binary tree? Or does it just save it like a regular hard drive? If so, why don't they just call…
user10241913
2
votes
0 answers

Event Hub Capture puts files in the wrong place

I have an event hub with capture to Data Lake enabled. The file pattern is: data lake path: / mydata/{Namespace}-{EventHub}-{PartitionId}/{Year}-{Month}-{Day}/{Hour}-{Minute}-{Second}, e.g. mydata/mydatahubs-mydatahub-0/2018-3-12/11-54-14. It puts data…
Andrei
  • 42,814
  • 35
  • 154
  • 218
2
votes
0 answers

Capture data from a SCADA system into HDFS (Hadoop data lake) for analytics

I am looking for a way to capture PLC data from a SCADA application in real time and store it in HDFS (of a data lake) for analytics. If it can be done, what are the possible ways to do it? Any help or guidance would be appreciated.
Svk
  • 57
  • 6
2
votes
1 answer

How to move data from Azure Data Lake to a Windows virtual machine

I have a requirement to investigate moving files from Azure Data Lake Store folders to an Azure Windows virtual machine. Just wondering what my options are. I have looked at AzCopy, which looks like it might work, although I may need to shift the…
mitroberts
  • 193
  • 1
  • 16
2
votes
1 answer

AWS Data Lake Ingest

Do you need to ingest Excel and other proprietary formats using Glue, or allow Glue to crawl your S3 bucket, in order to use these data formats within your data lake? I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left…
1
vote
1 answer

deltastreamer.HoodieDeltaStreamer exception: Filesystem closed

I am using HoodieDeltaStreamer to connect to Kafka and store data in a Hudi table. Hudi version: 0.10.1; Spark: 3.2.4; Hadoop: 3.3.5. Only one spark-submit job is running. cmd: spark-submit --class…
Ankit Bansal
  • 2,162
  • 8
  • 42
  • 79
1
vote
1 answer

How to install the Delta Lake package in an on-premises environment?

I want to make a data lake for myself without using any cloud service. I now have a Debian server and I want to create this data lake with the Databricks solution, Delta Lake. All the samples I found set up Delta Lake in a cloud service. How can I do…
Tavakoli
  • 1,303
  • 3
  • 18
  • 36