Highest Voted 'data-lake' Questions

1

vote

1 answer

Trino on pure AWS S3

Is it possible to run Trino on top of pure AWS S3 without any additional engine? There is no S3 connector among the Trino connectors, but the docs mention that it can run over S3 or, e.g., Hive. So do I need some layer over S3 such as Hadoop/Hive…

asked Dec 08 '22 at 16:54

romanzdk

930
11
30

1

vote

1 answer

How to rewrite Apache Iceberg data files to another format?

I'd like to use the Apache Iceberg Apache Spark-Java based API for rewriting data files on my Iceberg table. I'm writing my data files in an Avro format, but I'd like to rewrite them to Parquet. Is it possible in a somewhat easy way? I've researched…

java apache-spark data-lake iceberg apache-iceberg

asked Nov 13 '22 at 23:22

apache-northeast

11
1

1

vote

2 answers

Dask writing into multiple parquet files by key

I have a very large dataset on disk as a csv file. I would like to load this into dask, do some cleaning, and then save the data for each value of date into a separate file/folder, as follows: . └── test └── 20211201 └── part.0.parquet …

python dask parquet data-lake

asked Oct 05 '22 at 01:40

Nezo

567
4
18

1

vote

1 answer

SSIS runs perfectly on a remote server (Greenplum) data lake but takes 8+ hours

An SSIS package performs the ETL on a remote server (Greenplum environment). It runs fine but takes 8+ hours to complete. The data in the remote server's interaction tables is massive (~1 billion rows each). Is there a way or any option available on SSIS…

ssis greenplum data-lake

asked Jun 25 '22 at 08:56

kashif ashraf

9
3

1

vote

1 answer

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, delta tables, using a statement such as the following: DROP TABLE IF EXISTS People10M; CREATE TABLE People10M USING parquet OPTIONS ( path…

databricks delta-lake data-lake

asked Apr 28 '22 at 19:11

Minura Punchihewa

1,498
1
12
35

1

vote

1 answer

MWAA Airflow no_status for some specific tasks

I am using MWAA Airflow 1.10 and the tasks do not start, even though the previous ones are successful. I do not see any problems in the logs or anywhere else.

python data-lake mwaa

asked Feb 01 '22 at 13:51

Gabriel Lopes

23
1
3

1

vote

0 answers

AWS Lake Formation - Metadata access control vs Data location permissions

Quoting Lake Formation Access Control Overview : Metadata access control – Permissions on Data Catalog resources (Data Catalog permissions). These permissions enable principals to create, read, update, and delete metadata databases and tables in…

amazon-web-services data-lake aws-lake-formation

asked Sep 17 '21 at 14:19

justHelloWorld

6,478
8
58
138

1

vote

0 answers

How to create and maintain mongodb atlas datalake programmatically?

I want to create and maintain a MongoDB Atlas data lake programmatically, but it seems there is no option available. I could find one API which can be used to create/update/delete a data lake, but it only allows setting some options. Here is the link…

mongodb data-lake atlas-data-lake

asked Jul 26 '21 at 04:25

Ashish Modi

7,529
2
20
35

1

vote

2 answers

DBeaver doesn't display metadata from one of our Hive instances. How to fix?

We use DBeaver to connect to our Hive data lake. I've found a very strange behavior. We have a test and a production data lake. In our test data lake it correctly displays the tables' metadata in the Project tab (the left column). Here it is: But…

hive dbeaver database-metadata data-lake

asked Nov 26 '20 at 00:11

neves

33,186
27
159
192

1

vote

0 answers

How to process many tables using AWS Glue

As part of doing data validation I have a use case of processing many tables. The number of tables is almost 2000. Due to a tight SLA there is now a need to process many tables concurrently. Due to the Glue concurrency limit of 50 (which I got increased to 100…

amazon-web-services aws-glue data-lake aws-glue-spark

asked Nov 03 '20 at 01:53

Ankur Shrivastava

223
4
14

1

vote

1 answer

Data Lake with Kimball's Star Schema and Data Mart

Objective: I am a little bit confused by the terminology. I've built a Data Lake (not a DW) based on Kimball's data modeling approaches and am now not sure if I can use the Data Mart definition to name my MPP database layer. I came from the assumption that you still…

database-design architecture data-warehouse databricks data-lake

asked Sep 15 '20 at 08:03

VB_

45,112
42
145
293

1

vote

0 answers

Keeping track of Datalake schemas

I have a general question about keeping track of schemas in a data lake. In various logs, I have some fields which exist in every log. There are other fields which differ by log type. My team has a consensus to only add fields, and not delete existing…

data-lake

asked Jul 04 '20 at 11:58

Piljae Chae

987
10
23

1

vote

1 answer

Implementing CDC and deduplication on AWS

I want to build a Data Lake in AWS S3 and am asking myself how to work with CDC. I want to avoid loading the whole data from the sources and, furthermore, I want to avoid duplicates in the target. Are there any proven methodologies for tackling that?
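
One common way to think about this question is as an idempotent merge: each change event carries a primary key and a monotonically increasing sequence number, and the target keeps only the newest version per key. The following is a minimal in-memory sketch of that idea, not an AWS-specific implementation. The event fields `id`, `seq`, `op`, and `row`, and the function names `apply_cdc` and `live_rows`, are assumptions made for illustration and do not come from the question.

```python
def apply_cdc(target, events):
    """Merge change events into `target`, a dict keyed by primary key.

    Each event is a dict with hypothetical fields:
      id  - primary key of the source row
      seq - monotonically increasing change sequence number
      op  - "upsert" or "delete"
      row - the new row contents (None for deletes)
    Events that are duplicates or older than the stored version are skipped,
    so replaying the same batch does not create duplicates.
    """
    for event in sorted(events, key=lambda e: e["seq"]):
        key = event["id"]
        current = target.get(key)
        if current is not None and current["seq"] >= event["seq"]:
            continue  # duplicate or stale event, ignore it
        # Deletes are kept as tombstones so a late, older upsert
        # cannot resurrect a row that was already deleted.
        target[key] = {
            "seq": event["seq"],
            "deleted": event["op"] == "delete",
            "row": event["row"],
        }
    return target


def live_rows(target):
    """Return only the rows that have not been deleted."""
    return {k: v["row"] for k, v in target.items() if not v["deleted"]}


if __name__ == "__main__":
    batch = [
        {"id": 1, "seq": 2, "op": "upsert", "row": {"name": "b"}},
        {"id": 1, "seq": 1, "op": "upsert", "row": {"name": "a"}},
        {"id": 2, "seq": 3, "op": "upsert", "row": {"name": "c"}},
        {"id": 2, "seq": 4, "op": "delete", "row": None},
    ]
    state = apply_cdc({}, batch)
    print(live_rows(state))
```

Because only changed rows are processed and replays are no-ops, this avoids both full reloads and duplicates. In a real data lake the same merge-by-key logic would typically be delegated to a table format or engine that supports upserts.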

amazon-s3 duplicates etl cdc data-lake

asked May 23 '20 at 04:51

yavuz özsöz

11
2

1

vote

1 answer

Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another column called visit_date. There is a category column called…
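
Since a single transaction spans several rows, a plain row count per month would overcount, so the usual approach is to count distinct transaction identifiers. Below is a small sketch of that idea using Python's built-in `sqlite3` rather than Hive. The table name `purchases` and the columns `transaction_id` and `category` are assumptions for illustration (only `visit_date` is named in the question), and dates are assumed to be `YYYY-MM-DD` strings.

```python
import sqlite3


def monthly_transaction_counts(rows):
    """Count distinct transactions per (month, category).

    `rows` is an iterable of (transaction_id, visit_date, category) tuples,
    one tuple per purchased item, with visit_date as 'YYYY-MM-DD'.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases ("
        "transaction_id TEXT, visit_date TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT substr(visit_date, 1, 7) AS month,
               category,
               COUNT(DISTINCT transaction_id) AS transactions
        FROM purchases
        GROUP BY substr(visit_date, 1, 7), category
        ORDER BY month, category
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    items = [
        ("t1", "2020-01-05", "grocery"),
        ("t1", "2020-01-05", "grocery"),  # second item of the same transaction
        ("t2", "2020-01-20", "grocery"),
        ("t3", "2020-02-01", "toys"),
    ]
    for row in monthly_transaction_counts(items):
        print(row)
```

The `COUNT(DISTINCT ...)` with a month expression in the `GROUP BY` carries over to HiveQL in spirit, though the exact date functions and column names would need to match the real table.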

sql hive hiveql data-lake

asked Apr 13 '20 at 20:01

krishna koti

659
1
6
10

1

vote

0 answers

Data Lake governance tools

I am seeking advice on the data governance toolset(s) you currently use for your data lake, and your thoughts about those tools: managing data models (ingress/at rest/egress); tracking data lineage (who is using what fields?); migration changes.

data-lake data-governance

asked Apr 09 '20 at 17:06

amp123

43
5
