Questions tagged [data-lake]

161 questions
1
vote
1 answer

Trino on pure AWS S3

Is it possible to run Trino on top of pure AWS S3 without any other additional engine? In the Trino connectors there is no S3, but in the docs it is mentioned it could be run over S3 or e.g. Hive. So do I need some layer over S3 such as Hadoop/Hive…
romanzdk
  • 930
  • 11
  • 30
1
vote
1 answer

How to rewrite Apache Iceberg data files to another format?

I'd like to use the Apache Iceberg Apache Spark-Java based API for rewriting data files on my Iceberg table. I'm writing my data files in an Avro format, but I'd like to rewrite them to Parquet. Is it possible in a somewhat easy way? I've researched…
1
vote
2 answers

Dask writing into multiple parquet files by key

I have a very large dataset on disk as a csv file. I would like to load this into dask, do some cleaning, and then save the data for each value of date into a separate file/folder, as follows:
.
└── test
    └── 20211201
        └── part.0.parquet
        …
Nezo
  • 567
  • 4
  • 18
1
vote
1 answer

SSIS Runs perfectly on a remote server(Greenplum) Datalake but takes 8+ hours

An SSIS package performs the ETL on a remote server (Greenplum environment). It runs fine but takes 8+ hours to complete. The data in the remote server's interaction tables is massive (~1 billion rows each). Is there a way or any option available on SSIS…
1
vote
1 answer

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables' or to be more specific, delta tables using a statement such as the following, DROP TABLE IF EXISTS People10M; CREATE TABLE People10M USING parquet OPTIONS ( path…
Minura Punchihewa
  • 1,498
  • 1
  • 12
  • 35
1
vote
1 answer

MWAA Airflow no_status for some specific tasks

I am using MWAA Airflow 1.10 and the tasks do not start, even though the last ones are successful. I do not see any problems in the logs or anything else.
Gabriel Lopes
  • 23
  • 1
  • 3
1
vote
0 answers

AWS Lake Formation - Metadata access control vs Data location permissions

Quoting Lake Formation Access Control Overview : Metadata access control – Permissions on Data Catalog resources (Data Catalog permissions). These permissions enable principals to create, read, update, and delete metadata databases and tables in…
justHelloWorld
  • 6,478
  • 8
  • 58
  • 138
1
vote
0 answers

How to create and maintain mongodb atlas datalake programmatically?

I want to create and maintain a MongoDB Atlas data lake programmatically, but it seems there is no option available. I found one API that can be used to create/update/delete a data lake, but it only allows setting some options. Here is the link…
Ashish Modi
  • 7,529
  • 2
  • 20
  • 35
1
vote
2 answers

Dbeaver doesn't display metadata from one of our hive instances. How to fix?

We use DBeaver to connect to our Hive data lake. I've found a very strange behavior. We have a test and a production data lake. In our test data lake, it correctly displays the tables' metadata in the Project tab (the left column). Here it is: But…
neves
  • 33,186
  • 27
  • 159
  • 192
1
vote
0 answers

How to process many tables using AWS Glue

As part of data validation, I have a use case that involves processing many tables. The number of tables is almost 2000. Due to a tight SLA, there is now a need to process many tables concurrently. Due to the Glue concurrency limit of 50 (which I got increased to 100…
1
vote
1 answer

Data Lake with Kimball's Star Schema and Data Mart

Objective: I'm a little bit confused by the terminology. I've built a Data Lake (not a DW) based on Kimball's data modeling approaches, and now I'm not sure if I can use the Data Mart definition to name my MPP database layer. I came from the assumption that you still…
VB_
  • 45,112
  • 42
  • 145
  • 293
1
vote
0 answers

Keeping track of Datalake schemas

I have a general question about keeping track of schemas in a data lake. In various logs, I have some fields which exist in every log. There are other fields which differ by log type. My team has a consensus to only add fields, and not delete existing…
Piljae Chae
  • 987
  • 10
  • 23
1
vote
1 answer

Implementing cdc and deduplication on AWS

I want to build a Data Lake in AWS S3 and am asking myself how to work with CDC. I want to avoid loading the whole data from the sources, and furthermore I want to avoid duplicates in the target. Are there any proven methodologies for tackling that?
1
vote
1 answer

Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction, there can be multiple rows in the table. I have another column called visit_date. There is a category column called…
krishna koti
  • 659
  • 1
  • 6
  • 10
1
vote
0 answers

Data Lake governance tools

I am seeking advice on the data governance toolset(s) you currently use for your data lake, and your thoughts about those tools, for: managing data models (ingress/at rest/egress); tracking data lineage (who is using what fields?); migration changes.
amp123
  • 43
  • 5