Questions tagged [apache-iceberg]

Apache Iceberg (often referred to simply as Iceberg) is a high-performance table format for large analytic datasets. It allows SQL tables to be used by analytics engines such as Apache Spark, Apache Flink, Apache Hive, Trino, PrestoDB, Impala, StarRocks, Doris, and Pig.

68 questions
0
votes
0 answers

Spark SQL SELECT works but INSERT fails due to Spark issue

I am connecting to Iceberg running in Spark and querying tables. I am able to SELECT and INSERT from spark-shell, but I face issues when executing the same statements from Java code: SparkConf sparkConf = new SparkConf().setAppName("my…
ritratt
  • 1,703
  • 4
  • 25
  • 45
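One common cause when spark-shell works but a standalone Java program does not is that the shell was launched with Iceberg's extension and catalog settings (--conf / --packages) that the Java application never sets. A minimal sketch of a self-contained Java program carrying those settings; the catalog name local, the Hadoop warehouse path, and the table name are assumptions, not details from the question.

import org.apache.spark.sql.SparkSession;

public class IcebergInsertExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("my-iceberg-app")
                .master("local[*]")
                // Same settings normally passed to spark-shell via --conf
                .config("spark.sql.extensions",
                        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.local.type", "hadoop")
                .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
                .getOrCreate();

        // SELECT and INSERT against the Iceberg catalog configured above
        spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, name STRING) USING iceberg");
        spark.sql("INSERT INTO local.db.events VALUES (1, 'first'), (2, 'second')");
        spark.sql("SELECT * FROM local.db.events").show();

        spark.stop();
    }
}

If the application is submitted with spark-submit, the same four config keys can be passed on the command line instead of the builder.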
0
votes
0 answers

Iceberg - Spark - Hive in Spark Thrift Server: clarification needed

We are using a Spark Thrift server with the Iceberg table format, Parquet data files, and Spark as the execution engine. 1. When I submit a SQL statement to the Spark Thrift server, what type of statement is it? Is it Spark SQL or HiveQL? Because I saw…
0
votes
0 answers

List the files and file versions in a particular snapshot of an Iceberg table

I want to list the files and file versions in a particular snapshot of an Iceberg table. I used time-travel queries such as: SELECT * FROM iceberg_table FOR VERSION AS OF 949530903748831860. However, this lists all the versions of the file…
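A sketch of one way to get a per-snapshot file listing, assuming the placeholder catalog/table names shown and an Iceberg/Spark combination recent enough to allow time travel over metadata tables; FOR VERSION AS OF on the table itself returns data rows, whereas the files metadata table returns the file entries.

import org.apache.spark.sql.SparkSession;

public class SnapshotFilesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("list-snapshot-files")
                .getOrCreate();

        // All snapshots of the table, to find the snapshot id of interest
        spark.sql("SELECT snapshot_id, committed_at, operation " +
                  "FROM my_catalog.db.iceberg_table.snapshots").show(false);

        // Data files that belong to one specific snapshot
        spark.sql("SELECT file_path, record_count, file_size_in_bytes " +
                  "FROM my_catalog.db.iceberg_table.files VERSION AS OF 949530903748831860").show(false);

        spark.stop();
    }
}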
0
votes
0 answers

Spark Structured Streaming + Apache Iceberg: how can appends be idempotent?

I'm using Spark Structured Streaming to append to a partitioned Iceberg table. I need to use foreachBatch or foreach because I'm using a custom Iceberg catalog implementation (the one from Google BigLake). The Spark docs say foreachBatch is at-least-once, which means it…
nir
  • 3,743
  • 4
  • 39
  • 63
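One commonly used way to make the write idempotent despite foreachBatch's at-least-once contract is to replace the blind append with a MERGE keyed on a unique event id, so a replayed micro-batch inserts nothing new. A minimal sketch; the rate source, the table cat.db.events, the event_id key, and the assumption that the Iceberg catalog and SQL extensions are already configured on the session are all placeholders rather than details from the question.

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IdempotentStreamWrite {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("idempotent-append").getOrCreate();

        // Placeholder source; the real job would read its actual stream instead.
        Dataset<Row> stream = spark.readStream().format("rate").load()
                .withColumnRenamed("value", "event_id");

        stream.writeStream()
              .option("checkpointLocation", "/tmp/chk/events")
              .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
                  batch.createOrReplaceTempView("batch_updates");
                  // A replayed micro-batch finds its rows already present and inserts nothing,
                  // so the write is effectively idempotent even though foreachBatch is at-least-once.
                  batch.sparkSession().sql(
                      "MERGE INTO cat.db.events t " +
                      "USING batch_updates s ON t.event_id = s.event_id " +
                      "WHEN NOT MATCHED THEN INSERT *");
              })
              .start()
              .awaitTermination();
    }
}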
0
votes
0 answers

Cannot add data files to target table because that table is partitioned and contains non-identity partition transforms which will not be compatible

I am integrating Iceberg with Spark. I tried to create the table test partitioned by hours(END_TIME): create table local.db.test ( MSISDN string, START_TIME timestamp, END_TIME timestamp ) USING iceberg PARTITIONED BY (hours(END_TIME)); then I…
Elsayed
  • 2,712
  • 7
  • 28
  • 41
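For what it's worth, that error message is, as far as I know, raised by Iceberg's file-import path (for example the add_files procedure), which cannot map pre-existing files onto a hidden partition transform such as hours(END_TIME). Writing through the Iceberg writer itself derives the hour partition automatically. A sketch using the DDL from the question plus an assumed staging source table:

import org.apache.spark.sql.SparkSession;

public class HiddenPartitionWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hours-partition").getOrCreate();

        spark.sql("CREATE TABLE IF NOT EXISTS local.db.test (" +
                  "  MSISDN string, START_TIME timestamp, END_TIME timestamp) " +
                  "USING iceberg PARTITIONED BY (hours(END_TIME))");

        // Writing through Iceberg lets it derive the hour partition from END_TIME;
        // no explicit partition column is needed in the INSERT.
        spark.sql("INSERT INTO local.db.test " +
                  "SELECT MSISDN, START_TIME, END_TIME FROM staging.calls");

        spark.stop();
    }
}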
0
votes
0 answers

Apache Iceberg: insert into/merge into/insert overwrite VS MOR/COW

I am currently learning Iceberg. I understand MOR and COW: in MOR, delete files are created to track updates/deletes; in COW, old data files are copied into new data files and the deletes/updates are written into the new files. But I have some…
BAE
  • 8,550
  • 22
  • 88
  • 171
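The two concepts are orthogonal in the following sense: a plain INSERT INTO only appends new data files, so MOR vs COW does not come into play; the write.*.mode table properties decide, per row-level operation (DELETE, UPDATE, MERGE), whether Iceberg writes delete files (merge-on-read) or rewrites the affected data files (copy-on-write); and INSERT OVERWRITE replaces whole files/partitions rather than using delete files. A minimal sketch of setting those properties (the table name is a placeholder):

import org.apache.spark.sql.SparkSession;

public class WriteModeProperties {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("mor-vs-cow").getOrCreate();

        spark.sql("ALTER TABLE cat.db.events SET TBLPROPERTIES (" +
                  "  'write.delete.mode' = 'merge-on-read'," +   // DELETE writes delete files
                  "  'write.update.mode' = 'merge-on-read'," +   // UPDATE writes delete files plus new rows
                  "  'write.merge.mode'  = 'copy-on-write'" +    // MERGE rewrites the affected data files
                  ")");

        spark.stop();
    }
}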
0
votes
0 answers

Iceberg - how to avoid full table scan with bigint partition key

I have a Product and Order table with these schemas: Product: ( id: bigint, created_date: timestamp ) USING iceberg PARTITION BY (id) Order: ( order_id: bigint, product_id: bigint, ts: timestamp ) USING iceberg PARTITION BY day(ts) when I do Order…
huwng
  • 61
  • 2
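A hedged sketch of one workaround: pruning on Order comes from a literal predicate on ts (the source column of day(ts)), and pruning on Product, which is identity-partitioned by id, needs the id values to appear as literal predicates too, so the ids can be collected first and pushed back as an IN filter. Table and column names are adapted from the question; this two-step pattern only makes sense when the distinct product_id set is small, and newer Spark/Iceberg combinations may achieve similar pruning through runtime filtering without it.

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class PartitionPruneJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("prune-join").getOrCreate();

        // day(ts) pruning on Order comes from a literal predicate on ts.
        Dataset<Row> weekOrders = spark.table("cat.db.orders")
                .where(col("ts").geq("2024-01-01").and(col("ts").lt("2024-01-08")));

        // Collect the product ids the join actually needs, then filter Product with
        // literal values so identity(id) partition pruning can kick in.
        List<Object> ids = weekOrders.select("product_id").distinct()
                .collectAsList().stream()
                .map(r -> r.get(0))
                .collect(java.util.stream.Collectors.toList());

        Dataset<Row> products = spark.table("cat.db.products")
                .where(col("id").isin(ids.toArray()));

        weekOrders.join(products, weekOrders.col("product_id").equalTo(products.col("id")))
                  .show();

        spark.stop();
    }
}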
0
votes
0 answers

Conflicting delete files error when running concurrent updates on an Iceberg table

When running 2 concurrent updates on the same partition of an Iceberg table using Spark, I get the following error: Found new conflicting delete files that can apply to records matching .... The updates are on two different entries in the partition…
CS1999
  • 23
  • 5
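The knobs usually mentioned around this error are the commit retry properties and the row-level isolation level; whether they help depends on whether the two updates really touch disjoint files, so treat this as a sketch of where those settings live rather than a guaranteed fix (the table name is a placeholder).

import org.apache.spark.sql.SparkSession;

public class ConcurrentUpdateSettings {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("concurrent-updates").getOrCreate();

        spark.sql("ALTER TABLE cat.db.events SET TBLPROPERTIES (" +
                  "  'commit.retry.num-retries' = '10'," +            // retry the commit on conflict
                  "  'write.update.isolation-level' = 'snapshot'" +   // relax from 'serializable' only if acceptable
                  ")");

        spark.stop();
    }
}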
0
votes
1 answer

AWS Glue error when reading from Data Catalog with Iceberg (S3Exception: Access Denied)

I had this problem when trying to read from the Glue Data Catalog. The complete exception was software.amazon.awssdk.services.s3.model.S3Exception: Access Denied. I checked my IAM permissions but realized that wasn't the issue.
0
votes
0 answers

What is the benefit of using the Delta Lake or Iceberg table format?

We currently store data on S3 in Parquet format and use the AWS Glue Data Catalog to store table metadata. We add partitions by date or hour. Most of our queries are read-only. I am wondering what benefits we can get from…
yuyang
  • 1,511
  • 2
  • 15
  • 40
0
votes
0 answers

Apache Iceberg table from Spark Explain Plan

How can we check whether a query is accessing partitions efficiently? Is there a way to run an explain plan for an Iceberg table? Example: I have created an Iceberg table partitioned on month(tpep_pickup_datetime). The query I'm…
sho
  • 176
  • 2
  • 12
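EXPLAIN works on Iceberg tables the same way as on any other Spark table, and the scan node of the plan lists the filters that were pushed down, which is the usual way to confirm that the month(tpep_pickup_datetime) partitioning is being used. A minimal sketch with a placeholder table name:

import org.apache.spark.sql.SparkSession;

public class ExplainPartitionPruning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("explain-iceberg").getOrCreate();

        // The pushed-down predicate on tpep_pickup_datetime should show up in the scan node.
        spark.sql("EXPLAIN EXTENDED " +
                  "SELECT count(*) FROM cat.nyc.taxis " +
                  "WHERE tpep_pickup_datetime >= TIMESTAMP '2023-01-01 00:00:00' " +
                  "  AND tpep_pickup_datetime <  TIMESTAMP '2023-02-01 00:00:00'")
             .show(false);

        // The files metadata table is another way to sanity-check which partitions hold data.
        spark.sql("SELECT partition, file_path, record_count FROM cat.nyc.taxis.files").show(false);

        spark.stop();
    }
}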
0
votes
1 answer

AWS Glue S3Exception on MERGE INTO query

I'm pretty new to working with Glue jobs and I encountered this problem. I have 2 Glue ETL jobs. The first one processes a full export from a DynamoDB table, transforms and partitions the data, and writes it to an Iceberg table. The second one takes the latest CDC from…
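The S3Exception itself most likely points at the storage/permissions layer rather than the query, but for reference, the CDC step described in the question is usually expressed as a single MERGE INTO. A sketch with assumed table names, key column, and delete flag:

import org.apache.spark.sql.SparkSession;

public class CdcMergeExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("cdc-merge").getOrCreate();

        // Upsert the latest CDC rows into the Iceberg table written by the full-export job.
        spark.sql("MERGE INTO glue_catalog.db.items t " +
                  "USING glue_catalog.db.items_cdc s " +
                  "ON t.item_id = s.item_id " +
                  "WHEN MATCHED AND s.cdc_deleted = true THEN DELETE " +
                  "WHEN MATCHED THEN UPDATE SET * " +
                  "WHEN NOT MATCHED THEN INSERT *");

        spark.stop();
    }
}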
0
votes
0 answers

FLINK + MINIO + ICEBERG: The bucket you are attempting to access must be addressed using the specified endpoint

Synopsis: I can't successfully write the metadata.json to either the local file system or MinIO in the environment. Using the latest AWS SDK, I get an error asking for the AWS region in the URI despite using MinIO. I've tried: defining the AWS_REGION,…
Alex Merced
  • 118
  • 1
  • 10
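That particular S3 error usually means the AWS SDK is still addressing the real AWS endpoint instead of MinIO. With Iceberg's S3FileIO the endpoint, path-style access, credentials, and a nominal region can all be given as catalog properties; a sketch for Flink SQL executed from Java, where the Hive metastore URI, bucket, and credentials are placeholders for whatever the actual environment uses.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkMinioIcebergCatalog {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql(
            "CREATE CATALOG minio_catalog WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hive'," +                            // placeholder catalog backend
            "  'uri' = 'thrift://hive-metastore:9083'," +
            "  'warehouse' = 's3://warehouse/'," +
            "  'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO'," +
            "  's3.endpoint' = 'http://minio:9000'," +                // send requests to MinIO, not AWS
            "  's3.path-style-access' = 'true'," +                    // MinIO buckets are not DNS-addressed
            "  's3.access-key-id' = 'minioadmin'," +
            "  's3.secret-access-key' = 'minioadmin'," +
            "  'client.region' = 'us-east-1'" +                       // any value satisfies the SDK's region check
            ")");
    }
}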
0
votes
0 answers

Dynamic Partition Overwrite in Apache Iceberg

I am trying to learn Apache Iceberg for building a data lake. We have late-arriving data, and the data is partitioned on a date column. I will have a Spark job that transforms the incoming data to Iceberg format. Consider a scenario where the…
Rohit Anil
  • 236
  • 1
  • 11
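For late-arriving data the usual Iceberg pattern is a dynamic overwrite that replaces only the partitions present in the incoming batch and leaves every other partition untouched. A minimal sketch with the DataFrameWriterV2 API (source path and table name are placeholders); the SQL equivalent is INSERT OVERWRITE with spark.sql.sources.partitionOverwriteMode set to dynamic.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DynamicPartitionOverwrite {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("dynamic-overwrite").getOrCreate();

        Dataset<Row> incoming = spark.read().parquet("s3://landing/events/");

        // Replaces only the date partitions that appear in `incoming`;
        // partitions not present in the batch keep their existing data.
        incoming.writeTo("cat.db.events").overwritePartitions();

        spark.stop();
    }
}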
0
votes
0 answers

Spark SQL Lag function returns null

I'm using Apache Iceberg with Spark SQL. For some reason, the SQL returns null for my lag function. My theory is that, behind the scenes, Spark parallelises the task so the data becomes too small to produce any…
user6308605
  • 693
  • 8
  • 26
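lag() returns null for the first row of each window partition by definition, and the result of a window function depends only on its partitionBy/orderBy clause, not on how many tasks Spark splits the data into (the rows of one window partition are shuffled together before the function runs). A small self-contained sketch with invented columns that shows where the nulls legitimately appear:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lag;

public class LagExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lag-example").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.sql(
            "SELECT * FROM VALUES ('a', 1, 10.0), ('a', 2, 12.5), ('b', 1, 7.0) " +
            "AS t(device, seq, reading)");

        WindowSpec w = Window.partitionBy("device").orderBy(col("seq"));

        // The first row per device has no previous reading, hence null for that row only.
        df.withColumn("prev_reading", lag(col("reading"), 1).over(w)).show();

        spark.stop();
    }
}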