Questions tagged [iceberg]

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Use this tag for any questions relating to support for or usage of Iceberg.

134 questions
1
vote
0 answers

Connecting Iceberg's JdbcCatalog to Spark session

I have a JdbcCatalog initialized with an H2 database in my local Java code. It is able to create Iceberg tables with the proper schema and partition spec. When I create a Spark session in the same class, it is unable to use the JdbcCatalog already created…
Ishan Das
  • 11
  • 3
1
vote
0 answers

How to insert comment in iceberg table?

Hi, I'm trying to put a comment on an Iceberg table in the Glue catalog, and I used it as follows: spark.sql(f"""CREATE EXTERNAL TABLE IF NOT EXISTS {schema_name}.{table_name}({columns}) USING iceberg COMMENT 'table…
1
vote
0 answers

Apache Iceberg bug MERGE INTO PySpark with UDF causes: `Cannot generate code for expression`

I have encountered a major bug with MERGE INTO in Spark when writing into Apache Iceberg format using a python UDF. The problem is that when the column that is used in the ON clause of MERGE INTO has been affected by a UDF, the merge throws an…
thijsvdp
  • 404
  • 3
  • 16
1
vote
0 answers

Presto on iceberg: query failed

The same query succeeds in Spark SQL, and the file also exists! exitERROR SplitRunner-4-42 com.facebook.presto.execution.executor.TaskExecutor Error processing Split…
zhenyu lee
  • 11
  • 1
1
vote
0 answers

Apache Iceberg Sort order id not being respected in Spark

Hi, I have been seeing some unexpected behavior related to the sort ordering of an Iceberg table. The problem is that I set up SORT ORDER correctly such that the partitions are ordered. However, it seems from the data that it does not respect this…
thijsvdp
  • 404
  • 3
  • 16
1
vote
0 answers

Iceberg with Hive Metastore does not create a catalog in Spark and uses default

I have been experiencing some (unexpected?) behavior where a catalog reference in Spark is not reflected in the Hive Metastore. I have followed the Spark configuration according to the documentation, which looks like it should create a new catalog…
thijsvdp
  • 404
  • 3
  • 16
1
vote
0 answers

Why is it required to use a new Spark Session after writing a streaming dataframe into an Iceberg table to show new changes?

If you use a Spark session to create an Iceberg table with Spark Scala in batch mode, and then run a writeStream process with a MERGE INTO operation, the new changes are not visible from the Spark session used for the batch process. You need to…
Emilio
  • 11
  • 1
1
vote
0 answers

Running spark job from AWS Lambda

I would like to get data from an Iceberg table using AWS Lambda. I was able to create all the code and containers, only to discover that AWS Lambda doesn't allow the process substitution that Spark uses…
Pawel
  • 93
  • 2
  • 7
1
vote
0 answers

How to copy an existing Glue table to an Iceberg format table with Athena?

I have a lot of JSON files in S3 which are updated frequently. Basically, I am doing CRUD operations in a data lake. Because Apache Iceberg can handle item-level manipulations, I would like to migrate my data to use Apache Iceberg as table…
Khan
  • 1,418
  • 1
  • 25
  • 49
1
vote
0 answers

Dynamic partition pruning not working in Spark

There are two tables: one big (T0), one small (T1). I run the code below and expect it to use DPP, but it does not: df = spark.table('T0').select('A', 'B', 'C') df1 = spark.table('T1').select('A') df.join(F.broadcast(df1), ['A']).explain() Then I do a…
Alex Loo
  • 73
  • 1
  • 1
  • 7
1
vote
1 answer

Does spark-sql query plan indicate which table partitions are used?

By looking at spark-sql plans, is there a way I can tell if a particular table (hive/iceberg) partition is being used or not? For example, we have a table that has 3 partitions, let's say A=A_VAL, B=B_VAL, C=C_VAL. By looking at the plan is there a…
hba
  • 7,406
  • 10
  • 63
  • 105
1
vote
0 answers

Apache Iceberg on GCS atomic rename

I have a Spark on Dataproc Serverless use case which requires reading and writing the Iceberg format on GCS. Reading through the documentation, I realized that I cannot use the Hadoop table catalog because GCS does not support atomic rename: A Hadoop catalog…
1
vote
1 answer

How to rewrite Apache Iceberg data files to another format?

I'd like to use the Apache Iceberg Java API for Spark to rewrite the data files of my Iceberg table. I'm writing my data files in Avro format, but I'd like to rewrite them to Parquet. Is there a reasonably easy way to do this? I've researched…
1
vote
1 answer

Iceberg table does not see the generated Parquet file

In my use case, the table in Iceberg format is created. It only receives APPEND operations as it is about recording events in a time series stream. To evaluate the use of the Iceberg format in this use-case, I created a simple Java program that…
João Paraná
  • 1,031
  • 1
  • 9
  • 18
1
vote
1 answer

Apache Iceberg Schema Evolution using Spark

I am currently using Iceberg in my project and have a question about it. My current scenario: I have loaded the data into my Iceberg table using a Spark DataFrame (I am doing this through Spark…