Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on streaming and batch data processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
0 votes · 0 answers

Is there any way I can use StreamTableEnvironment in ProcessWindowFunction?

Scenario: I am using Flink to read the MySQL binlog and write to a Hudi table, but I want to partition the binlog data source into windows and batch-insert all the data within a window into the Hudi table when the window closes. My current approach is to use…
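For context, one common alternative to the approach described above: a `StreamTableEnvironment` is generally not usable inside a `ProcessWindowFunction` (it is driver-side, not serializable into operators), so the windowed batching is often expressed directly in Flink SQL with a tumbling-window TVF over the CDC source. This is a hedged sketch; the table and column names (`hudi_sink`, `mysql_binlog_source`, `op_ts`) are made up for illustration.

```python
# Illustrative Flink SQL: batch rows per tumbling window into the Hudi sink
# instead of calling a table environment from inside a ProcessWindowFunction.
# All identifiers below are hypothetical.
insert_sql = """
INSERT INTO hudi_sink
SELECT id, name, MAX(op_ts) AS op_ts
FROM TABLE(
  TUMBLE(TABLE mysql_binlog_source, DESCRIPTOR(op_ts), INTERVAL '1' MINUTE))
GROUP BY id, name, window_start, window_end
"""
# with a StreamTableEnvironment `t_env`:
# t_env.execute_sql(insert_sql)
```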
0 votes · 2 answers

Unable to alter column name for a Hudi table in AWS

I'm unable to alter the column name of a Hudi table. spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier") fails to change the column name. I'm unable to alter the column…
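As background: Hudi's support for `ALTER TABLE ... RENAME COLUMN` through Spark SQL depends on schema-on-read evolution being enabled in the session. This is a hedged sketch of something to try, not a verified fix; the table and column names come from the question, while the session setting is an assumption based on Hudi's schema-evolution configuration.

```python
# Assumption: rename requires Hudi's schema-on-read evolution to be enabled
# in the current Spark session before issuing the ALTER TABLE statement.
session_settings = [
    "set hoodie.schema.on.read.enable=true",
]
rename_sql = (
    "ALTER TABLE customer_db.customer "
    "RENAME COLUMN subid TO subidentifier"
)
# with an active SparkSession `spark`:
# for s in session_settings:
#     spark.sql(s)
# spark.sql(rename_sql)
```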
0 votes · 0 answers

Apache Hudi TimestampBasedKeyGenerator issue partitioning by year and month

I am using Apache Hudi version 0.12.0 in AWS Glue Version 4.0. I am trying to get my table to have partitions by month and year, and I cannot get this to work. Here is the code in my Glue Job: base_s3_path =…
cjf280830
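For reference, a hedged sketch of write options for year/month partitioning with `TimestampBasedKeyGenerator`. The field name `created_at` and the input date format are assumptions; the `hoodie.keygen.timebased.*` keys come from Hudi's key generator configuration.

```python
# Illustrative Hudi write options: partition on a timestamp column, emitting
# a two-level yyyy/MM partition path. `created_at` and the input format are
# assumptions for this sketch.
keygen_opts = {
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    "hoodie.datasource.write.partitionpath.field": "created_at",
    "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
    # the output format drives the partition layout: one level per slash
    "hoodie.keygen.timebased.output.dateformat": "yyyy/MM",
}
```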
0 votes · 0 answers

Issue with reading data from Hudi table incrementally in Spark-shell

I am encountering an error while attempting to read data from a Hudi table incrementally using Spark-shell. Below is the code I am using: import org.apache.hudi.DataSourceReadOptions._ import org.apache.hudi.HoodieDataSourceHelpers import…
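For context, a minimal sketch of a Hudi incremental read, assuming a begin instant time and base path that are hypothetical here (Hudi instants use a `yyyyMMddHHmmss`-style format).

```python
# Illustrative options for an incremental query; the instant time below is
# a made-up placeholder, not from the question.
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",
}
# with an active SparkSession `spark` and the table's base path:
# df = spark.read.format("hudi").options(**incremental_opts).load(base_path)
```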
0 votes · 0 answers

Error while trying to stream data from Kafka and store it in apache hudi

I am trying to store Kafka data in Apache Hudi. The Spark version I am using is 3.3.1, with Kafka clients 2.8.1, Hudi, and spark-sql-kafka-0-10_2.12, and while writing that code I get an org/apache/commons/pool2/impl/GenericKeyedObjectPool error. I am…
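As background on this class of error: `NoClassDefFoundError` for `GenericKeyedObjectPool` typically means commons-pool2 is missing from the classpath, which happens when the Kafka connector jar is added by hand without its transitive dependencies. A hedged sketch of letting Spark resolve them instead via `--packages`; the Hudi bundle coordinate and version here are assumptions, not from the question.

```python
# Build a --packages list so Spark resolves transitive dependencies
# (including commons-pool2) rather than relying on a single hand-copied jar.
spark_version = "3.3.1"  # from the question
packages = ",".join([
    f"org.apache.spark:spark-sql-kafka-0-10_2.12:{spark_version}",
    # assumed Hudi bundle/version for a Spark 3.3 cluster:
    "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0",
])
# spark-submit --packages <packages> my_job.py
```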
0 votes · 0 answers

Unable to read Hudi file in Spark Databricks Environment

I am facing this error while running Spark in Databricks. I am trying to read the Hudi file format. I’m using Hudi 0.13.0 with Databricks 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12). Trying to load a Hudi data set from S3, but it failed with this…
0 votes · 2 answers

Change the location of a Hudi table in AWS?

Describe the problem you faced: How can we change the location of a Hudi table to a new location? I have a customer table saved at s3://aws-amazon-com/Customer/ which I want to change to s3://aws-amazon-com/CustomerUpdated/. I'm working on Glue…
0 votes · 1 answer

Performance and Data Integrity Issues with Hudi for Long-Term Data Retention

Our project requires that we perform full loads daily, retaining these versions for future queries. Upon implementing Hudi to maintain 6 years of data with the following setup: "hoodie.cleaner.policy":…
Luiz
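For reference, the shape of the configuration such a long-retention setup typically involves. This is a hedged sketch: the commit counts are illustrative (≈2190 daily commits for 6 years), and the real constraint to note is that the archival bounds (`hoodie.keep.min.commits`) must sit above the cleaner's retained-commit count; a timeline this long is expensive for Hudi to manage, which is likely the performance tension in the question.

```python
# Illustrative retention configuration for ~6 years of daily commits.
# Counts are assumptions; keep.min.commits must exceed commits.retained.
retention_opts = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "2190",
    "hoodie.keep.min.commits": "2191",
    "hoodie.keep.max.commits": "2192",
}
```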
0 votes · 1 answer

Querying Apache Hudi using PySpark on EMR by table name

While writing data to Apache Hudi on EMR using PySpark, we can specify the table name in the configuration. See hudiOptions = { 'hoodie.table.name': 'tableName', 'hoodie.datasource.write.recordkey.field':…
Anurag A S
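A hedged sketch of what such a `hudiOptions` dictionary might look like, completed with assumed field names (`id`, `created`, `updated_at` are placeholders). One way to later query by table name rather than by path is to enable hive sync at write time, so the table is registered in the catalog and reachable via `spark.table("tableName")`.

```python
# Illustrative Hudi write options; record key, partition path, and precombine
# field names are assumptions for this sketch.
hudiOptions = {
    "hoodie.table.name": "tableName",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "created",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # registering via hive sync enables querying by name instead of path:
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.table": "tableName",
}
# df.write.format("hudi").options(**hudiOptions).mode("append").save(base_path)
# later: spark.table("tableName")
```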
0 votes · 1 answer

Is there standard way to get the data lake format from parquet file? (e.g. Apache iceberg, Apache Hudi, Deltalake)

I am writing a Parquet cleanup job using PyArrow. However, I only want to process native Parquet files and skip any .parquet files that belong to Iceberg, Hudi, or Delta Lake tables. This is because these formats require updates to be done through the…
Sam.E
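There is no in-file marker inside the .parquet files themselves, but each table format leaves a characteristic metadata directory next to the data, which allows a best-effort detection. A sketch under that assumption:

```python
import os

def detect_table_format(table_dir):
    """Best-effort detection of which format owns a directory of .parquet
    files, based on each format's marker metadata: Hudi keeps a .hoodie/
    directory, Delta Lake a _delta_log/ directory, and Iceberg a metadata/
    directory containing *.metadata.json files."""
    if os.path.isdir(os.path.join(table_dir, ".hoodie")):
        return "hudi"
    if os.path.isdir(os.path.join(table_dir, "_delta_log")):
        return "delta"
    meta = os.path.join(table_dir, "metadata")
    if os.path.isdir(meta) and any(
            name.endswith(".metadata.json") for name in os.listdir(meta)):
        return "iceberg"
    return "parquet"  # no table-format markers: treat as native Parquet
```

Note this inspects the table root, so for partitioned layouts the check should run against the top-level directory, not each partition folder.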
0 votes · 1 answer

Non Nested AVRO Schema For Postgres Change-Log Events (Debezium <> Confluent Schema Registry)

A purely Avro question first: Is it possible to have an Avro schema compatible with the following message, whose before and after fields must be of the same record type: { "before": null, "after": { "id": 1, "name": "Bob" }, "op":…
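On the Avro part: yes, Avro allows a named record type to be defined once and then referenced by name, so `before` and `after` can share one record type inside `["null", ...]` unions. A sketch, with the record and field names (`Envelope`, `Row`, `id`, `name`) matching the example message rather than any real Debezium schema:

```python
import json

# Sketch: define the row record once under "before", then reference it by
# name ("Row") for "after", so both fields share the same record type.
schema = {
    "type": "record",
    "name": "Envelope",
    "fields": [
        {"name": "before", "type": ["null", {
            "type": "record",
            "name": "Row",
            "fields": [
                {"name": "id", "type": "int"},
                {"name": "name", "type": "string"},
            ],
        }], "default": None},
        {"name": "after", "type": ["null", "Row"], "default": None},
        {"name": "op", "type": "string"},
    ],
}
schema_json = json.dumps(schema)
```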
0 votes · 1 answer

Flink streaming Kinesis to Hudi not writing any data

I'm trying out PyFlink for streaming data from Kinesis into Hudi format, but can't figure out why it is not writing any data. I hope that maybe someone can provide any pointers. Versions: Flink 1.15.4, Python 3.7, Hudi 0.13.0 I use streaming table…
Timo
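One well-known cause of this symptom: the Hudi Flink sink only commits files on checkpoint, so a streaming job with checkpointing disabled appears to write nothing. A hedged sketch; the table name, path, and interval below are illustrative, not from the question.

```python
# The sink commits on checkpoint, so enable checkpointing in the job.
checkpoint_interval_ms = 60_000  # illustrative interval

# Hypothetical Hudi sink DDL for a PyFlink table job:
sink_ddl = """
CREATE TABLE hudi_sink (
  id STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 's3://my-bucket/hudi_sink',
  'table.type' = 'COPY_ON_WRITE'
)
"""
# with a StreamExecutionEnvironment `env` and StreamTableEnvironment `t_env`:
# env.enable_checkpointing(checkpoint_interval_ms)
# t_env.execute_sql(sink_ddl)
```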
0 votes · 1 answer

HUDI compaction using Flink raises NullPointerException: Value must not be null

I followed the example on Hudi's website. Instead of using hudi-flink-bundle_2.11-0.9.0-SNAPSHOT.jar, I use hudi-flink1.16-bundle-0.13.0.jar, acquired from here. Command: $FLINK_HOME/bin/flink run \ -c…
Bing-hsu Gao
0 votes · 0 answers

HUDI: how to apply the CDC delete and upsert events?

I am reading https://medium.com/slalom-build/data-lakehouse-building-the-next-generation-of-data-lakes-using-apache-hudi-41550f62f5f and I cannot understand the following piece of code. It seems that upsert CDC events are applied before delete CDC…
BAE
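The key idea behind why apply order within a batch need not matter: Hudi deduplicates records by key using the precombine ("ordering") field, so the event with the latest ordering value wins, whether it is an upsert or a delete. A pure-Python illustration of that resolution logic (not Hudi code):

```python
# Toy model of precombine-based resolution: for each key, keep the event
# with the highest ts, then drop keys whose surviving event is a delete.
def resolve(events):
    """events: list of dicts with 'key', 'ts', 'op' ('u' upsert, 'd' delete).
    Returns the surviving state per key."""
    latest = {}
    for e in events:
        cur = latest.get(e["key"])
        if cur is None or e["ts"] >= cur["ts"]:
            latest[e["key"]] = e
    return {k: v for k, v in latest.items() if v["op"] != "d"}

batch = [
    {"key": 1, "ts": 2, "op": "d"},  # delete appears first in the batch
    {"key": 1, "ts": 1, "op": "u"},  # older upsert applied after it
]
state = resolve(batch)  # key 1 stays deleted: the delete has the later ts
```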
0 votes · 1 answer

Unable to see hive partitions when running Hudi DeltaStreamer with `TimestampBasedKeyGenerator` (but able to see hudi partitions)

In Hudi, I’m using the TimestampBasedKeyGenerator and can see the resulting partitions (e.g. using the hudi cli I'm able to see the partition for 2023-03-05) in my S3 path (e.g. s3://my_bucket/my_table/2023-03-25/): hudi:my_table->show fsview…
Will
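For context on this symptom: Hudi partitions existing while Hive partitions are missing usually points at the hive sync configuration, in particular a partition extractor that does not match the key generator's output layout. A hedged sketch of the relevant options; the field name is an assumption, and the extractor class is one plausible choice for a single-level `2023-03-25`-style path.

```python
# Illustrative hive sync options for DeltaStreamer; partition_fields is an
# assumed column name, and the extractor class must match the partition
# path layout the key generator actually produces.
hive_sync_opts = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "created_at",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}
```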