Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and event processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake issues.

Questions on using Apache Hudi

158 questions
2
votes
1 answer

Custom Payload class in Python for precombine and combineAndGet in Apache Hudi And Pyspark

We are migrating our code base from spark-java to PySpark. We were handling custom aggregations for merging data using preCombine() and combineAndGetUpdateValue() and had implemented this in our Spark-Java code. Example below: package…
2
votes
0 answers

How to access hudi metrics

How can Hudi metrics be accessed programmatically? After a commit, I would like to get metrics such as records updated / records inserted and log them into a database. I tried setting hoodie.metrics.on=true and hoodie.metrics.reporter.type=INMEMORY.…
Joha
  • 935
  • 12
  • 32
2
votes
1 answer

Connect Redshift Spectrum/ AWS EMR with Hudi directly or via AWS Glue Data Catalog

I'm trying to understand how to properly connect Redshift Spectrum with Hudi data. It looks like I can directly create a Redshift external table for data managed in Apache Hudi, as described in the following documentation…
2
votes
0 answers

How to get the last commit for a key when reading an Apache Hudi MOR table in incremental mode?

I have a MOR table with key = acctid. When I do 3 commits on the same key and try to read in incremental mode, I see only the 1st commit. Is there any way to read the last commit or all commits for a given key using incremental mode? Please check details…
sannidhi
  • 23
  • 5
2
votes
0 answers

java.lang.ClassNotFoundException: org.apache.parquet.hadoop.metadata.CompressionCodecName

Has anyone had this problem when integrating Hudi with the Spark shell? I just started learning Hudi from the official documentation. The environment is CDH-5.16.2, spark-2.3.0. import…
shiwei
  • 21
  • 2
2
votes
2 answers

Apache Spark and Hudi: tons of output files

I'm trying to read data from many different .csv files (all with the same "structure"), perform some operations with Spark, and finally save them in Hudi format. To store data in the same Hudi table, I thought the best approach would be to use the…
2
votes
1 answer

How to run hudi on dataproc and write to gcs bucket

I want to write to a GCS bucket from Dataproc using Hudi. To write to GCS using Hudi, the documentation says to set the property fs.defaultFS to the value gs:// (https://hudi.apache.org/docs/gcs_hoodie). However, when I set fs.defaultFS on Dataproc to be a GCS bucket I get…
Funzo
  • 1,190
  • 2
  • 14
  • 25
2
votes
1 answer

Writing spark DataFrame In Apache Hudi Table

I am new to Apache Hudi and am trying to write my DataFrame to a Hudi table using the Spark shell. This is the first write, so I am not creating any table and am writing in overwrite mode, expecting it to create the Hudi table. I am writing the code below. …
Rahul Patidar
  • 47
  • 1
  • 3
  • 9
2
votes
2 answers

Issue for Integrating Hudi with Kafka using Avro Schema

I am trying to integrate Hudi with a Kafka topic. Steps followed: Created a Kafka topic in Confluent with the schema defined in the schema registry. Using kafka-avro-console-producer, I am trying to produce data. Running Hudi Delta Streamer in continuous mode…
2
votes
0 answers

Flink's hive streaming vs iceberg/hudi/delta

There are some open-source data lake solutions that support CRUD/ACID/incremental pull, such as Iceberg, Hudi, and Delta. I think they have done what Flink's Hive streaming wants to do, and even do it better. So, I would ask what the real power of Flink's Hive…
Tom
  • 5,848
  • 12
  • 44
  • 104
2
votes
2 answers

More than 1 column in record key in spark Hudi Job while making an upsert

I am currently doing a POC on Delta Lake, where I came across a framework called Apache Hudi. Below is the data I am trying to write using the Apache Spark framework. private val INITIAL_ALBUM_DATA = Seq( Album(800,810, "6 String Theory", Array("Lay…
user3199285
  • 177
  • 2
  • 12
2
votes
1 answer

Error while running Apache Hudi deltastreamer

I am trying to run Hudi deltastreamer on AWS EMR. I followed the steps in this blog: https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI But when I run the spark-submit command below, I get an error: Exception in thread "main"…
raghuvd
  • 665
  • 1
  • 5
  • 6
2
votes
1 answer

Spark streaming - Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file

I'm using Spark to write my JSON data to S3. However, I keep getting the error below. We are using Apache Hudi for updates. This only happens for some data; everything else works fine. Caused by: org.apache.parquet.io.ParquetDecodingException: Can…
mythic
  • 535
  • 7
  • 21
1
vote
1 answer

deltastreamer.HoodieDeltaStreamer exception: Filesystem closed

I am using HoodieDeltaStreamer to connect to Kafka and store data in a Hoodie table. Hudi version: 0.10.1, Spark: 3.2.4, Hadoop: 3.3.5. Only one spark-submit job is running. cmd: spark-submit --class…
Ankit Bansal
  • 2,162
  • 8
  • 42
  • 79
1
vote
0 answers

Unarchiving Apache Hudi archived commits

Is it possible to unarchive an archived commit with Apache Hudi? For example, I've set the following configuration and have 4 commits. hoodie.keep.max.commits = 3 commit1 commit2 commit3 commit4 On the 4th commit, commit1 is archived and moved to…