Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and stream processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
1 vote · 2 answers

Cannot create hive connection jdbc:hive2://localhost:10000. spark-submit in cluster mode

I'm running an Apache Hudi application on Apache Spark. When I submit the application in client mode it works fine, but when I submit it in cluster mode I get an error: py4j.protocol.Py4JJavaError: An error occurred while…
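For the question above, a common issue in cluster mode is that the driver no longer runs on the submit host, so a HiveServer2 URL of jdbc:hive2://localhost:10000 stops resolving. Below is a minimal, hedged PySpark sketch of Hudi write options that pass an explicit, reachable Hive sync endpoint; the table name, record key, path, and hostname are hypothetical placeholders, and the hoodie.datasource.hive_sync.* keys should be checked against the Hudi version in use.

    # Sketch only: Hudi write options with Hive sync pointed at an explicit
    # HiveServer2 host instead of localhost. Names and paths are placeholders.
    hudi_options = {
        "hoodie.table.name": "my_table",                       # hypothetical
        "hoodie.datasource.write.recordkey.field": "id",       # hypothetical
        "hoodie.datasource.write.precombine.field": "ts",      # hypothetical
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "jdbc",
        # In cluster mode the driver is not on the submit host, so "localhost"
        # no longer points at HiveServer2; use a resolvable hostname instead.
        "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://hive-server.internal:10000",
        "hoodie.datasource.hive_sync.database": "default",
        "hoodie.datasource.hive_sync.table": "my_table",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://bucket/path/my_table"))                     # hypothetical base path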
1 vote · 1 answer

Hudi: Access to timeserver times out in embedded mode

I am testing Hudi 0.5.3 (the version supported by AWS Athena) by running it with Spark in embedded mode, i.e. in unit tests. At first the tests succeeded, but now they fail due to a timeout when accessing Hudi's timeserver. The following is based on Hudi:…
alecswan
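For the timeserver timeout above, one direction sometimes suggested for local unit tests is to disable Hudi's embedded timeline server so file listings go directly against the filesystem. A minimal, hedged PySpark sketch follows; the table name, fields, and local path are hypothetical, and the hoodie.embed.timeline.server key should be verified for Hudi 0.5.3.

    # Sketch only: write a small test table with the embedded timeline server disabled,
    # assuming the timeout comes from executors failing to reach the driver-hosted service.
    hudi_options = {
        "hoodie.table.name": "test_table",                     # hypothetical
        "hoodie.datasource.write.recordkey.field": "id",       # hypothetical
        "hoodie.datasource.write.precombine.field": "ts",      # hypothetical
        "hoodie.embed.timeline.server": "false",               # skip the embedded timeserver
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("overwrite")
       .save("/tmp/hudi_test_table"))                          # local path for unit tests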
1 vote · 1 answer

Apache Hudi deltastreamer throwing Exception in thread "main" org.apache.hudi.com.beust.jcommander.ParameterException: no main parameter was defined

Versions: Apache Hudi 0.6.1, Spark 2.4.6. Below is the standard spark-submit command for the Hudi deltastreamer, which throws "no main parameter was defined". As far as I can see, all the property parameters are given. I'd appreciate any help on this…
Nizam
1 vote · 1 answer

Apache Hudi commit id for current ingestion

How do I get the commit ID of the current ingestion? I know the HoodieDataSourceHelpers.latestCommit method can be used to find the latest commit, but what happens if there are concurrent writes in different threads? I need to find each thread's commit ID.
Nizamudeen
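One hedged way to approach the question above from PySpark is to snapshot the timeline before a writer runs and then list the commits added afterwards, using the HoodieDataSourceHelpers class already mentioned in the question via the JVM gateway. The base path is a hypothetical placeholder, and with truly concurrent writers the diff can still contain other writers' instants, so this is only a sketch.

    # Sketch only: capture the latest commit before this thread writes, then list the
    # instants completed since that point. base_path is a placeholder.
    base_path = "s3://bucket/path/my_table"

    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        jvm.java.net.URI.create(base_path), hadoop_conf)

    helpers = jvm.org.apache.hudi.HoodieDataSourceHelpers
    before = helpers.latestCommit(fs, base_path)      # timeline snapshot before this writer runs

    # ... this thread's Hudi write happens here ...

    commits_since = helpers.listCommitsSince(fs, base_path, before)
    print(list(commits_since))                        # instants completed after `before`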
0 votes · 1 answer

Redshift interpreting boolean data type as bit and hence not able to move hudi table from S3 to Redshift if there is any boolean data type column

I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the Hudi format in S3 as Parquet files. I've created a PySpark script for the full-load transfer, and for POC purposes my Hudi table in S3 contains 4…
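For the boolean-to-Redshift issue above, one possible workaround (a sketch, not a definitive fix) is to cast boolean columns to a small integer before writing the Hudi table, so the downstream copy sees 0/1 instead of a bit-like type. Here df stands for the source DataFrame in the pipeline.

    # Sketch only: cast every boolean column to smallint (true -> 1, false -> 0).
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    bool_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, BooleanType)]
    for c in bool_cols:
        df = df.withColumn(c, F.col(c).cast("smallint"))

    # df can then be written with the usual Hudi options used elsewhere in the pipeline.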
0 votes · 1 answer

Could not initialize class org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex

I'm trying to use Hudi with our Flink pipelines to publish data to S3 object storage in Parquet format. I'm getting the error below while doing so: java.lang.NoClassDefFoundError: Could not initialize class…
user3497321
0 votes · 0 answers

Flink + NoClassDefFoundError: Could not initialize class org.apache.hudi.avro.HoodieAvroWriteSupport

I'm using Flink SQL to write results to S3 object storage through Hudi. I'm facing an exception when doing so: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.avro.HoodieAvroWriteSupport at…
user3497321
0 votes · 0 answers

Update few columns & insert all columns during hudi merge

I have an existing table with these columns (record key is emp_id, the precombine field is log_ts, and the partition field is log_dt): emp_id, emp_name, log_ts, load_ts, log_dt. Sample row: 1, neo, 2023-08-04 12:00:00, 2023-08-04…
praneethh
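For the partial-update question above, newer Hudi releases ship a partial-update payload class; whether it is available depends on the Hudi version, so the sketch below is an assumption to verify rather than a confirmed fix. Record key, precombine, and partition fields are taken from the question; the table name and path are hypothetical.

    # Sketch only: with PartialUpdateAvroPayload, updates overwrite only the columns that
    # are non-null in the incoming batch, while brand-new keys are inserted as full rows.
    hudi_options = {
        "hoodie.table.name": "emp_table",                                  # hypothetical
        "hoodie.datasource.write.recordkey.field": "emp_id",
        "hoodie.datasource.write.precombine.field": "log_ts",
        "hoodie.datasource.write.partitionpath.field": "log_dt",
        "hoodie.datasource.write.payload.class":
            "org.apache.hudi.common.model.PartialUpdateAvroPayload",       # version-dependent
        "hoodie.datasource.write.operation": "upsert",
    }

    (incoming_df.write.format("hudi")    # incoming_df: batch with unchanged columns left null
        .options(**hudi_options)
        .mode("append")
        .save("s3://bucket/path/emp_table"))                               # hypothetical path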
0 votes · 0 answers

Most suitable architecture for AWS real-time ETL pipeline with transactional tables as a sink

The task is as follows: build a near real-time pipeline for the project on AWS infrastructure. Given NoSQL (DynamoDB) and SQL (RDS) sources (S3-stored data may be added in the future), we need to combine them into tables used by…
0 votes · 1 answer

PySpark: querying Hudi partitioned table

I'm following the Apache Hudi documentation to write and read a Hudi table. Here's the code I'm using to create and save a PySpark DataFrame into Azure DataLake Gen2: tableName = "my_hudi_table" basePath = <> dataGen =…
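As a companion to the question above, here is a minimal, hedged PySpark sketch of reading a Hudi table back and filtering on the partition column. The base path is a placeholder (the question's own basePath is elided), and the column names assume the Hudi quickstart data generator that the dataGen reference suggests.

    # Sketch only: snapshot-read the table and prune by the partition path column.
    from pyspark.sql import functions as F

    base_path = "abfss://container@account.dfs.core.windows.net/hudi/my_hudi_table"  # placeholder

    trips_df = spark.read.format("hudi").load(base_path)

    (trips_df
        .where(F.col("partitionpath") == "americas/united_states/san_francisco")
        .select("uuid", "ts", "rider", "driver", "fare")
        .show(truncate=False))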
0 votes · 1 answer

Writing Hudi table into Azure DataLake Gen2

I need to create a Hudi table from a PySpark DataFrame using an Azure Databricks notebook and save it into Azure DataLake Gen2. Here's my approach: spark.sparkContext.setSystemProperty("spark.serializer",…
jakeis
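For the Databricks/ADLS question above, the sketch below shows the overall write path the excerpt starts to describe: a Kryo serializer setting plus a Hudi write to an abfss:// location. Account, container, table, and column names are hypothetical, and it assumes the Hudi Spark bundle is installed on the cluster; on Databricks the serializer is usually set in the cluster Spark config rather than after the session starts.

    # Sketch only, not the asker's exact code.
    spark.sparkContext.setSystemProperty(
        "spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    table_name = "my_hudi_table"
    base_path = f"abfss://container@account.dfs.core.windows.net/hudi/{table_name}"  # placeholder

    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",        # hypothetical key column
        "hoodie.datasource.write.precombine.field": "ts",         # hypothetical ordering column
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        "hoodie.datasource.write.operation": "upsert",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("overwrite")
       .save(base_path))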
0 votes · 1 answer

Apache Hudi Incremental Write with Hive Sync Enabled Failing with org.apache.hudi.hive.HiveSyncTool: Schema difference found

I am trying an incremental write to a Hudi table with Hive sync enabled, but it is failing with the following error: 23/07/24 11:52:48 INFO org.apache.hudi.hive.HiveSyncTool: Schema difference found for table1 23/07/24 11:52:48 INFO…
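For the schema-difference failure above, one direction that is sometimes tried (hedged, and version-dependent) is to let Hudi reconcile the incoming batch schema with the table schema before the write, so Hive sync does not see an unexpected difference. The option key below comes from the Hudi write configs and should be verified for the version in use; the table name and path are hypothetical.

    # Sketch only: enable schema reconciliation on the incremental write.
    hudi_options = {
        "hoodie.table.name": "table1",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.reconcile.schema": "true",   # merge incoming and table schemas
        "hoodie.datasource.hive_sync.enable": "true",
    }

    (incremental_df.write.format("hudi")   # incremental_df: the incremental batch
        .options(**hudi_options)
        .mode("append")
        .save("s3://bucket/path/table1"))                     # hypothetical path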
0 votes · 0 answers

Can Hudi source be defined as retract stream in FlinkSQL

I'm trying to use Hudi as a source to perform aggregation in Flink SQL, and I found the Hudi source is an append stream, not a retract stream. This makes the aggregation results accumulate incorrectly. My code looks like: create table hudi_source ( many…
maple
0 votes · 0 answers

Async Clustering failing for MOR table with object not serializable (class: org.apache.avro.generic.GenericData$Record) error

Problem description: We have a MOR table which is partitioned by yearmonth (yyyyMM). We would like to trigger async clustering after doing the compaction at the end of the day so that we can stitch small files together into larger files. Async…
0 votes · 0 answers

Hudi Sink Connector shows broker disconnected

I am trying to add the Hudi Sink Connector to AWS MSK using the config below: bootstrap.servers=****** connector.class=org.apache.hudi.connect.HoodieSinkConnector tasks.max=4 flush.size=10 s3.region=us-east-1 …