Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and stream processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
1 vote · 2 answers

Cannot create hive connection jdbc:hive2://localhost:10000. spark-submit in cluster mode

I'm running an Apache Hudi application on Apache Spark. When I submit the application in client mode it works fine, but when I submit it in cluster mode I get an error: py4j.protocol.Py4JJavaError: An error occurred while…
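For the question above, a common issue in cluster mode is that the driver no longer runs on the submit host, so a HiveServer2 URL of jdbc:hive2://localhost:10000 stops resolving. Below is a minimal, hedged PySpark sketch of Hudi write options that pass an explicit, reachable Hive sync endpoint; the table name, record key, path, and hostname are hypothetical placeholders, and the hoodie.datasource.hive_sync.* keys should be checked against the Hudi version in use.

    # Sketch only: Hudi write options with Hive sync pointed at an explicit
    # HiveServer2 host instead of localhost. Names and paths are placeholders.
    hudi_options = {
        "hoodie.table.name": "my_table",                       # hypothetical
        "hoodie.datasource.write.recordkey.field": "id",       # hypothetical
        "hoodie.datasource.write.precombine.field": "ts",      # hypothetical
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "jdbc",
        # In cluster mode the driver is not on the submit host, so "localhost"
        # no longer points at HiveServer2; use a resolvable hostname instead.
        "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://hive-server.internal:10000",
        "hoodie.datasource.hive_sync.database": "default",
        "hoodie.datasource.hive_sync.table": "my_table",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3://bucket/path/my_table"))                     # hypothetical base path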
1 vote · 1 answer

Hudi: Access to timeserver times out in embedded mode

I am testing Hudi 0.5.3 (the version supported by AWS Athena) by running it with Spark in embedded mode, i.e. in unit tests. At first the tests succeeded, but now they fail due to a timeout when accessing Hudi's timeserver. The following is based on Hudi:…
alecswan
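For the timeserver timeout above, one direction sometimes suggested for local unit tests is to disable Hudi's embedded timeline server so file listings go directly against the filesystem. A minimal, hedged PySpark sketch follows; the table name, fields, and local path are hypothetical, and the hoodie.embed.timeline.server key should be verified for Hudi 0.5.3.

    # Sketch only: write a small test table with the embedded timeline server disabled,
    # assuming the timeout comes from executors failing to reach the driver-hosted service.
    hudi_options = {
        "hoodie.table.name": "test_table",                     # hypothetical
        "hoodie.datasource.write.recordkey.field": "id",       # hypothetical
        "hoodie.datasource.write.precombine.field": "ts",      # hypothetical
        "hoodie.embed.timeline.server": "false",               # skip the embedded timeserver
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("overwrite")
       .save("/tmp/hudi_test_table"))                          # local path for unit tests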
1 vote · 1 answer

Apache Hudi deltastreamer throwing Exception in thread "main" org.apache.hudi.com.beust.jcommander.ParameterException: no main parameter was defined

Versions: Apache Hudi 0.6.1, Spark 2.4.6. Below is the standard spark-submit command for the Hudi deltastreamer, which throws "no main parameter was defined". As far as I can see, all the property parameters are given. I'd appreciate any help on this…
Nizam
1 vote · 1 answer

Apache Hudi commit id for current ingestion

How do I get the commit ID of the current ingestion? I know the HoodieDataSourceHelpers.latestCommit method can be used to find the latest commit, but what happens if there are concurrent writes in different threads? I need to find each thread's commit ID.
Nizamudeen
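One hedged way to approach the question above from PySpark is to snapshot the timeline before a writer runs and then list the commits added afterwards, using the HoodieDataSourceHelpers class already mentioned in the question via the JVM gateway. The base path is a hypothetical placeholder, and with truly concurrent writers the diff can still contain other writers' instants, so this is only a sketch.

    # Sketch only: capture the latest commit before this thread writes, then list the
    # instants completed since that point. base_path is a placeholder.
    base_path = "s3://bucket/path/my_table"

    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(
        jvm.java.net.URI.create(base_path), hadoop_conf)

    helpers = jvm.org.apache.hudi.HoodieDataSourceHelpers
    before = helpers.latestCommit(fs, base_path)      # timeline snapshot before this writer runs

    # ... this thread's Hudi write happens here ...

    commits_since = helpers.listCommitsSince(fs, base_path, before)
    print(list(commits_since))                        # instants completed after `before`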
0 votes · 1 answer

Redshift interpreting boolean data type as bit and hence not able to move hudi table from S3 to Redshift if there is any boolean data type column

I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the Hudi format in S3 as Parquet files. I've created a PySpark script for the full-load transfer, and for POC purposes my Hudi table in S3 contains 4…
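For the boolean-to-Redshift issue above, one possible workaround (a sketch, not a definitive fix) is to cast boolean columns to a small integer before writing the Hudi table, so the downstream copy sees 0/1 instead of a bit-like type. Here df stands for the source DataFrame in the pipeline.

    # Sketch only: cast every boolean column to smallint (true -> 1, false -> 0).
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    bool_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, BooleanType)]
    for c in bool_cols:
        df = df.withColumn(c, F.col(c).cast("smallint"))

    # df can then be written with the usual Hudi options used elsewhere in the pipeline.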
0 votes · 1 answer

Could not initialize class org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex

I'm trying to use Hudi with our Flink pipelines to publish data to S3 object storage in Parquet format. I'm getting the error below while doing so: java.lang.NoClassDefFoundError: Could not initialize class…
user3497321
0 votes · 0 answers

Flink + NoClassDefFoundError: Could not initialize class org.apache.hudi.avro.HoodieAvroWriteSupport

I'm using Flink SQL to write results to S3 object storage through Hudi. I'm facing an exception when doing so: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.avro.HoodieAvroWriteSupport at…
user3497321
0 votes · 0 answers

Update few columns & insert all columns during hudi merge

I have an existing table with these columns (record key is emp_id, the precombine field is log_ts, and the partition field is log_dt): emp_id, emp_name, log_ts, load_ts, log_dt. Sample row: 1, neo, 2023-08-04 12:00:00, 2023-08-04…
praneethh
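For the partial-update question above, newer Hudi releases ship a partial-update payload class; whether it is available depends on the Hudi version, so the sketch below is an assumption to verify rather than a confirmed fix. Record key, precombine, and partition fields are taken from the question; the table name and path are hypothetical.

    # Sketch only: with PartialUpdateAvroPayload, updates overwrite only the columns that
    # are non-null in the incoming batch, while brand-new keys are inserted as full rows.
    hudi_options = {
        "hoodie.table.name": "emp_table",                                  # hypothetical
        "hoodie.datasource.write.recordkey.field": "emp_id",
        "hoodie.datasource.write.precombine.field": "log_ts",
        "hoodie.datasource.write.partitionpath.field": "log_dt",
        "hoodie.datasource.write.payload.class":
            "org.apache.hudi.common.model.PartialUpdateAvroPayload",       # version-dependent
        "hoodie.datasource.write.operation": "upsert",
    }

    (incoming_df.write.format("hudi")    # incoming_df: batch with unchanged columns left null
        .options(**hudi_options)
        .mode("append")
        .save("s3://bucket/path/emp_table"))                               # hypothetical path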
0 votes · 0 answers

Most suitable architecture for AWS real-time ETL pipeline with transactional tables as a sink

The task is as follows: build a near real-time pipeline for the project on AWS infrastructure. Given NoSQL (DynamoDB) and SQL (RDS) sources (S3-stored data may be added in the future), we need to combine them into tables used by…
0 votes · 1 answer

PySpark: querying Hudi partitioned table

I'm following the Apache Hudi documentation to write and read a Hudi table. Here's the code I'm using to create and save a PySpark DataFrame into Azure DataLake Gen2: tableName = "my_hudi_table" basePath = <> dataGen =…
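As a companion to the question above, here is a minimal, hedged PySpark sketch of reading a Hudi table back and filtering on the partition column. The base path is a placeholder (the question's own basePath is elided), and the column names assume the Hudi quickstart data generator that the dataGen reference suggests.

    # Sketch only: snapshot-read the table and prune by the partition path column.
    from pyspark.sql import functions as F

    base_path = "abfss://container@account.dfs.core.windows.net/hudi/my_hudi_table"  # placeholder

    trips_df = spark.read.format("hudi").load(base_path)

    (trips_df
        .where(F.col("partitionpath") == "americas/united_states/san_francisco")
        .select("uuid", "ts", "rider", "driver", "fare")
        .show(truncate=False))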
0 votes · 1 answer

Writing Hudi table into Azure DataLake Gen2

I need to create a Hudi table from a PySpark DataFrame using an Azure Databricks notebook and save it into Azure DataLake Gen2. Here's my approach: spark.sparkContext.setSystemProperty("spark.serializer",…
jakeis
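For the Databricks/ADLS question above, the sketch below shows the overall write path the excerpt starts to describe: a Kryo serializer setting plus a Hudi write to an abfss:// location. Account, container, table, and column names are hypothetical, and it assumes the Hudi Spark bundle is installed on the cluster; on Databricks the serializer is usually set in the cluster Spark config rather than after the session starts.

    # Sketch only, not the asker's exact code.
    spark.sparkContext.setSystemProperty(
        "spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    table_name = "my_hudi_table"
    base_path = f"abfss://container@account.dfs.core.windows.net/hudi/{table_name}"  # placeholder

    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",        # hypothetical key column
        "hoodie.datasource.write.precombine.field": "ts",         # hypothetical ordering column
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        "hoodie.datasource.write.operation": "upsert",
    }

    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("overwrite")
       .save(base_path))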
0 votes · 1 answer

Apache Hudi Incremental Write with Hive Sync Enabled Failing with org.apache.hudi.hive.HiveSyncTool: Schema difference found

I am trying an incremental write to a Hudi table with Hive sync enabled, but it is failing with the following error: 23/07/24 11:52:48 INFO org.apache.hudi.hive.HiveSyncTool: Schema difference found for table1 23/07/24 11:52:48 INFO…
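For the schema-difference failure above, one direction that is sometimes tried (hedged, and version-dependent) is to let Hudi reconcile the incoming batch schema with the table schema before the write, so Hive sync does not see an unexpected difference. The option key below comes from the Hudi write configs and should be verified for the version in use; the table name and path are hypothetical.

    # Sketch only: enable schema reconciliation on the incremental write.
    hudi_options = {
        "hoodie.table.name": "table1",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.reconcile.schema": "true",   # merge incoming and table schemas
        "hoodie.datasource.hive_sync.enable": "true",
    }

    (incremental_df.write.format("hudi")   # incremental_df: the incremental batch
        .options(**hudi_options)
        .mode("append")
        .save("s3://bucket/path/table1"))                     # hypothetical path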
0 votes · 0 answers

Can Hudi source be defined as retract stream in FlinkSQL

I'm trying to use Hudi as a source to perform aggregation in Flink SQL, and I found the Hudi source is an append stream, not a retract stream. This makes the aggregation results accumulate incorrectly. My code looks like: create table hudi_source ( many…
maple
0 votes · 0 answers

Async Clustering failing for MOR table with object not serializable (class: org.apache.avro.generic.GenericData$Record) error

Problem description: We have a MOR table which is partitioned by yearmonth (yyyyMM). We would like to trigger async clustering after doing the compaction at the end of the day so that we can stitch small files together into larger files. Async…
0 votes · 0 answers

Hudi Sink Connector shows broker disconnected

I am trying to add the Hudi Sink Connector to AWS MSK using the config below: bootstrap.servers=****** connector.class=org.apache.hudi.connect.HoodieSinkConnector tasks.max=4 flush.size=10 s3.region=us-east-1 …