Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and event processing (with ACID support). Use this tag for questions specific to problems with Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
1
vote
0 answers

java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation$UUIDLogicalTypeAnnotation while fetching data from Hudi

I am trying to view some data from Hudi using the code below in Spark. import org.apache.hudi.DataSourceReadOptions; val hudiIncQueryDF = spark .read() .format("hudi") .option(DataSourceReadOptions.QUERY_TYPE,…
radhika sharma
1
vote
1 answer

pyspark: Get hudi last/latest commit using pyspark

I am doing an incremental query with spark-hudi every hour and saving that incremental query's begin and end time in a DB (say MySQL) every time. For the next incremental query, I use the end time of the previous query, fetched from MySQL, as the begin time. incremental query…
1
vote
1 answer

class "org.apache.flink.streaming.api.operators.MailboxExecutor" not found

When I use Hudi 0.10.1 with Flink 1.14.0, I get the exception "not found class org.apache.flink.streaming.api.operators.MailboxExecutor". I found that "MailboxExecutor" is in Flink 1.13.1. How can I fix this? Compile with Flink 1.14?
1
vote
0 answers

Can we incrementally query a Hudi table based on a custom column (Spark SQL)

I'm trying to ingest historical data into a data catalog using Apache Hudi upsert. As the data is years and months old, I wanted to iterate each month, adding the historical date as a column to be queryable. The problem is: incremental queries in…
1
vote
1 answer

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1

I am querying a Hudi table using Hive, which is running on the Spark engine in an EMR 6.3.1 cluster. The Hudi version is 0.7. I have inserted a few records and then updated the same using Hudi Merge on Read. This will internally create new files under the same…
vijayinani
1
vote
0 answers

Committing hudi files manually

I am using Spark 3.x with Apache Hudi 0.8.0. While trying to create a Presto table using the hudi-hive-sync tool, I get the error below. Got runtime exception when hive syncing java.lang.IllegalArgumentException: Could not find any data…
Shasu
1
vote
1 answer

What does each section of the Parquet file name written with Apache Hudi represent?

Apache Hudi writes out each parquet file like below: 0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet I'm trying to understand what each section of the file represents. Here is my current understanding but I would like…
cauthon
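As a rough illustration of the file-name question above, here is a minimal Python sketch that splits the sample name from the excerpt into its underscore-separated parts. It assumes the commonly described base-file layout `<fileId>_<writeToken>_<instantTime>.<extension>`; the field names and the helper `parse_hudi_base_file_name` are hypothetical labels chosen for this example, not part of any Hudi API, and the layout may differ between Hudi versions or for log files.

```python
from typing import NamedTuple


class HudiBaseFileName(NamedTuple):
    """Illustrative container; field names are assumptions, not Hudi API names."""
    file_id: str
    write_token: str
    instant_time: str
    extension: str


def parse_hudi_base_file_name(name: str) -> HudiBaseFileName:
    """Split a base file name on its extension and underscores.

    Assumes the layout <fileId>_<writeToken>_<instantTime>.<extension>.
    """
    stem, dot, extension = name.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {name!r}")
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected 3 underscore-separated parts in {name!r}")
    return HudiBaseFileName(parts[0], parts[1], parts[2], extension)


# Sample file name taken from the question excerpt above.
parsed = parse_hudi_base_file_name(
    "0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet"
)
print(parsed)
```

Under that assumption, the first part is the file ID, the middle part is the write token, and the last part is the commit instant that produced the file.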
1
vote
2 answers

EMR Hudi cannot create hive connection jdbc:hive2://localhost:10000/

Trying to save a Hudi table in a Jupyter notebook with hive-sync enabled. I am using EMR 5.28.0 with AWS Glue enabled as the catalog: # Create a DataFrame inputDF = spark.createDataFrame( [ ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"), …
dytyniak
1
vote
0 answers

AWS Partitioned Hudi

I have a dataset of around 180000000 records in .csv that I transform into Hudi Parquet through a Glue job. It's partitioned by one column. Everything is written successfully, but reading the Hudi data in a Glue job takes too long (>30 min). I tried to read only…
1
vote
1 answer

Hudi partition and upsert are not working

What is wrong in this config? Partition keys are not working in Hudi, and all the records get updated in the Hudi dataset during the upsert, so I couldn't extract the delta from the tables. commonConfig = {'className' :…
Suganya
1
vote
0 answers

Partition pruning not working on Hudi dataset

We have created a Hudi dataset which has a two-level partition like this: s3://somes3bucket/partition1=value/partition2=value, where partition1 and partition2 are of type string. When running a simple count query using Hudi format in spark-shell, it…
Raj
1
vote
1 answer

What is the timestamp format of _hoodie_commit_time column in Apache Hudi?

I'm exploring apache-hudi framework and following the quick guide. I'm trying out incremental query functionality, where we use the column _hoodie_commit_time for determining the incremental pull. I was wondering what is the timestamp format &…
Anoop Deshpande
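For readers wondering how a `_hoodie_commit_time` value might be read as a timestamp, the sketch below converts such a string into a Python `datetime`. It assumes the value is a digit string in `yyyyMMddHHmmss` form (14 digits, as in the sample `20211117172738` quoted elsewhere on this page) or, as an assumption about newer Hudi releases, the same form with three extra millisecond digits. The helper name `parse_hoodie_commit_time` is hypothetical, and the string itself carries no time zone, so the result is a naive `datetime`.

```python
from datetime import datetime, timedelta


def parse_hoodie_commit_time(value: str) -> datetime:
    """Parse a commit-time string into a naive datetime.

    Assumes 14 digits (yyyyMMddHHmmss) or 17 digits (with milliseconds).
    """
    if not value.isdigit() or len(value) not in (14, 17):
        raise ValueError(f"unexpected commit time format: {value!r}")
    # The first 14 digits are always the second-granularity timestamp.
    parsed = datetime.strptime(value[:14], "%Y%m%d%H%M%S")
    if len(value) == 17:
        # Any remaining three digits are treated as milliseconds.
        parsed += timedelta(milliseconds=int(value[14:]))
    return parsed


seconds_form = parse_hoodie_commit_time("20211117172738")
millis_form = parse_hoodie_commit_time("20211117172738123")
print(seconds_form.isoformat(), millis_form.isoformat())
```

Because the values are fixed-width digit strings, plain string comparison orders them chronologically within one format, which is one reason they can serve as begin and end bounds for incremental queries.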
1
vote
0 answers

After upsert is performed on the original table, writeToken in the Parquet file name of Hudi changes, resulting in Incremental query failure

0 Suspected cause: Every time we upsert the target, hoodie generates a log and compacts it, causing any incremental query from before that point in time to fail. 1 Here are all the operations performed on the original table. 1.1 Operation 1 (Update) An…
1
vote
1 answer

Is there a way to use Apache Hudi on AWS glue?

Trying to explore Apache Hudi for doing incremental loads using S3 as a source and then saving the output to a different location in S3 through an AWS Glue job. Are there any blogs/articles that can help as a starting point?
shikeb
1
vote
2 answers

Unable to run spark.sql on AWS Glue Catalog in EMR when using Hudi

Our setup is a default Data Lake on AWS using S3 as storage and Glue Catalog as our metastore. We are starting to use Apache Hudi and we could get it working following the AWS documentation. The issue is that, when using the…
gabra