Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform focused on batch and event processing, with ACID support. Use this tag for questions specific to Apache Hudi. Do not use it for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
0
votes
1 answer

AWS S3 (ap-south-1) returns Bad Request for Hudi DeltaStreamer job

I'm trying to run a DeltaStreamer job to push data to an S3 bucket using the following command: spark-submit \ --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \ --conf…
0
votes
2 answers

Why is Apache Hudi creating a COPY_ON_WRITE table even though I specified MERGE_ON_READ?

I am trying to create a simple Hudi table with the MERGE_ON_READ table type. After executing the code, I still see hoodie.table.type=COPY_ON_WRITE in the hoodie.properties file. Am I missing something here? Jupyter Notebook for this code:…
sannidhi
  • 23
  • 5
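
For readers landing on the question above, here is a minimal sketch of how the table type is usually passed to the Spark datasource writer. The helper name `hudi_write_options` and the table and field names are hypothetical; the option keys are the ones documented for recent Hudi releases (older releases may use a different key for the table type). Note also that the table type is recorded in hoodie.properties when the table is first created, so writing to an already existing base path may keep the original type.

```python
# Hypothetical helper: builds the options dict for a Hudi datasource write.
def hudi_write_options(table_name, record_key, precombine_field,
                       table_type="MERGE_ON_READ"):
    allowed = {"COPY_ON_WRITE", "MERGE_ON_READ"}
    if table_type not in allowed:
        raise ValueError(f"table_type must be one of {sorted(allowed)}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }

# "demo_table", "id" and "ts" are made-up names for illustration.
options = hudi_write_options("demo_table", "id", "ts")

# With a SparkSession and the Hudi bundle on the classpath, the options
# would typically be applied like this (not executed here):
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Validating the table type up front is a design choice of this sketch: it turns a misspelled value into an immediate error instead of leaving the outcome to the writer.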
0
votes
2 answers

How to prevent Hudi from writing partition columns into the data?

Consider the following: data is read from a partitioned structure y=,m=,d=. The Hudi DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY config is set to y=,m=,d=. At first glance I have to remove the y, m, d columns, but without these columns Hudi will not…
Cherry
  • 31,309
  • 66
  • 224
  • 364
0
votes
1 answer

Delete in Apache Hudi - Glue Job

I have to build a Glue job for updating and deleting old rows in an Athena table. When I run my job for deleting, it returns an error: AnalysisException: 'Unable to infer schema for Parquet. It must be specified manually.;' My Glue job: datasource0 =…
Mateja K
  • 57
  • 2
  • 12
0
votes
1 answer

How can Apache Hudi merge delta data asynchronously?

I'm new to Apache Hudi. In Apache Hudi, the merge-on-read table type merges delta data asynchronously. It is merged when data is queried or when the merge config (interval or unmerged commit count) is met. But Hudi does not have its own background process, otherwise…
SHRIN
  • 318
  • 3
  • 15
0
votes
2 answers

Need help submitting a Hudi DeltaStreamer job via Apache Livy

I am a little confused about how to pass the arguments as REST API JSON. Consider the spark-submit command below. spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \ --master yarn \ …
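
As a rough illustration of the mapping asked about above, the sketch below builds a JSON body for Livy's `POST /batches` endpoint. The fields `file`, `className`, `args` and `conf` are part of Livy's batch API, and `--packages` corresponds to the Spark property `spark.jars.packages`. The jar path and the application arguments are hypothetical placeholders, since the original command is truncated; the Spark master is normally configured on the Livy server rather than per request.

```python
import json

# Hypothetical helper: maps spark-submit style inputs to a Livy batch payload.
def livy_batch_payload(jar_path, class_name, packages, app_args):
    return {
        "file": jar_path,                                   # application jar
        "className": class_name,                            # main class
        "args": list(app_args),                             # program arguments
        "conf": {"spark.jars.packages": ",".join(packages)},  # --packages
    }

payload = livy_batch_payload(
    jar_path="hdfs:///path/to/hudi-utilities-bundle.jar",  # placeholder path
    class_name="org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    packages=[
        "org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3",
        "org.apache.spark:spark-avro_2.11:2.4.4",
    ],
    # Placeholder arguments; the real list depends on the elided command.
    app_args=["--target-base-path", "s3://bucket/path", "--target-table", "demo"],
)

body = json.dumps(payload, indent=2)
print(body)
```

The resulting `body` would be sent with the header `Content-Type: application/json` to the Livy server's `/batches` URL.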
0
votes
1 answer

Issue creating an external Hive table with Hudi

I am trying to create an external table in the Hive metastore using the Apache Hudi framework. It is able to connect to the Hive metastore but throws an exception after connecting, when trying to create the table. dataFrame.writeStream …
sreekesh.s
  • 158
  • 1
  • 8
0
votes
1 answer

Spark Datasource Hudi table read using instant time

I'm reading a Hudi table using Spark.read.format("hudi") and want to understand how the option hoodie.datasource.read.begin.instanttime works. Is it similar to Hudi's hoodie_commit_ts column available in the Parquet files? I'm not able to get the same count…
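
To make the option in the question above more concrete, here is a small sketch, assuming the documented incremental-query semantics: records from commits written after the begin instant and up to and including the end instant are returned. The helper names and the commit times are made up, and `commits_in_window` is only a toy model of that window, not Hudi code. Older Hudi releases may use a different key than `hoodie.datasource.query.type`.

```python
# Hypothetical helper: options for an incremental read of a Hudi table.
def incremental_read_options(begin_instant, end_instant=None):
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts

# Toy model of the window: begin is exclusive, end is inclusive.
# Instant times are fixed-width timestamp strings, so plain string
# comparison orders them chronologically.
def commits_in_window(commit_times, begin_instant, end_instant=None):
    return [
        c for c in commit_times
        if c > begin_instant and (end_instant is None or c <= end_instant)
    ]

commits = ["20210101000000", "20210102000000", "20210103000000"]  # made up
read_opts = incremental_read_options("20210101000000", "20210102000000")
selected = commits_in_window(commits, "20210101000000", "20210102000000")

# With Spark and the Hudi bundle available (not executed here):
# spark.read.format("hudi").options(**read_opts).load(base_path)
```

Under this model, a count taken with a begin instant equal to some commit time would not include records from that commit itself, which is one possible reason for a count mismatch.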
0
votes
1 answer

Do the latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading ORC files?

The documentation says: Hudi works with Spark-2.x & Spark 3.x versions. (https://hudi.apache.org/docs/quick-start-guide.html) But I have not been able to use hudi-spark-bundle_2.11 version 0.7.0 with Spark 2.3.0 and Scala 2.11.12. Is there any…
Joyan
  • 41
  • 1
  • 7
0
votes
1 answer

Error consuming records caused by SdkInterruptedException when inserting into Hudi Table

I have a Hudi table that I created from a migration, so it has billions of rows. The migration ran without problems, but as soon as I started a streaming job to write fresh data to this table, these errors occurred: ERROR - error…
0
votes
1 answer

Apache Hudi example from spark-shell throws error for Spark 2.3.0

I am trying to run this example (https://hudi.apache.org/docs/quick-start-guide.html) using spark-shell. The Apache Hudi documentation says "Hudi works with Spark-2.x versions". The environment details are: Platform: HDP 2.6.5.0-292 Spark version:…
Joyan
  • 41
  • 1
  • 7
0
votes
0 answers

Debezium + Schema Registry Avro Schema: why do I have the "before" and "after" fields, and how do I use that with HudiDeltaStreamer?

I have a table in PostgreSQL with the following schema: Table "public.kc_ds" Column | Type | Collation | Nullable | Default | Storage | Stats target…
0
votes
3 answers

Databricks - java.lang.NoClassDefFoundError: org/json/JSONException

We can't figure out the following issue: we are trying to use Apache Hudi to save data to the storage. The problem is when we upload a fat jar which includes the org.json package in dependencies, the df.save() application is failing…
eugen-fried
  • 2,111
  • 3
  • 27
  • 48
0
votes
2 answers

Install Hudi ver. 0.6.0 on AWS EMR

Can anyone help me properly install Hudi 0.6.0 on AWS EMR version 6.0.0? I think AWS has added some custom scripts to make Hudi work properly on EMR.
ASHISH M.G
  • 522
  • 2
  • 7
  • 23
0
votes
1 answer

Using Apache Hudi with Python/PySpark

Has anyone used Apache Hudi in a PySpark environment? If it is possible, are there any code samples available?