Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and event processing (with ACID support). Use this tag for questions specific to problems with Apache Hudi. Do not use this tag for general questions about data lakes or Delta Lake.

Questions on using Apache Hudi

158 questions
0 votes, 2 answers

Does Hudi support the `update` operation?

I get an exception when updating a record with Spark SQL for Hudi, as follows: `update hudi.cow1 set price=1300 where id=2;` 22/10/17 19:24:44 ERROR Executor: Exception in task 0.0 in stage 206.0 (TID 2442) org.apache.avro.AvroRuntimeException: Not a…
Angle Tom
0 votes, 1 answer

Can I use incremental, time travel, and snapshot queries with Hudi using only spark-sql?

I'm trying to run incremental, snapshot, and time travel queries using spark-sql with Hudi, but the only way I can find to do this is to create a DataFrame with spark.read and then create a temp view. Is there any way to accomplish this with…
0 votes, 3 answers

How to add Hudi Package to local AWS Glue Interactive Notebook

I have set up Glue interactive sessions locally by following the documentation (https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html). However, I am not able to add any additional packages, such as Hudi, to the interactive session. There are a few magic…
NarenS
0 votes, 1 answer

org.apache.flink.table.api.TableException: Unsupported query: Merge Into

I am working on a Flink streaming job where I need to upsert data into a Hudi table. I am using a MERGE INTO query to upsert the data. Table table = tableEnv.fromDataStream(KafkaStreamTableDataStreamStream); …
lucy
0 votes, 2 answers

Can I use a MySQL database as destination storage for Apache Hudi?

I am new to Apache Hudi. Please let me know whether Apache Hudi provides any configuration for writing data to a MySQL database.
ash
0 votes, 2 answers

Hudi data is overwritten on every new batch of Spark Structured Streaming

I am working on a Spark Structured Streaming job that consumes Kafka messages, aggregates them, and saves the data to an Apache Hudi table every 10 seconds. The code below works fine, but it overwrites the resulting Apache Hudi table data on every batch. I…
lucy
0 votes, 1 answer

Hoodie (Hudi) precombine field failing on NULL

My AWS Glue job for Hudi CDC is failing on a column that is a precombine field (see error message below). I have validated that there are no NULL values on this column (it has an AFTER UPDATE Trigger and a default of NOW() set). When I query the…
J Weezy
0 votes, 0 answers

Hudi Failed to delete for commit time for certain records

I have a COW table and am able to insert and update records using Glue ETL without any issues. However, when I try to delete some records, I get the following error: An error occurred while calling…
Sateesh K
0 votes, 1 answer

How to update/delete a record in a Hudi table in AWS?

I have a requirement to update or delete a record in a Hudi table. One way to do that is with PySpark/Scala, following the steps in this guide: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html Also…
GOPI M
0 votes, 1 answer

Apache Hudi Serialize issue

Could someone please help me fix this error? I get the error below when I try to update the data: py4j.protocol.Py4JJavaError: An error occurred while calling o84.save. : org.apache.hudi.exception.HoodieException: hoodie only…
0 votes, 1 answer

How to change Hudi table version via Hudi CLI

How do I change the table version via the Hudi CLI? Steps: ssh into EMR; kick off the Hudi CLI with `/usr/lib/hudi/cli/bin/hudi-cli.sh` (version of the Hudi CLI is 1); connect to my table with `connect --path s3://bucket/db/table`. In the desc of the table I see…
Andreina
0 votes, 1 answer

AWS Glue: How to output only the latest file in an S3 bucket

I use AWS Glue and Apache Hudi to replicate data from RDS to S3. If I execute the following job, 2 Parquet files (the initial one and the updated one) are generated in the S3 bucket (basePath). In this case, I want only the single latest file, and would like to…
satohh
0 votes, 1 answer

[HUDI] Creating append-only raw data in Hudi

I am trying to adopt Hudi in our project. I am looking at three levels of data: Raw (S3) --> Cleaned (HUDI, append only) --> Standard (HUDI, upserts). The idea is to keep a Cleaned bucket for clean data in append-only mode. This can be used by…
Amit Joshi
0 votes, 1 answer

Hudi DeltaStreamer job via Apache Livy

Please help me pass the --props file and --source-class to a Livy API POST request. spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \ --master yarn \ --deploy-mode cluster \ --conf…
codek
0 votes, 0 answers

Issue with Apache Hudi Update and Delete Operation on Parquet S3 File

I am trying to simulate updates and deletes on a Hudi dataset and want to see the state reflected in the Athena table. We use the EMR, S3, and Athena services of AWS. Attempting a record update with a withdrawal object withdrawalID_mutate =…