Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform focused on batch and event processing with ACID support. Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
0
votes
0 answers

Compaction of a MOR Hudi table keeps the old values

I have a Hudi table that I write as MOR. Here's the config: conf = { 'className': 'org.apache.hudi', 'hoodie.table.name': hudi_table_name, 'hoodie.datasource.write.operation': 'upsert', 'hoodie.datasource.write.table.type':…
Mee
  • 1,413
  • 5
  • 24
  • 40
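For orientation, a minimal MOR upsert configuration of the kind this question describes might look like the sketch below. The table name, record key, precombine field, and path are placeholders rather than the asker's actual values, and inline compaction is only one way to force compaction to run.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# df: a stand-in for the incoming batch; the columns are assumptions.
df = spark.createDataFrame([(1, 'a', 1000)], ['id', 'name', 'ts'])

hudi_options = {
    'hoodie.table.name': 'my_table',                          # placeholder table name
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'id',          # assumed record key column
    'hoodie.datasource.write.precombine.field': 'ts',         # assumed precombine column
    'hoodie.compact.inline': 'true',                          # run compaction as part of the write
    'hoodie.compact.inline.max.delta.commits': '1',           # compact after every delta commit
}

(df.write.format('hudi')
   .options(**hudi_options)
   .mode('append')
   .save('s3://bucket/prefix/my_table'))                      # placeholder path
```

With merge-on-read, a read-optimized query only reflects the last compacted base files, so whether "old" values show up usually comes down to when compaction runs and which query type is used.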
0
votes
1 answer

Why not write data in Hudi or Iceberg format in flink-table-store?

Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance. After reading the docs, I've had a question in my head for a while. It's about the design of the file storage. It looks…
0
votes
1 answer

How to get the latest version of a Hudi table

I have a Spark Streaming job which listens to a Kinesis stream and then writes it to a Hudi table. What I want to do is, say for example, I added these two records to the Hudi table: | user_id | name | timestamp | -------- | --------------…
Mee
  • 1,413
  • 5
  • 24
  • 40
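As a point of reference, a snapshot query (the default query type) returns the latest committed value for each record key. The sketch below is illustrative only, with a placeholder path and a hypothetical user_id filter.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A snapshot read returns the latest committed row per record key.
latest_df = (spark.read.format('hudi')
             .option('hoodie.datasource.query.type', 'snapshot')  # default; shown for clarity
             .load('s3://bucket/prefix/my_table'))                # placeholder path
latest_df.where("user_id = 'u1'").show()                          # hypothetical key value
```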
0
votes
1 answer

Hudi COW table - Bulk_Insert produces more files when clustering is enabled compared to Insert mode

I am trying to use clustering configurations on a Hudi COW table to keep only a single file in each partition folder when the total partition data size is less than 128 MB. But it seems that clustering is not working with bulk_insert as expected. We…
Vinay Sinha
  • 193
  • 2
  • 13
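For context, inline clustering on a COW table is typically driven by options like the ones sketched below; the byte thresholds are illustrative and not the asker's exact values.

```python
# Hedged sketch of inline clustering options; values are illustrative.
clustering_options = {
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.clustering.inline': 'true',
    'hoodie.clustering.inline.max.commits': '1',               # cluster after every commit
    # files below this size are candidates to be clustered together (bytes)
    'hoodie.clustering.plan.strategy.small.file.limit': str(128 * 1024 * 1024),
    # target size of the files produced by clustering (bytes)
    'hoodie.clustering.plan.strategy.target.file.max.bytes': str(128 * 1024 * 1024),
}
```

bulk_insert tends to write one file per parallel write task before clustering runs, which may explain why the initial file count is higher than with insert mode.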
0
votes
1 answer

Creating an Athena view on a Hudi table returns soft-deleted records when the view is read using Spark

I have multiple Hudi tables with differing column names, and I built a view on top of them to standardize the column names. When this view is read from Athena, it returns a correct response. But when the same view is read using Spark using…
sashmi
  • 97
  • 1
  • 2
  • 14
0
votes
2 answers

Deleting records from an Apache Hudi table that is part of Glue tables created using an AWS Glue job and Kinesis

I currently have a DynamoDB stream configured which feeds records into a Kinesis data stream whenever an insertion/update happens, and subsequently I have Glue tables which take input from the above Kinesis stream and then display the…
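For reference, a hard delete from a Spark-based Glue job usually amounts to writing the keys to remove with the delete operation. Everything named below (table, key and precombine columns, path) is a placeholder rather than the asker's setup.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Stand-in for the keys to remove; column names are assumptions.
keys_to_delete_df = spark.createDataFrame([(42, 0)], ['id', 'ts'])

delete_options = {
    'hoodie.table.name': 'my_table',                          # placeholder
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.recordkey.field': 'id',          # assumed record key column
    'hoodie.datasource.write.precombine.field': 'ts',         # assumed precombine column
}

(keys_to_delete_df.write.format('hudi')
    .options(**delete_options)
    .mode('append')
    .save('s3://bucket/prefix/my_table'))                     # placeholder path
```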
0
votes
1 answer

Hudi with Spark performs very slowly when trying to write data to the filesystem

I'm trying Apache Hudi with Spark in a very simple demo: with SparkSession.builder.appName(f"Hudi Test").getOrCreate() as spark: df = spark.read.option('mergeSchema', 'true').parquet('s3://an/existing/directory/') hudi_options = { …
Rinze
  • 706
  • 1
  • 5
  • 21
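As a rough illustration of the usual tuning advice for an initial load, a bulk_insert with explicit shuffle parallelism tends to be much cheaper than an upsert. The column names, parallelism, table name, and output path below are assumptions, not the asker's values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('s3://an/existing/directory/')        # path quoted from the question

hudi_options = {
    'hoodie.table.name': 'hudi_test',                         # placeholder table name
    'hoodie.datasource.write.recordkey.field': 'id',          # assumed key column
    'hoodie.datasource.write.precombine.field': 'ts',         # assumed precombine column
    'hoodie.datasource.write.operation': 'bulk_insert',       # skips the index lookup that upsert performs
    'hoodie.bulkinsert.shuffle.parallelism': '200',           # illustrative parallelism
}

df.write.format('hudi').options(**hudi_options).mode('overwrite').save('s3://bucket/hudi_test')
```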
0
votes
1 answer

How to encrypt Apache Hudi external table data present in S3 and synced into Hive tables through Spark jobs

Technical background: I am getting table data from Kafka and putting it into Hudi and Hive tables using Spark. I am using AWS EMR. I want to encrypt data in transit within the cluster as well as the synced external table data present in S3 (Data at…
Roobal Jindal
  • 214
  • 2
  • 13
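For the data-at-rest part, one hedged sketch is enabling SSE-KMS on S3A writes at session build time, as below. The KMS key ARN is a placeholder, and on EMR the s3:// (EMRFS) scheme and in-transit encryption are normally handled through an EMR security configuration rather than Spark options.

```python
from pyspark.sql import SparkSession

# Hedged sketch: SSE-KMS for S3A writes; the KMS key ARN is a placeholder.
spark = (SparkSession.builder
         .appName('hudi-sse-kms-sketch')
         .config('spark.hadoop.fs.s3a.server-side-encryption-algorithm', 'SSE-KMS')
         .config('spark.hadoop.fs.s3a.server-side-encryption.key',
                 'arn:aws:kms:us-east-1:111122223333:key/placeholder')
         .getOrCreate())
```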
0
votes
0 answers

PySpark Hudi writing timestamps as binary

I am trying to write a PySpark DataFrame to S3 in Hudi parquet format. Everything is working fine; however, the timestamps are written as binary. I would like to write them in the Hive timestamp format so that I can query the data in Athena. PySpark config as…
Sql_Peter
  • 3
  • 3
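One option commonly pointed at for this symptom is the Hive-sync timestamp flag. The sketch below is an assumption about what a relevant configuration might include, not the asker's actual settings.

```python
# Hedged sketch: keep timestamp columns as TIMESTAMP (rather than bigint/binary) when
# syncing the table to the Hive/Glue catalog that Athena reads.
hudi_options = {
    'hoodie.table.name': 'events',                              # placeholder
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
}
```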
0
votes
0 answers

Delete Apache Hudi rows with a duplicate record key

I ran into some trouble in Hudi when I delete rows with the same record key using spark-sql. e.g. I created a table and set recordKey=empno: CREATE TABLE emp_duplicate_pk ( empno int, ename string, job string, mgr int, hiredate…
mcspter
  • 1
  • 2
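For context, Hudi's Spark SQL layer supports DELETE statements. The sketch below reuses the table name from the question, but the predicate value is purely illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 7499 is an illustrative empno value, not taken from the question.
spark.sql("DELETE FROM emp_duplicate_pk WHERE empno = 7499")
```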
0
votes
1 answer

How to insert struct and map types in Apache Hudi

I've checked the official documentation and there are no samples about inserting complex types like struct and map. So, what's the syntax? My table definition: spark-sql> desc struct_map; _hoodie_commit_time string NULL _hoodie_commit_seqno string …
Smith Cruise
  • 404
  • 1
  • 4
  • 19
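For reference, Spark SQL's named_struct() and map() builtins are the usual way to construct such values in an INSERT. The column names below are guesses based only on the table name in the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Column names are assumptions; named_struct() and map() build the complex values inline.
spark.sql("""
    INSERT INTO struct_map
    SELECT 1                                              AS id,
           named_struct('city', 'Beijing', 'zip', 100000) AS address,
           map('k1', 'v1', 'k2', 'v2')                    AS tags,
           1000                                           AS ts
""")
```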
0
votes
1 answer

I encountered an error when using Flink to insert data into an Apache Hudi table

Environment: Flink: 1.15.2 Hudi flink: hudi-flink1.15-bundle-0.12.0.jar When I execute the statements: Flink SQL> CREATE TABLE t1( > uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED, > name VARCHAR(10), > age INT, > ts TIMESTAMP(3), > `partition`…
he wang
  • 11
  • 2
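For completeness, the DDL in the excerpt matches the Hudi Flink quickstart. Expressed through PyFlink it might look like the sketch below, where the path is a placeholder; the bundle version has to match the Flink version (here 1.15) and the jar must be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Hedged sketch: the quickstart-style DDL from the question, run through PyFlink.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE t1 (
        uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
        name VARCHAR(10),
        age INT,
        ts TIMESTAMP(3),
        `partition` VARCHAR(20)
    ) PARTITIONED BY (`partition`) WITH (
        'connector' = 'hudi',
        'path' = 's3://bucket/prefix/t1',
        'table.type' = 'MERGE_ON_READ'
    )
""")
```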
0
votes
0 answers

Issue inserting Hudi data into S3

I am using Hudi to insert data into S3. The Hudi table can be created and data is also inserted into the table with no issue. But when I select from the table, no results are returned. And when I check S3, no related files have been generated either. Where can I…
0
votes
0 answers

Custom HoodieRecordPayload for use in Flink SQL

I am trying to use Apache Hudi with Flink SQL by following Hudi's Flink guide. The basics are working, but now I need to provide a custom implementation of HoodieRecordPayload as suggested in this FAQ. But when I am passing this config as shown in…
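On the Flink SQL side, a custom payload implementation is normally passed as a table option. The sketch below assumes the option key is 'payload.class' and that com.example.MyPayload exists on the classpath; both are assumptions that depend on the Hudi version in use.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Hedged sketch: passing a custom HoodieRecordPayload class as a table option.
# 'payload.class' and com.example.MyPayload are assumptions, not verified against a release.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE t_custom_payload (
        id VARCHAR(20) PRIMARY KEY NOT ENFORCED,
        val VARCHAR(100),
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'hudi',
        'path' = 's3://bucket/prefix/t_custom_payload',
        'table.type' = 'MERGE_ON_READ',
        'payload.class' = 'com.example.MyPayload'
    )
""")
```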
0
votes
1 answer

Flink write to Hudi with different schemas extracted from a Kafka datastream

So I have a Kafka topic which contains Avro records with different schemas. I want to consume from that Kafka topic in Flink and create a datastream of Avro GenericRecord (this part is done). Now I want to write that data to Hudi using schema…
terminal
  • 72
  • 3