Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform with a focus on batch and event processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.

Questions on using Apache Hudi

158 questions
0
votes
0 answers

How to pass AWS keys to access S3 with Apache Hudi on LocalStack?

I am using the Docker image localstack/localstack:2.0.2 and attempting to write to S3 in it using PySpark (3.1.1) / Apache Hudi (0.13.0) with the following options: { 'hoodie.table.name': 'foo', ... …
Randomize
  • 8,651
  • 18
  • 78
  • 133
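A common fix for this kind of LocalStack setup is to point Spark's S3A filesystem at the LocalStack endpoint and supply dummy credentials alongside the Hudi writer options. The sketch below is a minimal, hedged example: the endpoint, credentials, record-key and precombine fields are placeholders, and the actual write call (which needs a running Spark session with the Hudi and hadoop-aws jars) is shown commented out.

```python
# Sketch, assuming a LocalStack container on the default edge port 4566.
# S3A settings that route Spark's S3 client at LocalStack instead of AWS:
s3a_conf = {
    "spark.hadoop.fs.s3a.endpoint": "http://localhost:4566",  # LocalStack edge port
    "spark.hadoop.fs.s3a.access.key": "test",         # LocalStack accepts any key
    "spark.hadoop.fs.s3a.secret.key": "test",
    "spark.hadoop.fs.s3a.path.style.access": "true",  # needed for non-AWS endpoints
}

# Hudi writer options, matching the question's snippet; the key and
# precombine fields below are placeholders for your own columns.
hudi_options = {
    "hoodie.table.name": "foo",
    "hoodie.datasource.write.recordkey.field": "id",   # placeholder field
    "hoodie.datasource.write.precombine.field": "ts",  # placeholder field
}

# With a SparkSession built from s3a_conf:
# df.write.format("hudi").options(**hudi_options).mode("overwrite") \
#   .save("s3a://my-bucket/foo")
```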
0
votes
2 answers

Apache Spark: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3Adapter

I have run the following code via IntelliJ and it runs successfully. The code is shown below. import org.apache.spark.sql.SparkSession object HudiV1 { // Scala code case class Employee(emp_id: Int, employee_name: String, department: String,…
pacman
  • 725
  • 1
  • 9
  • 28
0
votes
1 answer

Hudi DeltaStreamer with AWS Glue Data Catalog syncs the database, but not the tables

This is similar to being unable to sync the AWS Glue Data Catalog when you run spark-submit with Hudi DeltaStreamer, except that only the database (and not the tables) gets synced. E.g. you submit: spark-submit \ --conf…
Will
  • 11,276
  • 9
  • 68
  • 76
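When only the database appears in the catalog, the usual suspects are the Hive-sync options passed to the writer. The sketch below lists the relevant Hudi hive-sync settings as a plain dict; the database, table, and partition field names are placeholders, and on EMR these would be passed as `--hoodie-conf` arguments or writer options.

```python
# Hedged sketch: hive-sync options so tables (not just the database) are
# registered in the Glue Data Catalog. Names below are placeholders.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",             # talk to the metastore (Glue) directly
    "hoodie.datasource.hive_sync.database": "my_database", # placeholder database
    "hoodie.datasource.hive_sync.table": "my_table",       # placeholder table
    "hoodie.datasource.hive_sync.partition_fields": "dt",  # placeholder partition column
}
```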
0
votes
1 answer

Running Hudi DeltaStreamer on EMR succeeds, but does not sync to AWS Glue Data Catalog

When I run Hudi DeltaStreamer on EMR, I see the Hudi files get created in S3 (e.g. I see a .hoodie/ dir and the expected parquet files in S3). The command looks something like: spark-submit \ --conf…
Will
  • 11,276
  • 9
  • 68
  • 76
0
votes
1 answer

Hudi DeltaStreamer error on `User class threw exception: java.lang.NullPointerException` at `...SchemaRegistryProvider.fetchSchemaFromRegistry`

If you run into this error: 23/04/03 16:19:49 INFO Client: client token: N/A diagnostics: User class threw exception: java.lang.NullPointerException at…
Will
  • 11,276
  • 9
  • 68
  • 76
0
votes
1 answer

Writing data from Multi-Cluster into Hudi tables in S3

For multi-cluster writes in S3, Delta Lake uses DynamoDB to atomically check whether a file is present before writing it, because S3 does not support a "put-if-absent" consistency guarantee. Therefore, in order to leverage this feature using Delta Lake…
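Hudi addresses the same S3 limitation with its own optimistic concurrency control and a pluggable lock provider, including a DynamoDB-based one. The sketch below shows the relevant writer options as a plain dict; the DynamoDB table name and region are placeholders.

```python
# Hedged sketch: Hudi multi-writer options using the DynamoDB lock provider.
# Table and region values are placeholders for your own environment.
lock_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",       # placeholder table
    "hoodie.write.lock.dynamodb.partition_key": "tablename",
    "hoodie.write.lock.dynamodb.region": "us-east-1",       # placeholder region
    "hoodie.cleaner.policy.failed.writes": "LAZY",          # required with OCC
}
```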
0
votes
0 answers

Getting java.lang.NullPointerException while reading Hudi table data with Spark SQL inside a Spark action

I am getting a weird PySpark Java null pointer exception when calling any action through PySpark. Below is the snippet: py4j.protocol.Py4JJavaError: An error occurred while calling o432.showString. : org.apache.spark.SparkException: Job…
0
votes
1 answer

How to update record in OverwriteWithLatestAvroPayload.preCombine

I have a requirement where I need to combine fields from old and new records in OverwriteWithLatestAvroPayload.preCombine. The default behaviour selects either the old record or the new one based on the ordering value, but in my case I need to combine fields from…
Yabha
  • 11
  • 2
0
votes
0 answers

Flink crashes when I try to create a new table

Hello, I'm working with a Flink-MSK-Hudi architecture and I want to ingest data into my AWS Glue catalog. Ingesting the data into an S3 bucket in Hudi format works fine; the problem is when I set the Hive properties in the Hudi…
0
votes
0 answers

GENERIC_INTERNAL_ERROR: Field new_test2 not found in log schema. Query cannot proceed! Derived Schema Fields:

There's a Hudi table written as Parquet files in S3 that I am trying to query using Athena. At first it worked fine, but after I added a column and queried again I got this error: GENERIC_INTERNAL_ERROR: Field new_test2 not found…
Mee
  • 1,413
  • 5
  • 24
  • 40
0
votes
0 answers

AWS EMR managed auto scaling repeatedly scales task nodes down to 0 and back up while the Spark job is running

Environment: AWS EMR cluster with managed autoscaling turned on, running a Hudi job. Issue: I enabled auto scaling with a minimum of 2 and a maximum of 8 task nodes, and a maximum of 2 core nodes with 2 on-demand capacity. I ran a Spark job, it…
0
votes
0 answers

How to use temporal table join in batch mode Flink SQL?

In order to revise T+1 data due to data delays, I want to execute a temporal table join using Flink SQL in batch mode. The official Flink documentation shows that this join supports running in batch mode, but I got an error when executing the SQL:…
Felix Feng
  • 281
  • 3
  • 7
0
votes
2 answers

Can partitioning data in Apache Hudi optimize AWS Spectrum query?

I'm using AWS Redshift Spectrum to query a Hudi table. As we know, filtering data by the partition column when querying in Spectrum can reduce the amount of data scanned and speed up the query. My question is: if I use Spectrum to…
Rinze
  • 706
  • 1
  • 5
  • 21
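Partition pruning in external engines depends on how the Hudi table is laid out at write time. The sketch below shows the writer options that control partitioning as a plain dict; the partition column `dt` is a placeholder for your own field.

```python
# Hedged sketch: writer options that partition a Hudi table by a column so
# query engines like Spectrum can prune partitions. "dt" is a placeholder.
partition_options = {
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.hive_style_partitioning": "true",  # dt=2023-01-01/ style paths
}
```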
0
votes
1 answer

pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame

I have a Glue streaming job, and I need to write the data as a stream after applying some processing, so I did the following: data_frame_DataSource0 = glueContext.create_data_frame.from_catalog( database=database_name, …
Mee
  • 1,413
  • 5
  • 24
  • 40
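This error usually means the processing step turned the streaming DataFrame into a batch one before `writeStream` was called. A common Glue pattern is to let Glue drive the stream and apply the processing inside a per-batch function via `GlueContext.forEachBatch`. The sketch below is hedged: the function and option names for the Glue call are shown commented out since they need a Glue runtime, and `hudi_options`, `path`, and `checkpoint_path` are placeholders.

```python
# Hedged sketch of Glue's per-batch streaming pattern: transformations run
# inside the batch function on each micro-batch, so writeStream is never
# called on a non-streaming DataFrame.

def process_batch(batch_df, batch_id):
    # Apply transformations to the micro-batch here, then write it out.
    transformed = batch_df  # placeholder for real processing
    # transformed.write.format("hudi").options(**hudi_options) \
    #     .mode("append").save(path)

# glueContext.forEachBatch(
#     frame=data_frame_DataSource0,  # streaming DataFrame from create_data_frame.from_catalog
#     batch_function=process_batch,
#     options={"windowSize": "60 seconds", "checkpointLocation": checkpoint_path},
# )
```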
0
votes
1 answer

Flink SQL-Cli: Hudi is abstract

I'm trying to recreate the common Flink example working with Hudi (https://hudi.apache.org/docs/flink-quick-start-guide), but when I try to insert the example data an error appears. Can someone help me with this? The steps that I'm following in my…
Valle1208
  • 43
  • 4