I'm creating a data pipeline in AWS to move data from S3 to Redshift via EMR. The data is stored in S3 in the HUDI format as Parquet files. I've created a PySpark script for the full-load transfer, and for POC purposes my Hudi table in S3 contains 4 columns (id, name, date, validated). Validated is of boolean data type and can be either true or false. If I remove the validated column, the table moves successfully, but if it contains the validated column, which is of boolean type, the script throws the error below:
File "/home/hadoop/move.py", line 41, in <module>
hudi_df.write.format("jdbc").options(**redshift_conn_options).mode("append").save()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1396, in save
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o77.save.
: com.amazon.redshift.util.RedshiftException: ERROR: Column "table_1.validated" has unsupported type "bit".
Redshift is treating the boolean column (validated) as a "bit" data type, even though it is actually boolean. I'm running this PySpark script on an EMR cluster.
Code I'm using:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("HUDI to Redshift") \
    .getOrCreate()

# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
    'hoodie.table.name': 'test2',
    'hoodie.datasource.write.recordkey.field': 'id',
    #'hoodie.datasource.write.partitionpath.field': 'date',
    #'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.operation': 'insert',
    #'hoodie.upsert.shuffle.parallelism': 150,
    'hoodie.insert.shuffle.parallelism': 2
}

# Read the Hudi files into a DataFrame
hudi_df = spark.read \
    .format('org.apache.hudi') \
    .options(**hudiOptions) \
    .load('s3a://new-incr-test-data-1/test2/')

# Define Redshift connection options
redshift_conn_options = {
    "url": "jdbc:redshift://central-redshift-1.xxxxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev",
    "dbtable": "bol",
    "user": "user",
    "password": "xxxxxxxx",
    "aws_iam_role": "arn:aws:iam::xxxxxxxxxxxx:role/Redshift_Role"
}

# Write the Hudi DataFrame to Redshift
hudi_df.write.format("jdbc").options(**redshift_conn_options).mode("append").save()

# Stop Spark session
spark.stop()
I've tried multiple things, such as casting the "validated" column to the appropriate boolean type before writing to Redshift (a rough version of that attempt is shown below), and I've made many other changes to the script, but none of it works. Before writing the data to Redshift I also print all the column data types, and there the "validated" column shows as boolean, so I don't understand why Redshift treats it as the "bit" data type.
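For reference, the cast attempt looked roughly like this (a sketch of what I tried just before the JDBC write, not my exact code):

from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType

# Explicitly cast "validated" to Spark's BooleanType before writing
hudi_df = hudi_df.withColumn("validated", col("validated").cast(BooleanType()))

# Confirm the column types going into the JDBC write
hudi_df.printSchema()  # "validated" still shows as boolean here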
I'm expecting to move the entire table with all the columns from S3 to Redshift with my PySpark script.