I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. The data is stored in S3 in the Hudi format as Parquet files. I've created a PySpark script for the full-load transfer, and for POC purposes my Hudi table in S3 contains 4 columns (id, name, date, validated). validated is of boolean data type and can be either true or false. If I remove the validated column, the table moves successfully, but if it contains the validated column, which is of boolean type, the script throws the error below:

File "/home/hadoop/move.py", line 41, in <module>
hudi_df.write.format("jdbc").options(**redshift_conn_options).mode("append").save()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1396, in save
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o77.save.
: com.amazon.redshift.util.RedshiftException: ERROR: Column "table_1.validated" has unsupported type "bit".

Redshift is treating the boolean column (validated) as the "bit" data type, but it is actually boolean. I'm running this PySpark script on an EMR cluster.

Code I'm using:

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("HUDI to Redshift") \
    .getOrCreate()

# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
    'hoodie.table.name': 'test2',
    'hoodie.datasource.write.recordkey.field': 'id',
    #'hoodie.datasource.write.partitionpath.field': 'date',
    #'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.operation': 'insert',
    #'hoodie.upsert.shuffle.parallelism': 150,
    'hoodie.insert.shuffle.parallelism': 2
}

# Read the Hudi files into a DataFrame
hudi_df = spark.read \
    .format('org.apache.hudi') \
    .options(**hudiOptions) \
    .load('s3a://new-incr-test-data-1/test2/')

# Define Redshift connection options
redshift_conn_options = {
    "url": "jdbc:redshift://central-redshift-1.xxxxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev",
    "dbtable": "bol",
    "user": "user",
    "password": "xxxxxxxx",
    "aws_iam_role": "arn:aws:iam::xxxxxxxxxxxx:role/Redshift_Role"
}

# Write the Hudi DataFrame to Redshift
hudi_df.write.format("jdbc").options(**redshift_conn_options).mode("append").save()

# Stop Spark session
spark.stop()

I've tried multiple things, like casting the validated column to the appropriate boolean type before writing to Redshift. I've made many changes to the script, but none of them work. Before writing the data to Redshift I also print all the column data types, and there too the validated column shows as boolean type, so I don't know why Redshift is treating it as the "bit" data type.
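
One of the casts I tried looked roughly like this (a sketch from memory; it did not change the outcome):

from pyspark.sql.functions import col

# Re-cast the column to boolean explicitly before the JDBC write
hudi_df = hudi_df.withColumn("validated", col("validated").cast("boolean"))
print(hudi_df.dtypes)  # 'validated' still reports as boolean here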

I'm expecting to move the entire table with all the columns from S3 to Redshift with my PySpark script.

  • Welcome to Stack Overflow! Please take the [tour] and read [ask]. Your questions about code, i.e. debugging questions like these, are served by not only providing the error message, but also the code itself, a [mre]. We are not clairvoyant, so we cannot help you to debug code you do not show us. Also showing what you have already tried to solve the problem is appreciated. [edit] the question accordingly. – Adriaan Aug 22 '23 at 06:54
  • Also, you might want to consider adding a [tag:python] to this question. That seems more useful than [tag:data-engineering] on this particular question. – Adriaan Aug 22 '23 at 06:54
  • Note that you have exposed username and password as well as the URL and your account number in your post. You should immediately change the password. – MP24 Aug 23 '23 at 15:41

1 Answer

With the Spark JDBC data source you can provide a query instead of a table name. That lets you cast the column on the Redshift side, before the Spark DataFrame is even initiated. (The underlying issue is that Spark's default JDBC dialect maps BooleanType to the BIT type, which Redshift does not support.)

So replace:

 "dbtable": "bol"

with:

 "dbtable": "(select col1, col2, cast(bool_col as integer) as bool_col from bol) as tmp"
– parisni