I am using AWS Glue to ETL data to Redshift. I have been encountering an issue where my date is loading as null in Redshift.
What I have set-up:
- Upload csv into S3, see sample data:
item | color | price | date
shirt| brown | 25.05 | 03-01-2018
pants| black | 20.99 | 02-14-2017
Crawl S3 object
Create a Redshift table, see schema:
item: string color: string price: decimal / numeric date: date
Script to load data to Redshift, see script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import to_date, col
from awsglue.dynamicframe import DynamicFrame
glueContext = GlueContext(SparkContext.getOrCreate())
items_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
database = "rdshft-test",
table_name = "items")
items_dynamicframe.printSchema()
#Attempt to get date loaded correctly to Redshift
data_frame = items_dynamicframe.toDF()
data_frame.show()
data_frame = data_frame.withColumn("date",
to_date(col("date"),"d-M-Y"))
data_frame.show()
Any feedback is appreciated. Thank you.