
I am using AWS Glue to ETL data to Redshift. I have been encountering an issue where my date is loading as null in Redshift.

What I have set up:

  • Upload csv into S3, see sample data:

item  | color | price | date
shirt | brown | 25.05 | 03-01-2018
pants | black | 20.99 | 02-14-2017

  • Crawl S3 object

  • Create a Redshift table, see schema:

    item: string
    color: string
    price: decimal / numeric
    date: date

  • Script to load data to Redshift, see script:


    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.sql.functions import to_date, col
    from awsglue.dynamicframe import DynamicFrame
    
    glueContext = GlueContext(SparkContext.getOrCreate())
    
    items_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
           database = "rdshft-test",
           table_name = "items")
    items_dynamicframe.printSchema()
    
    # Cast the date column from string to date so it loads correctly into Redshift
    data_frame = items_dynamicframe.toDF()
    data_frame.show()
    data_frame = data_frame.withColumn("date",
              to_date(col("date"), "MM-dd-yyyy"))
    data_frame.show()
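
For sample dates like `03-01-2018`, the pattern passed to `to_date` must be `MM-dd-yyyy`: the data is month-day-year with a four-digit year, and in Spark's pattern syntax capital `Y` means week-based year, so a pattern like `d-M-Y` silently yields nulls instead of dates. A quick sanity check of the month-day-year ordering with Python's standard library (whose `strptime` directives differ from Spark's pattern letters, but express the same ordering):

```python
from datetime import datetime

# "03-01-2018" is month-day-year: March 1st, 2018.
# Python uses %m-%d-%Y where Spark uses MM-dd-yyyy.
parsed = datetime.strptime("03-01-2018", "%m-%d-%Y")
print(parsed.date())  # 2018-03-01
```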

Any feedback is appreciated. Thank you.

TechNewbie
  • Check the data type after you convert it to format MM/DD/YYYY, and let Glue create a new Redshift table without creating it manually – Prabhakar Reddy May 17 '21 at 03:31

1 Answer


I was able to resolve this issue by converting back to a DynamicFrame. When pulling my data into the notebook, I am working with a DynamicFrame, but to convert the string to a date I had to use a DataFrame (more specifically, PySpark SQL functions). To load into Redshift, I then had to convert back to a DynamicFrame. I assume this is a requirement with Glue?
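
A minimal sketch of the round trip, continuing from the `data_frame` and `glueContext` in the question's script. `DynamicFrame.fromDF` is Glue's conversion back from a Spark DataFrame; the connection name, table, and temp-dir values below are placeholders for your own setup, not anything from the original post:

```python
from awsglue.dynamicframe import DynamicFrame

# Convert the transformed Spark DataFrame back to a Glue DynamicFrame;
# "items_fixed" is just an arbitrary name for the new frame.
items_fixed = DynamicFrame.fromDF(data_frame, glueContext, "items_fixed")

# Write to Redshift via a Glue catalog connection (placeholder names).
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = items_fixed,
    catalog_connection = "my-redshift-connection",
    connection_options = {"dbtable": "items", "database": "rdshft-test"},
    redshift_tmp_dir = "s3://my-temp-bucket/tmp/")
```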

TechNewbie