
I am very new to Glue and have come across a scenario where we have a source table in the Glue catalog and need to write its data to specific columns of a pre-existing table in Redshift, e.g.

source_table_name[source_table_column_name]       target_table_name[target_table_column_name]

employee[id]                                      resource[resource_id]
employee[name]                                    resource[resource_name]
employee[policy]                                  resource[policy_name]
employee[zip]                                     resource[zipcode]
...                                               ...

Could you please share which Glue function can be used to write a UDF in Python that iterates through a given subset of column names from the source table and maps/writes their data to the specified column names in the target table in Redshift (as in the example above)?

For example: write the id column data from the employee (source) table to the resource_id column of the resource (target) table in Redshift, and so on.

I've written the following logic to load data into source_dynf:

    def load_data(self):
        self.logger.info(f"Loading data from Glue Catalog source [{self.source_database}/{self.source_table}]")
        source_dynf = self.glue_context.create_dynamic_frame.from_catalog(
            database=self.source_database,
            table_name=self.source_table,
            transformation_ctx=f"load_{self.source_database}_{self.source_table}"
        )
        return source_dynf

    def process_data(self, source_dynf):
        ### how can I map the data as described above and return a processed_dynf from here that I can write to the Redshift target table?

    def write_data(self):
        ### write to the Redshift target table

Thanks in advance for any suggestions/help!

newbieitTech

1 Answer


If you are just renaming all columns, the typical pattern is:

# in your imports
from awsglue.transforms import ApplyMapping

# just after your from_catalog
source_dynf = ApplyMapping.apply(frame=source_dynf, mappings=[
    ("id", "string", "resource_id", "string"),  # the pattern is: source column name, source column type, target column name, target column type
    ("name", "string", "resource_name", "string")
    # and so on, following the same pattern
], transformation_ctx="mapping")
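
Since the question also asks about the write_data step, here is a minimal sketch of writing the mapped frame to the Redshift target. It assumes you already have a Glue connection to Redshift; the connection name "my-redshift-connection", the dbtable value, and args["TempDir"] are placeholders to swap for your own, and glue_context is the same GlueContext you hold as self.glue_context in your class:

# write the mapped DynamicFrame to the pre-existing Redshift table
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source_dynf,
    catalog_connection="my-redshift-connection",  # placeholder: name of your Glue connection
    connection_options={"dbtable": "resource", "database": "redshift database name"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="write_redshift"
)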

If you plan to use PySpark DataFrames instead, the syntax is simpler and doesn't require dealing with the types:

# in your imports
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)

frame = source_dynf.toDF()
frame = frame.toDF("resource_id", "resource_name")  # and so on; the arguments are the new column names, and the number of string arguments must equal the number of columns in the frame
source_dynf = DynamicFrame.fromDF(frame, glue_context, "final")
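
If you only want to rename the given subset of columns from your mapping (rather than every column positionally), a sketch using plain Spark's withColumnRenamed also works; the rename_map below simply repeats the example columns from the question:

from awsglue.dynamicframe import DynamicFrame

# source column -> target column, per the example mapping in the question
rename_map = {
    "id": "resource_id",
    "name": "resource_name",
    "policy": "policy_name",
    "zip": "zipcode",
}

frame = source_dynf.toDF()
for src_col, tgt_col in rename_map.items():
    # withColumnRenamed is a no-op if src_col isn't present in the frame
    frame = frame.withColumnRenamed(src_col, tgt_col)

source_dynf = DynamicFrame.fromDF(frame, glue_context, "renamed")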

Per the discussion below, you would like to pull the schema from your target table and push it onto the source data. Something like this should do the trick:

# get the schema for the target frame
# see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-redshift.html
# note: you may want to read from a small partition for performance, see:
# https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html
my_conn_options = {
    "url": "jdbc:redshift://host:port/redshift database name",
    "dbtable": "redshift table name",
    "user": "username",
    "password": "password",
    "redshiftTmpDir": args["TempDir"],
    "aws_iam_role": "arn:aws:iam::account id:role/role name"
}

target_frame = glue_context.create_dynamic_frame_from_options("redshift", my_conn_options)
frame = source_dynf.toDF()
frame = frame.toDF(*[field.name for field in target_frame.toDF().schema.fields])  # note: the number of columns must match!
source_dynf = DynamicFrame.fromDF(frame, glue_context, "final")
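
To round this out with the write side (again just a sketch, not tested against your cluster), the same my_conn_options dictionary can be reused to push the renamed frame into the target table:

# write the renamed frame to the Redshift target, reusing the connection options above
glue_context.write_dynamic_frame.from_options(
    frame=source_dynf,
    connection_type="redshift",
    connection_options=my_conn_options,
    transformation_ctx="write_target"
)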
jonlegend
  • Thanks for the solution @jonlegend. One quick question: ```Since source columns are from glue-catalog-table and target columns are in redshift table, would it be better to read column and type from glue-catalog-table schema and map it to redshift columns using some sort of udf(if there is a way to do it)? (I think it'll be better for tables which have more than 100 or 200 columns. I am not sure though how to do that using ApplyMapping.apply).``` – newbieitTech Jun 22 '21 at 02:19
  • Do the names of the target columns follow a pattern? – jonlegend Jun 22 '21 at 02:54
  • @jonlegend ```Sadly not```. I can put some other target table column names (from the redshift table) here as well, e.g. ```CLM_CAT_CD, CITY, CLM_NTR_CD, LOSS_DESC_TEXT, etc.``` – newbieitTech Jun 22 '21 at 03:05
  • So a UDF would only work on rows not columns. If you have glue crawlers (or a from_options read) pull your schema from the target, you could push that schema onto the data loaded from the source table dynamically (provided the number of columns match) as above (let me know if you need some example code). You would ideally want to have a partition on the target with just 1 row so that the load (just to get schema) would be fast and cheap. – jonlegend Jun 22 '21 at 14:21
  • @jonlegend: It'll be awesome if you can post some sample code for that ```to read source table columns from glue catalog (we've crawled the source table and the schema is in the glue catalog)```. @jonlegend Another question I've is ```is there an option to read schema/column names directly from redshift given that we've a glue connection established, or does it have to be crawled before we can read columns from the target redshift table and dynamically map them?``` – newbieitTech Jun 23 '21 at 13:59
  • I'll update my answer a bit later today with some sample code. Basically, you can avoid crawlers by doing a "from_options" read instead of a "from_catalog" read. Following that you can use the toDF function from above on a pyspark dataframe to change the column names to the column names you retrieved from redshift. – jonlegend Jun 24 '21 at 14:20
  • @newbieitTech -- The code above should do the trick. – jonlegend Jun 25 '21 at 16:14