
I'm trying to create an ETL job in AWS Glue. The use case is as follows: when a column gets added to one of the source tables after the ETL job has already run, and we then rerun the ETL job, it fails saying the column was not found (in the target table).

How can I get the ETL job to create that column in the target table? The job already has the right to create the target table when it doesn't exist.

Example:

Source Table:

Table X: column_1, column_2

Table Y: column_1, column_3, column_4

The ETL job is configured to join the two, resulting in:

Table_XY: column_1, column_2, column_3, column_4

Up to this point, everything works perfectly.

Now, if Table Y gets modified as below:

Table Y: column_1, column_3, column_4, **column_5**

And I rerun the crawlers (which detect the new column in the source).

Then, when I rerun the ETL job, it fails with the error message below:

Column "column_5" not found in schema

How can I solve this?

Updated with the Glue script:

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "db_source", table_name = "sourc_table_x", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db_source", table_name = "sourc_table_x", transformation_ctx = "datasource0")

## @type: DataSource
## @args: [database = "db_source", table_name = "sourc_table_y", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasource1"]
## @return: datasource1
## @inputs: []
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db_source", table_name = "sourc_table_y", redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasource1")

## @type: Join
## @args: [keys1 = ['column_1'], keys2 = ['column_1']]
## @return: join2
## @inputs: [frame1 = datasource0, frame2 = datasource1]
join2 = Join.apply(frame1 = datasource0, frame2 = datasource1, keys1 = ['column_1'], keys2 = ['column_1'], transformation_ctx = "join2")

## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = join2]
resolvechoice2 = ResolveChoice.apply(frame = join2, choice = "make_cols", transformation_ctx = "resolvechoice2")

## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

## @type: DataSink
## @args: [catalog_connection = "my-db-connection", connection_options = {"dbtable": "target_table_xy", "database": "db_target"}, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "my-db-connection", connection_options = {"dbtable": "target_table_xy", "database": "db_target"}, transformation_ctx = "datasink4")
job.commit()
```
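
A minimal sketch of one possible pre-write workaround (an assumption on my part, not part of the generated job): compare the frame's columns with the target table's columns and issue ALTER TABLE for anything missing before the sink runs. The pymysql connection details are placeholders, and TEXT is used as a catch-all column type; a real job would map Glue types to proper MySQL types.

```python
import pymysql  # assumed to be supplied to the job, e.g. via --additional-python-modules

def add_missing_columns(frame, host, user, password, db, table):
    """Add any columns present in the frame but absent from the MySQL target."""
    df = frame.toDF()  # Spark DataFrame view of the DynamicFrame
    conn = pymysql.connect(host=host, user=user, password=password, database=db)
    try:
        with conn.cursor() as cur:
            # Columns currently present in the target table
            cur.execute(
                "SELECT column_name FROM information_schema.columns "
                "WHERE table_schema = %s AND table_name = %s",
                (db, table),
            )
            existing = {row[0].lower() for row in cur.fetchall()}
            for field in df.schema.fields:
                if field.name.lower() not in existing:
                    # TEXT is a placeholder type; choose per field.dataType in practice
                    cur.execute(
                        "ALTER TABLE `{}` ADD COLUMN `{}` TEXT".format(table, field.name)
                    )
        conn.commit()
    finally:
        conn.close()

# e.g. add_missing_columns(dropnullfields3, "<ec2-host>", "<user>", "<password>",
#                          "db_target", "target_table_xy") just before datasink4
```

The opposite direction would be to project the frame down to the target's existing columns instead (for example with SelectFields), so the write matches the current target schema, at the cost of silently dropping the new data.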
– Jake
  • Look at [this](https://i.imgur.com/KTSxhqU.jpg) image. Based on your use case, I think you should go with the second option. – Achyut Vyas Sep 21 '20 at 15:37
  • Can you update your question with the Glue script? – Prabhakar Reddy Sep 22 '20 at 04:14
  • @PrabhakarReddy updated with the Glue script – thanks! – Jake Sep 22 '20 at 08:17
  • @AchyutVyas that affects the source table's entry in the Data Catalog, but it has no effect on the target table. – Jake Sep 22 '20 at 08:20
  • Looks like you are using Redshift as the destination? – Prabhakar Reddy Sep 23 '20 at 04:21
  • @PrabhakarReddy - Actually it's MySQL on EC2, but you can consider the case with a Redshift destination as well. – Jake Sep 23 '20 at 06:24
  • So you want to write data with a schema different from the one already present in the MySQL table. – Prabhakar Reddy Sep 23 '20 at 10:29
  • @PrabhakarReddy - Yes: two (or more) source tables from one schema are joined and written into a single table in another schema, to get a denormalized structure that makes reporting easier. – Jake Sep 23 '20 at 10:55
  • Hey @Jake, I'm confused about the stage at which you get the error in the Glue job (column_5 not found). You said your source table (**sourc_table_y**) got updated; after that you reran the crawler, which changes the schema of your Data Catalog table **sourc_table_y**. Is the execution of the Glue job itself giving you the error, or are you referencing **target_table_xy** in another program and getting the error from there? – Achyut Vyas Sep 23 '20 at 14:13
  • @AchyutVyas - Let me clear up the confusion. The column was added in one of the source tables. The crawler for that source table detects the added column and updates the schema (the table in the Data Catalog) accordingly. Now, when I run the ETL job, it should ideally create one more column in the target (target_table_xy), but it says the column was not found. Obviously the column is absent from the target table, but the expectation is that it gets added automatically, since the ETL job has the right to alter the target table (remember, the ETL job is what created the target table on the first run). – Jake Sep 23 '20 at 15:19

0 Answers