
I was able to create a small Glue job to ingest data from one S3 bucket into another, but I'm not clear about a few of the last lines in the code (below).

applymapping1 = ApplyMapping.apply(frame = datasource_lk, mappings = [("row_id", "bigint", "row_id", "bigint"), ("Quantity", "long", "Quantity", "long"), ("Category", "string", "Category", "string")], transformation_ctx = "applymapping1")

selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["row_id", "Quantity", "Category"], transformation_ctx = "selectfields2")

resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mydb", table_name = "order_summary_csv", transformation_ctx = "resolvechoice3")

datasink4 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice3, database = "mydb", table_name = "order_summary_csv", transformation_ctx = "datasink4")
job.commit()
  1. From the above code snippet, what is the use of 'ResolveChoice'? Is it mandatory?
  2. When I ran this job, it created a new folder and a file (with a random file name) in the destination (order_summary.csv) and ingested the data there, instead of ingesting directly into my order_summary_csv table (a CSV file) residing in the S3 folder. Is it possible for Spark (Glue) to ingest data into a specific CSV file?

2 Answers


I think this ResolveChoice.apply call is out of date, since there is no choice value like "MATCH_CATALOG" in the doc:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html

The general idea behind ResolveChoice is that if you have a field containing both int values and string values, you have to resolve how that field should be handled (see the sketch after this list):

  1. Cast it to Int
  2. Cast it to String
  3. Leave both and create two columns in the resulting dataset
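
These three options correspond to the specs parameter of ResolveChoice.apply. A minimal sketch, assuming dyf is a DynamicFrame whose price column came through as a choice of int and string (both names are made up for illustration):

from awsglue.transforms import ResolveChoice

# 1. Cast everything to int (values that cannot be cast become null)
resolved_int = ResolveChoice.apply(frame = dyf, specs = [("price", "cast:int")])

# 2. Cast everything to string
resolved_str = ResolveChoice.apply(frame = dyf, specs = [("price", "cast:string")])

# 3. Keep both types by splitting into two columns (price_int and price_string)
resolved_cols = ResolveChoice.apply(frame = dyf, specs = [("price", "make_cols")])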

- You can't write a Glue DynamicFrame/DataFrame to CSV with a specific file name, because behind the scenes Spark writes each partition out under a randomly generated part-file name.
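
If a single output file is acceptable, a common workaround is to convert the DynamicFrame to a Spark DataFrame, coalesce it to one partition, write it out, and then copy the generated part file to the name you want. A rough sketch; the bucket and key names below are placeholders, not from the original job:

import boto3

# Write a single CSV part file (Spark still picks a name like part-00000-<uuid>.csv)
df = resolvechoice3.toDF()
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("s3://my-bucket/tmp_output/")

# Copy the generated part file to the desired key using boto3
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket = "my-bucket", Prefix = "tmp_output/")["Contents"]
part_key = next(obj["Key"] for obj in objects if obj["Key"].endswith(".csv"))
s3.copy_object(Bucket = "my-bucket", CopySource = {"Bucket": "my-bucket", "Key": part_key}, Key = "order_summary.csv")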

- ResolveChoice is useful when your DynamicFrame has a column whose records carry different data types. Unlike a Spark DataFrame, a Glue DynamicFrame doesn't fall back to string as the default type; it retains both types as a choice type. With ResolveChoice you can pick the type the column should ideally have, and records whose values can't be cast to that type are set to null.
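
For instance, if the catalog table declares Quantity as long, the call below (a spec-based variant of the resolvechoice3 step in the question, shown only for illustration) casts the choice column to long; any value that can't be cast ends up as null:

from awsglue.transforms import ResolveChoice

# Per-column resolution: cast Quantity to long, nulling values that don't fit that type
resolved = ResolveChoice.apply(frame = selectfields2, specs = [("Quantity", "cast:long")], transformation_ctx = "resolvechoice_alt")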