
I have a messy data source in which the same value can arrive under either of two field names, and both should map to a single conformed field name in the output.

For example, the data source contains update_date or modified_date, and both should map to timestamp.

The two field names are never present on the same row of data.

The Glue script looks like this. Some lines have been elided for clarity:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Data Catalog table
DataCatalogtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="crawl_rawdata",
    transformation_ctx="DataCatalogtable_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=DataCatalogtable_node1,
    mappings=[
        ...
        ("update_date", "string", "timestamp", "string"),
        ...
        ("modified_date", "string", "timestamp", "string"),
        ...
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="orc",
    connection_options={
        "path": "s3://mybucket/data-lake/glue/",
        "compression": "snappy",
        "partitionKeys": [ ... ],
    },
    transformation_ctx="S3bucket_node3",
)

job.commit()

How can I map both source fields to the single timestamp output field?

Alex R

1 Answer


Could you share the line that triggers this error? You can find it in the AWS logs. I ran into the same error. In my case, the fix was simply to pass the list of strings directly instead of passing a variable that holds that list. Here is the relevant piece of code:

# This raises IllegalArgumentException: Duplicate value for path
list_of_columns = ["col_a", "col_b"]
data.select_fields(list_of_columns)

# This raises no error
data.select_fields(["col_a", "col_b"])
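
As for the question itself (two source names, one output field), one option that might be worth trying is to fold the two fields into one before ApplyMapping runs, so the mapping list only needs a single timestamp entry. This is a minimal sketch, not something confirmed in this thread. The helper name conform_timestamp is hypothetical, and it assumes each record can be treated like a Python dict, as in the AWS examples for the Map transform. The function is plain Python, so it can be checked on ordinary dicts first:

```python
def conform_timestamp(record):
    """Fold update_date / modified_date into a single timestamp field.

    Hypothetical helper: the field names come from the question, the
    function itself is only an illustration.
    """
    value = None
    for source_name in ("update_date", "modified_date"):
        if source_name in record:
            candidate = record[source_name]
            # Drop the source field so only the conformed name remains.
            del record[source_name]
            # The question says both names never appear on the same row;
            # if they ever did, the first one listed (update_date) wins.
            if value is None:
                value = candidate
    # Only add the conformed field when one of the source fields was found.
    if value is not None:
        record["timestamp"] = value
    return record


# Assumed wiring inside the Glue job (not run here, needs the Glue runtime):
#   Conformed_node = Map.apply(frame=DataCatalogtable_node1, f=conform_timestamp)
# ApplyMapping would then take frame=Conformed_node and list a single
#   ("timestamp", "string", "timestamp", "string")
# entry in place of the two entries that share the same target name.
```

Rows that carry neither field are passed through unchanged, so ApplyMapping would see timestamp as missing on those rows.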
Ktos