How to merge CSV files from an S3 bucket and save the result back into S3 using AWS Glue

Question

The objective is to transform data (CSV files) from one S3 bucket to another S3 bucket using Glue.

What I already tried:

I created a CSV classifier. I created a crawler which scans the data coming into the S3 bucket.

Where I am stuck:

I cannot find how to store the output in S3 again without saving it to RDS or another database service. The Glue output asks for a database, which I don't have and don't want to use.

Is there any way I can achieve the goal without using any other DB system, with just S3 and Glue?

More Information

Sample single CSV file I am trying to merge

Classifier with delimiter of ";"

Crawler Configuration

Crawler Result (No schema detected)

@PrabhakarReddy I have 1 row in each file. My goal is to merge all these single row files and create a merged file (after adding header). — Kumar Vivek, Sep 10 '20 at 11:40

score 0 · Answer 1 · answered Sep 10 '20 at 12:05

The Glue crawler reports the schema as UNKNOWN because of the number of rows present in the source files. Refer to the section Built-In CSV Classifier in this doc, which applies in your case.

According to the doc, to be classified as CSV, the table schema must have at least two columns and two rows of data.
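To see why single-row files fall short of that requirement, here is a rough local check in Python. It is only a sketch: the function name `meets_min_shape` and the sample rows are made up for illustration, it covers just the two-column, two-row rule quoted above, and the crawler's actual classifier applies further heuristics that are not reproduced here. Counting every non-empty line as a row is also an assumption.

```python
import csv
import io


def meets_min_shape(csv_text, delimiter=";"):
    """Return True if csv_text has at least two rows and every row has
    at least two columns.

    Hypothetical helper: it mirrors only the single rule quoted from the
    docs, not the crawler's real classification logic.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if len(rows) < 2:
        return False
    return all(len(row) >= 2 for row in rows)


# Made-up sample content, one file with a single row and one with two rows.
single_row = "2020-09-10;sensor-1;42\n"
two_rows = "2020-09-10;sensor-1;42\n2020-09-11;sensor-2;17\n"

print(meets_min_shape(single_row))  # False
print(meets_min_shape(two_rows))    # True
```

A file like the first one has enough columns but only one row, which matches the situation described in the question.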

In your case you can use AWS Glue job and read files directly from S3 using either of below ways:

1. Create a DynamicFrame and pass the separator as ; in format_options. Below is a sample which you can modify according to your needs.

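# glueContext is a GlueContext instance; InputDir is the S3 path of the source files.
dyF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputDir]},
    format="csv",
    # Set withHeader to False if the files have no header row.
    format_options={"withHeader": True, "separator": ";", "quoteChar": '"', "escaper": '"'},
    transformation_ctx="taxidata",
)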

2. Use a Spark DataFrame to read the data from S3, and then convert it to a DynamicFrame if you want to leverage Glue's native transformations:

df = spark.read.options(delimiter=';').csv("s3://path-to-files/")

If you want to merge files with different schemas, read the data for each schema into its own frame and then merge the frames using a Join operator.

Refer to this, which has example code to join and write data back to S3.
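For the case described in the question (many single-row files that share one layout, with a header added after merging), a Glue Spark job along the following lines might be enough. This is a sketch under assumptions, not a tested job: the function names and the job parameters `--input_path`, `--output_path` and `--columns` are hypothetical, the files are assumed to have no header row, and the `awsglue` modules are imported inside `run_job` because they are available only in the Glue runtime.

```python
def parse_columns(raw):
    """Split a comma-separated list of header names, e.g. "id,name,value"."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def merge_csv(spark, input_path, output_path, column_names, delimiter=";"):
    """Read all header-less CSV files under input_path and write one CSV."""
    df = (
        spark.read
        .option("delimiter", delimiter)
        .option("header", "false")
        .csv(input_path)
    )
    # Attach the header names; their count must match the number of columns.
    df = df.toDF(*column_names)
    (
        df.coalesce(1)  # one partition, so Spark writes a single part file
        .write
        .mode("overwrite")  # replaces anything already under output_path
        .option("header", "true")
        .option("delimiter", delimiter)
        .csv(output_path)
    )


def run_job(argv):
    """Entry point for a Glue Spark job (hypothetical parameter names)."""
    # These modules exist only inside the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(
        argv, ["JOB_NAME", "input_path", "output_path", "columns"]
    )
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)
    merge_csv(
        glue_context.spark_session,
        args["input_path"],
        args["output_path"],
        parse_columns(args["columns"]),
    )
    job.commit()


# In the Glue script, finish with: import sys; run_job(sys.argv)
```

Spark writes the result as a directory containing a single `part-*.csv` file rather than a file with a name of your choosing, so a rename step would be needed if a specific file name matters. No database or crawler is involved in this flow: the job reads from one S3 path and writes to another.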

Where do I need to write this code? Should it be in the AWS Glue dashboard? — Kumar Vivek, Sep 10 '20 at 12:10
You should read this: https://docs.aws.amazon.com/glue/latest/dg/author-job.html — Prabhakar Reddy, Sep 10 '20 at 12:13
Thanks, going through it now. But do you believe the use case is possible? Running a Glue job over an S3 bucket, then merging the files and saving them into another bucket, without using any other service, just Glue? — Kumar Vivek, Sep 10 '20 at 12:15
