4

I created an AWS Glue crawler and job. The purpose is to transfer data from a PostgreSQL RDS database table to one single .csv file in S3. Everything is working, but I get a total of 19 files in S3. Every file is empty except three, which contain one row of the database table along with the headers. So every row of the database is written to a separate .csv file. What can I do to specify that I want only one file, where the first row contains the headers and every row of the database follows after that?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("password", "string", "password", "string"), ("user_id", "string", "user_id", "string"), ("username", "string", "username", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("user_id", "string", "user_id", "string"), ("username", "string", "username", "string"),("password", "string", "password", "string")], transformation_ctx = "applymapping1")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2"]
## @return: datasink2
## @inputs: [frame = applymapping1]
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")
job.commit()

The database looks like this: [database table screenshot]

It looks like this in S3: [S3 bucket screenshot]

One sample .csv in S3 looks like this:

password,user_id,username
346sdfghj45g,user3,dieter

As I said, there is one file for each table row.

Edit: The multipart upload to S3 doesn't seem to work correctly. It just uploads the parts but doesn't merge them together when finished. Here are the last lines of the job log:

19/04/04 13:26:41 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
19/04/04 13:26:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/04/04 13:26:41 INFO Executor: Finished task 16.0 in stage 2.0 (TID 18). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:false s3://bucketname/run-1554384396528-part-r-00018
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00017
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:false s3://bucketname/run-1554384396528-part-r-00019
19/04/04 13:26:41 INFO Executor: Finished task 17.0 in stage 2.0 (TID 19). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00018
19/04/04 13:26:41 INFO Executor: Finished task 18.0 in stage 2.0 (TID 20). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO MultipartUploadOutputStream: close closed:true s3://bucketname/run-1554384396528-part-r-00019
19/04/04 13:26:41 INFO Executor: Finished task 19.0 in stage 2.0 (TID 21). 2346 bytes result sent to driver
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver from 172.31.20.76:39779 disconnected during shutdown
19/04/04 13:26:41 INFO CoarseGrainedExecutorBackend: Driver from 172.31.20.76:39779 disconnected during shutdown
19/04/04 13:26:41 INFO MemoryStore: MemoryStore cleared
19/04/04 13:26:41 INFO BlockManager: BlockManager stopped
19/04/04 13:26:41 INFO ShutdownHookManager: Shutdown hook called
End of LogType:stderr
  • more info: Found out that these are not .csv files yet. These are parts from the S3 upload, because the job uses multipart upload. But why doesn't it merge all the parts when it finishes uploading? – Apr 04 '19 at 12:45
  • I might be wrong but it might be because of the number of Spark executors. Look at the cluster and see how many executors you have. There should be a function named `collect` that will allow you to collect all the results and output into a single file. That's not recommended if the output is large though – simplycoding Apr 04 '19 at 13:48
  • Can you post your final table definition from the Data Catalog? – Aida Martinez Apr 05 '19 at 18:00
  • @simplycoding can you tell me more about the collect function? How do I call it? –  Apr 09 '19 at 07:03
  • @AidaMartinez Here you go: https://imgur.com/vE3tSFL –  Apr 09 '19 at 07:04

3 Answers

7

Can you try the following?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "gluedatabse", table_name = "postgresgluetest_public_account", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("user_id", "string", "user_id", "string"), ("username", "string", "username", "string"),("password", "string", "password", "string")], transformation_ctx = "applymapping1")

## Force one partition, so it can save only 1 file instead of 19
repartition = applymapping1.repartition(1)

datasink2 = glueContext.write_dynamic_frame.from_options(frame = repartition, connection_type = "s3", connection_options = {"path": "s3://BUCKETNAMENOTSHOWN"}, format = "csv", transformation_ctx = "datasink2")
job.commit()

Also, if you want to check how many partitions you currently have, you can try the following code. I'm guessing there are 19, which is why it saves 19 files back to S3:

from awsglue.dynamicframe import DynamicFrame

## Change to a PySpark DataFrame
dataframe = DynamicFrame.toDF(applymapping1)
## Print the number of partitions
print(dataframe.rdd.getNumPartitions())
## Change back to a DynamicFrame
datasink2 = DynamicFrame.fromDF(dataframe, glueContext, "datasink2")
Aida Martinez
  • THANK YOU! The file is still called "run-1554901534650-part-r-00000" but it contains all rows now. That's a big step forward. One last question: is it possible to add the .csv extension to this file somehow? – Apr 10 '19 at 13:09
  • I think something weird is going on. It should save the file as a .csv. See this example: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html I use the Spark methods to write because they are way more flexible, so you can try that as well: `datasink4.write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save("s3://your-bucket-name")` – Aida Martinez Apr 10 '19 at 16:15
  • 3
    Found a solution. I don't know if it is intended like this, but it seems to work. I removed the datasink2 line and added the following line after the repartition: `repartition.toDF().write.mode("overwrite").format("csv").save("s3://BUCKETNAME/subfolder")` Now it is a .csv file (sketched just below this comment thread) – Apr 11 '19 at 08:26
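
For anyone landing here later, a minimal sketch of what that last comment describes, reusing applymapping1 from the answer above. The header option and the exact S3 prefix are assumptions here, not taken from the original comment:

## Collapse to a single partition, then use Spark's DataFrame writer so the
## output object is written as CSV with a .csv extension and a header row.
repartition = applymapping1.repartition(1)

repartition.toDF() \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .format("csv") \
    .save("s3://BUCKETNAME/subfolder")

Note that Spark still chooses the object name (something like part-00000-<id>.csv inside that prefix); if you need an exact file name you would have to copy or rename the object afterwards.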
1

For those who are looking for a solution that keeps the visual editor, just use a Custom Transform node with this code:

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
  ## Take the first (and only) frame out of the incoming collection
  df = dfc.select(list(dfc.keys())[0]).toDF()
  ## Collapse to a single partition so only one output file is written
  df_reparted = df.repartition(1)
  dyf_reparted = DynamicFrame.fromDF(df_reparted, glueContext, "repart")
  return DynamicFrameCollection({"CustomTransform0": dyf_reparted}, glueContext)

This will output a collection containing a single DynamicFrame, which you can chain with a Select From Collection node, as sketched below.
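
If you want to see what that looks like in script form rather than in the visual editor, here is a rough sketch; the variable names, the source frame, and the key "CustomTransform0" are illustrative, since Glue Studio generates its own boilerplate:

from awsglue.transforms import SelectFromCollection
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

## Wrap the incoming frame in a collection, run the custom transform,
## then pick the single repartitioned frame back out of the returned collection.
transformed = MyTransform(glueContext, DynamicFrameCollection({"source": applymapping1}, glueContext))
reparted_dyf = SelectFromCollection.apply(dfc=transformed, key="CustomTransform0", transformation_ctx="select0")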

hergabr
0

If these are not big datasets, then you could consider converting the Glue DynamicFrame (glue_dyf) to a Spark DataFrame (spark_df), and then the Spark DataFrame to a pandas DataFrame (pandas_df), as below:

from awsglue.dynamicframe import DynamicFrame

spark_df = DynamicFrame.toDF(glue_dyf)   ## Glue DynamicFrame -> Spark DataFrame
pandas_df = spark_df.toPandas()          ## collect the rows into a pandas DataFrame on the driver
pandas_df.to_csv("s3://BUCKETNAME/subfolder/FileName.csv", index=False)

In this method, you need not worry about repartitioning for small volumes of data, since toPandas() collects all rows onto the driver anyway. For large datasets, follow the previous answers and leverage the Glue workers and Spark partitions.
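
If writing directly to an s3:// path from pandas does not work in your Glue environment (it relies on s3fs being installed), a possible alternative is to serialize the CSV in memory and upload it with boto3, which is available in Glue jobs. The bucket and key below are placeholders:

import io
import boto3

## Serialize the pandas DataFrame to CSV in memory, then upload it so the
## object gets exactly the key/name you want.
csv_buffer = io.StringIO()
pandas_df.to_csv(csv_buffer, index=False)
boto3.client("s3").put_object(
    Bucket="BUCKETNAME",
    Key="subfolder/FileName.csv",
    Body=csv_buffer.getvalue().encode("utf-8"))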

Akhil R