
I have 3 questions, for the following context: I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.

Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.

Q2) Once my data is processed, the Glue job automatically generates names for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object names themselves? If so, how? I cannot find an option for this.

Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically split the output into 20 files with 5 rows in each. How can I specify the batch size in a job?

Thanks in advance

1 Answer

  1. Glue supports a pushdown predicates feature; however, it currently works only with data partitioned on S3. There is a feature request to support it for JDBC connections, though.
  2. It's not possible to specify the names of the output files, but you can rename them after the job writes them. Note that a rename on S3 is really a copy to a new key followed by a delete of the original, so it is a costly and non-atomic operation; see the first sketch after this list.
  3. You can't really control the size of the output files. You can cap the number of files by collapsing partitions with coalesce, and starting from Spark 2.2 you can cap the number of records per file by setting the config spark.sql.files.maxRecordsPerFile; see the second sketch after this list.
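
As a rough illustration of the rename-by-copy approach from point 2, here is a minimal sketch using boto3. The bucket name, prefix, and target naming scheme are all hypothetical, and a real job would need to paginate for prefixes holding more than 1,000 objects:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-output-bucket"        # hypothetical bucket
prefix = "glue-output/run-1/"      # hypothetical prefix the Glue job wrote to

# List the part files the job produced (list_objects_v2 returns at most
# 1,000 keys per call; paginate for larger outputs)
listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

for i, obj in enumerate(listing.get("Contents", [])):
    key = obj["Key"]
    if key.endswith("_SUCCESS"):   # skip Spark's marker file
        continue
    new_key = f"{prefix}my-table-part-{i:04d}.json"  # desired object name
    # S3 has no rename: "renaming" is a copy followed by a delete,
    # so it is neither free nor atomic
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": key},
                   Key=new_key)
    s3.delete_object(Bucket=bucket, Key=key)
```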
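And for point 3, a minimal sketch of both knobs inside a Glue job, assuming the usual `glueContext` and `spark` objects from the Glue script boilerplate and a DynamicFrame `dyf` that has already been read; the path and numbers are illustrative:

```python
from awsglue.dynamicframe import DynamicFrame

# Spark 2.2+: cap how many records end up in any single output file
spark.conf.set("spark.sql.files.maxRecordsPerFile", 100000)

# coalesce(n) collapses the data into at most n partitions, which puts
# an upper bound on the number of files written
df = dyf.toDF().coalesce(4)
coalesced = DynamicFrame.fromDF(df, glueContext, "coalesced")

glueContext.write_dynamic_frame.from_options(
    frame=coalesced,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/glue-output/"},  # hypothetical path
    format="json",
)
```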
Yuriy Bondaruk
  • Our company is considering using AWS Glue as a primary ETL tool for a data warehouse project. You seem to have experience with Glue. It is interesting to me how an ETL tool cannot filter source data; it seems wrong from an architecture standpoint. Nobody wants to load 100 million rows into memory every single time. What is your opinion on this? The sources are RDBMSs: SQL Server and PostgreSQL. Thanks – Feb 14 '19 at 20:19
  • I completely agree that the pushdown predicate feature should work for all types of data source, especially SQL sources, since it would be quite easy to implement (just add a WHERE clause to the generated query). There is an alternative though: [load data using Spark](https://stackoverflow.com/questions/32573991/does-spark-predicate-pushdown-work-with-jdbc); a sketch follows below. – Yuriy Bondaruk Feb 14 '19 at 20:57