1

Reading through the AWS Glue Python ETL documentation, I can't tell if there is a way to provide an explicit schema when using the following DynamicFrameReader method to read JSON files from S3:

create_dynamic_frame_from_options()

Additionally, is using the DynamicFrameReader class specified above a requirement for bookmarking?

The reason I ask is that I could always read with vanilla PySpark and pass in the schema that way, but I'm not totally sure bookmarking will work without using Glue functions.
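For reference, the vanilla PySpark route I mean is roughly this (the path and fields are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema and path, just to illustrate passing an explicit schema.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

df = spark.read.schema(schema).json("s3://my-bucket/my-prefix/")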

moku

3 Answers

1

I've been at this for a straight month now. Given that you need to:

  • use the bookmark feature
  • and get data from S3 which has no header information

I am sorry to say you have run into a dead end. There is just no option to supply a schema while reading CSV or JSON files, or even after the fact (using the Glue API). Please comment if this is no longer true.

As @Aida Martinez mentioned, you can use crawlers to create the schema (table), or create the table manually, either from the Glue console or by running a "Create table..." script from Athena.

Contrary to @Aida Martinez's comment, I believe that if you are taking in files from Kafka Connect, bookmarking will work, because the files on S3 are timestamped with their date of creation/modification. Glue bookmarks take this as the default bookmarking key, so as long as you've set the transformation_ctx, bookmarking is possible.
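For illustration, a minimal sketch of that kind of S3 read with transformation_ctx set (the bucket path is a placeholder); note there is still no schema argument to pass here:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# No schema option exists on this call; transformation_ctx is what bookmarking keys off.
datasource = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/kafka-connect-output/"]},
    format="json",
    transformation_ctx="datasource",
)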

0

It's unclear what type of file you're dealing with. If it's a CSV, Glue should be able to infer the schema from the header, provided you have given it the right format options.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html

If inferring the schema won't work for you, a way that should work for any data set is to create a Glue Data Catalog database and tables. If a crawler will work, that's probably the easiest way to create (and maintain) that schema. However, if you are unable to use a crawler, it is also possible to create tables and their schemas manually. Then you can use create_dynamic_frame_from_catalog, and when the DynamicFrame is created, the schema from the Data Catalog will be used.
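A rough sketch of that catalog-backed read could look like this (database and table names are placeholders for whatever the crawler or manual definition created):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The DynamicFrame picks up the schema defined on the catalog table.
datasource = glue_context.create_dynamic_frame_from_catalog(
    database="my_database",
    table_name="my_table",
    transformation_ctx="datasource",
)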

You are correct in assuming that bookmarking will not work without using Glue functions.

  • The data is JSON and in my case inferring the schema won't work because there are a few fields that are free form. I ended up grabbing the schema from glue catalog with boto3 and then building a DDL string and using `_parse_datatype_string` from `pyspark.sql.types` module to get a struct schema to pass to `spark.read.schema().json()` – moku Jun 07 '19 at 17:24
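A rough sketch of the workaround described in the comment above, assuming the catalog table already exists and that its column types parse as Spark DDL (complex Hive types may need translating):

import boto3
from pyspark.sql import SparkSession
from pyspark.sql.types import _parse_datatype_string

spark = SparkSession.builder.getOrCreate()
glue = boto3.client("glue")

# Placeholders for the catalog database/table and the S3 path.
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# Build a DDL string like "col1 string, col2 int" from the catalog columns.
ddl = ", ".join(
    "{} {}".format(col["Name"], col["Type"])
    for col in table["StorageDescriptor"]["Columns"]
)
schema = _parse_datatype_string(ddl)

df = spark.read.schema(schema).json("s3://my-bucket/my-prefix/")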
0

When using the DynamicFrameReader against a Redshift/JDBC source, you can specify the schema as part of the dbtable parameter provided in connection_options, like this:

datasource0 = glueContext.create_dynamic_frame.from_options(
    "redshift",
    {
        "url": "jdbc-url/database",
        "user": "username",
        "password": "password",
        "dbtable": "schema.table-name",
        "redshiftTmpDir": "s3-tempdir-path",
    },
    transformation_ctx="datasource0",
)

In order for bookmarks to work, you need to use the AWS Glue methods and define the transformation_ctx. In the documentation you will find the following:

For job bookmarks to work properly, enable the job bookmark parameter and set the transformation_ctx parameter. If you don't pass in the transformation_ctx parameter, then job bookmarks are not enabled for a dynamic frame or a table used in the method.

Be aware that job bookmarks only work for S3 data sources and a limited set of use cases for relational databases.
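For completeness, the bookmark plumbing is roughly the standard Glue job boilerplate below, with bookmarks enabled on the job itself (the --job-bookmark-option job-bookmark-enable parameter) and job.commit() called at the end so the bookmark state is saved:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... reads and writes with transformation_ctx set on each ...

job.commit()  # persists bookmark state for the next run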

Aida Martinez
  • Thank you for the comment, however my data source is JSON files in S3, so I still don't believe there is a way to add a schema. – moku Jun 13 '19 at 15:47
  • Yes, that's right. You would need to use AWS Glue Crawlers to have the schema created in AWS Glue Data Catalog. Then you can edit that schema and use the `DataFrameReader.fromCatalog` method. – Aida Martinez Jun 13 '19 at 15:58
  • We define our schema beforehand using terraform to populate the glue catalog. My source data is not partitioned and is streaming in from Kafka Connect so I have no way to limit the files I read in using the `fromCatalog` method. Or do I? – moku Jun 13 '19 at 16:01
  • No, I don't think you can limit them. Using AWS Glue methods doesn't give you flexibility. If the files are not partitioned the way it's expected (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html) then the bookmarks won't work either. – Aida Martinez Jun 13 '19 at 16:22