One approach: use the source/sink read/write APIs from AWS Glue, and keep the DataFrame transformations as plain PySpark code. This gives "easy" integration with AWS services (e.g. S3, the Glue Data Catalog), and it keeps unit testing of the DataFrame transformations simple, since testing plain PySpark code is a well-known practice and needs no Glue runtime.
Example:
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameWriter
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
# init Glue context (and Spark context)
spark_context = SparkContext()
glue_context = GlueContext(spark_context)
# init Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME", "PARAM_1"])
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
# read from source (use Glue APIs)
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={},  # for an S3 source, e.g. {"paths": ["s3://..."]}
    format="json",
    format_options={},
)
# convert DynamicFrame to DataFrame
df = dynamic_frame.toDF()
# do DataFrame transformations (use Pyspark API)
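# e.g. (hypothetical column names, shown only to illustrate where the PySpark logic goes):
# from pyspark.sql import functions as F
# df = df.filter(F.col("status") == "active")
# df = df.withColumn("ingested_at", F.current_timestamp())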
# convert DataFrame back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(
    dataframe=df,
    glue_ctx=glue_context,
    name="dynamic_frame",  # fromDF() requires a name (used for informational/error messages)
)
# write to sink (use Glue APIs)
# (equivalently: glue_context.write_dynamic_frame.from_options(...))
DynamicFrameWriter(glue_context).from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={},  # for an S3 sink, e.g. {"path": "s3://..."}
    format="json",
    format_options={},
)
# commit job
job.commit()
There are different ways to organize this example code into classes and functions, etc. Do whatever is appropriate for your existing script.
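To make the unit-testing point concrete, here is a minimal sketch (the module, function, and column names are made up for illustration): keep the transformations in a plain function that takes and returns a DataFrame, then test that function locally with pytest and a SparkSession, with no Glue dependencies at all.

# transformations.py (hypothetical module holding the pure PySpark logic)
from pyspark.sql import DataFrame, functions as F

def transform(df: DataFrame) -> DataFrame:
    # example transformation; replace with your real logic
    return df.withColumn("name_upper", F.upper(F.col("name")))

# test_transformations.py (plain pytest, runs locally without Glue)
from pyspark.sql import SparkSession
from transformations import transform

def test_transform_adds_uppercase_name():
    spark = SparkSession.builder.master("local[1]").appName("test").getOrCreate()
    input_df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    result = transform(input_df).collect()
    assert sorted(row["name_upper"] for row in result) == ["ALICE", "BOB"]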
References:
- GlueContext class (AWS)
- DynamicFrame class (AWS)
- DynamicFrameWriter class (AWS)