
I have a bunch of existing PySpark scripts that I want to execute using AWS Glue. The scripts use APIs like SparkSession.read and various transformations on PySpark DataFrames.

I wasn't able to find docs outlining how to convert such a script. Do you have a hint or examples of where I could find more info? Thanks :)

2 Answers


A PySpark script should run as-is on AWS Glue, since Glue is basically Spark with some custom AWS libraries added. To start, I would just paste it into Glue and try to run it.

If you need Glue-specific functionality like dynamic frames or job bookmarks, you will need to modify the script to get a GlueContext and work with that. The basic initialization is:

from awsglue.context import GlueContext
from pyspark.sql import SparkSession

# reuse (or create) the SparkSession, then wrap its SparkContext in a GlueContext
spark_session = SparkSession.builder.getOrCreate()
glueContext = GlueContext(spark_session.sparkContext)

From here onwards, you can use glueContext for Glue features or spark_session for plain Spark functionality.
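For illustration, here is a minimal sketch of mixing the two; the S3 path and the catalog database/table names are placeholders, not part of the original answer:

from awsglue.context import GlueContext
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
glueContext = GlueContext(spark_session.sparkContext)

# plain Spark functionality: read JSON with the DataFrame API (placeholder path)
df = spark_session.read.json("s3://my-bucket/input/")

# Glue feature: read a table registered in the Glue Data Catalog (placeholder names)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)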

I would, however, avoid using Glue-specific features just for the sake of it, because:

  • it will reduce portability
  • community support for Spark is much better than for Glue
bzu

One approach: use the source/sink read/write APIs from AWS Glue and keep the DataFrame transformations as PySpark code. This enables "easy" integration with AWS services (e.g. S3, the Glue Data Catalog) and makes unit testing the DataFrame transformations simple (since this is well-trodden ground in PySpark).

Example:

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame, DynamicFrameWriter
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext


# init Glue context (and Spark context)
spark_context = SparkContext()
glue_context = GlueContext(spark_context)

# init Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME", "PARAM_1"])  # "PARAM_1" is an example job argument, passed as --PARAM_1 <value>
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# read from source (use Glue APIs)
dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={},  # for S3, "paths" is required, e.g. {"paths": ["s3://<bucket>/<prefix>/"]}
    format="json",
    format_options={},
)

# convert DynamicFrame to DataFrame
df = dynamic_frame.toDF()

# do DataFrame transformations (use Pyspark API)
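# for illustration only (hypothetical column names, not part of the original answer):
# df = df.filter(df["status"] == "active")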

# convert DataFrame back to DynamicFrame
dynamic_frame = DynamicFrame.fromDF(
    dataframe=df,
    glue_ctx=glue_context,
    name="dynamic_frame",  # fromDF() requires a name for the resulting DynamicFrame
)

# write to sink (use Glue APIs)
DynamicFrameWriter(glue_context).from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={},  # for S3, "path" is required, e.g. {"path": "s3://<bucket>/<prefix>/"}
    format="json",
    format_options={},
)

# commit job
job.commit()

There are different ways to organize this example code into classes and functions, etc. Do whatever is appropriate for your existing script.
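For example, one option is to extract the transformations into pure PySpark functions, which can then be unit tested with a local SparkSession and no Glue runtime. A minimal sketch (the transform logic and column names are hypothetical):

from pyspark.sql import DataFrame, SparkSession


def transform(df: DataFrame) -> DataFrame:
    # hypothetical transformation: keep only "active" records
    return df.filter(df["status"] == "active")


def test_transform():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [("a", "active"), ("b", "inactive")],
        ["id", "status"],
    )
    assert transform(df).count() == 1

This keeps the Glue-specific read/write at the edges of the job script, so everything in between remains plain PySpark.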

References:

  1. GlueContext class (AWS)
  2. DynamicFrame class (AWS)
  3. DynamicFrameWriter class (AWS)
Andrew Nguonly