3

I'm writing a Python Transform and need to get the SparkSession so I can construct a DataFrame.

How should I do this?

hjones

1 Answer

2

You can add `ctx` as a parameter of your compute function. The transform then receives a TransformContext, and its `spark_session` attribute gives you the SparkSession.

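from pyspark.sql.types import StringType, StructField, StructType
from transforms.api import Output, transform


@transform(
    output=Output('/path/to/first/output/dataset'),
)
def my_compute_function(ctx, output):
    # type: (TransformContext, TransformOutput) -> None

    # In this example, the Spark session is used to create an empty data frame.
    # The schema has a single nullable string column.
    columns = [
        StructField("col_a", StringType(), True)
    ]
    empty_df = ctx.spark_session.createDataFrame([], schema=StructType(columns))

    # Write the DataFrame to the transform's output dataset.
    output.write_dataframe(empty_df)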

This example can also be found in the Foundry documentation here: https://www.palantir.com/docs/foundry/transforms-python/transforms-python-api/#transform
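
As a further sketch (not taken from the Foundry documentation), the DataFrame construction can live in a small helper that takes the session as an argument. The helper name `build_single_column_df` and the column name `col_a` are only illustrative assumptions. The schema is given as a DDL string, which `createDataFrame` accepts in place of a `StructType`.

```python
def build_single_column_df(spark_session, values):
    """Build a DataFrame with one string column, 'col_a', from an iterable of strings.

    `spark_session` is expected to be a SparkSession, for example
    `ctx.spark_session` inside a transform. (Hypothetical helper.)
    """
    # Each row must be a tuple, even when there is only one column.
    rows = [(value,) for value in values]
    # A DDL-formatted string is accepted as the schema.
    return spark_session.createDataFrame(rows, schema="col_a string")


# Possible use inside the compute function shown above:
#     df = build_single_column_df(ctx.spark_session, ["foo", "bar"])
#     output.write_dataframe(df)
```

Because the session is passed in explicitly, the same helper should work with whatever SparkSession the caller supplies, for instance a local one in a unit test.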

hjones
tomwhittaker
  • You were faster than me, I was literally copying some code over. Thanks for answering – fmsf May 06 '22 at 10:17
  • Calling `SparkSession.getActiveSession()` will also work in most cases, but explicitly using the Transform's Spark context as you suggest will avoid potential issues if your Transform sets another SparkSession up manually. – hjones May 06 '22 at 11:14