
I am attempting to build a Glue job that executes a SQL query against an existing Glue catalog table and stores the results in another Glue catalog (in the example below, returning only the record with the highest price for each value of `sn`). When the source data is CSV, however, the Spark query includes the header row in the results. This does not happen when the source is Parquet. The source table's SerDe parameters include `skip.header.line.count = 1`, and running the same query against the source data through Athena does not return the header.

Is there a way to explicitly tell Spark to ignore header rows when using `.sql()`?

Here is the essence of the Python code my Glue job is executing:

        from pyspark.context import SparkContext
        from pyspark.sql.session import SparkSession

        glue_source_database_name = 'source_database'
        glue_destination_database_name = 'destination_database'
        table_name = 'diamonds10_csv'
        partition_count = 5
        s3_output_path = 's3://an-output-bucket/prefix'  # placeholder; the real job supplies this

        merge_query = '''
            SELECT SEQ.`sn`, SEQ.`carat`, SEQ.`cut`, SEQ.`color`, SEQ.`clarity`, SEQ.`depth`,
                   SEQ.`table`, SEQ.`price`, SEQ.`x`, SEQ.`y`, SEQ.`z`
            FROM (
                SELECT SUB.`sn`, SUB.`carat`, SUB.`cut`, SUB.`color`, SUB.`clarity`, SUB.`depth`,
                       SUB.`table`, SUB.`price`, SUB.`x`, SUB.`y`, SUB.`z`,
                       ROW_NUMBER() OVER ( PARTITION BY SUB.`sn` ORDER BY SUB.`price` DESC ) AS test_diamond
                FROM `diamonds10_csv` AS SUB
            ) AS SEQ
            WHERE SEQ.test_diamond = 1
        '''


        spark_context = SparkContext.getOrCreate()
        spark = SparkSession(spark_context)
        spark.sql(f'USE {glue_source_database_name}')

        targettable = spark.sql(merge_query)
        (targettable
            .repartition(partition_count)
            .write
            .option("path", f'{s3_output_path}/{table_name}')
            .mode("overwrite")
            .format("parquet")
            .saveAsTable(f'`{glue_destination_database_name}`.`{table_name}`'))
Brandon