0

I'm using Apache Iceberg using Spark SQL. For some reasons, the SQL is returning null for my lag function. My theory is that during the process behind the scene, Spark tries to parallelised the task so the data becomes to small to produce any results in lag() function. If that even make sense. Correct me if I'm way off.

Example of my query:

INSERT INTO
    my_iceberg.temp_table_1 (
        date,
        region,
        country,
        total
    )
SELECT
    date,
    region,
    country,
    SUM(profit) as total # complex transformation
FROM
   my_iceberg.original_table


# Rewrite into a single file to be used below

CALL {spark_catalog}.system.rewrite_data_files(
    table => 'my_iceberg.temp_table_1'
    ,options => map(
        'min-input-files','1'
        ,'partial-progress.enabled','true'
    )
);


INSERT INTO
    my_iceberg.temp_table_2 (
        date,
        region,
        country,
        total
    )
SELECT
    date,
    region,
    country,
    total,
    lag(total, 12) OVER (
        PARTITION BY region,
        country
        ORDER BY
            date
    ) total_amount
FROM
    my_iceberg.temp_table_1;

# Rewrite again
CALL {spark_catalog}.system.rewrite_data_files(
    table => 'my_iceberg.temp_table_2'
    ,options => map(
        'min-input-files','1'
        ,'partial-progress.enabled','true'
    )
);

My idea:

  1. Separate into TWO temporary table where (1) for transformation, (2) for lag function
  2. Make temp_table_1 into a single file so that ALL values will be available for temp_table_2 to perform lag().

However, it still returns null.

user6308605
  • 693
  • 8
  • 26

0 Answers0