I'm using Apache Iceberg with Spark SQL. For some reason, the SQL is returning NULL for my lag() function. My theory is that, behind the scenes, Spark parallelises the task, so each chunk of data becomes too small for lag() to produce any result. If that even makes sense. Correct me if I'm way off.
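To illustrate what I mean by "too small": on a toy dataset (made-up values, not my real tables), lag(total, 12) is NULL on every row simply because no row has 12 predecessors:
SELECT
  date,
  total,
  lag(total, 12) OVER (ORDER BY date) AS total_lag_12
FROM VALUES
  (DATE '2024-01-01', 10),
  (DATE '2024-02-01', 20),
  (DATE '2024-03-01', 30) AS t(date, total);
-- total_lag_12 is NULL for all three rows: only 3 rows exist in the window.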
Example of my query:
INSERT INTO my_iceberg.temp_table_1 (
  date,
  region,
  country,
  total
)
SELECT
  date,
  region,
  country,
  SUM(profit) AS total -- complex transformation
FROM
  my_iceberg.original_table
GROUP BY
  date,
  region,
  country;

-- Rewrite into a single file to be used below
CALL {spark_catalog}.system.rewrite_data_files(
  table => 'my_iceberg.temp_table_1',
  options => map(
    'min-input-files', '1',
    'partial-progress.enabled', 'true'
  )
);
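
-- Optional sanity check (my assumption: the current catalog resolves
-- my_iceberg.temp_table_1): the Iceberg `files` metadata table lists the
-- data files behind the current snapshot, so this count should drop to 1
-- if the rewrite really packed everything into a single file.
SELECT count(*) AS data_file_count
FROM my_iceberg.temp_table_1.files;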
INSERT INTO my_iceberg.temp_table_2 (
  date,
  region,
  country,
  total,
  total_amount
)
SELECT
  date,
  region,
  country,
  total,
  lag(total, 12) OVER (
    PARTITION BY region, country
    ORDER BY date
  ) AS total_amount
FROM
  my_iceberg.temp_table_1;
-- Rewrite again
CALL {spark_catalog}.system.rewrite_data_files(
  table => 'my_iceberg.temp_table_2',
  options => map(
    'min-input-files', '1',
    'partial-progress.enabled', 'true'
  )
);
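To narrow things down, I figure a check like this (same column names as above) should tell me whether the NULLs are only the first 12 rows of each (region, country) group or literally every row:
SELECT
  region,
  country,
  count(*)                          AS rows_in_group,
  count_if(total_amount IS NULL)    AS null_lags
FROM my_iceberg.temp_table_2
GROUP BY region, country
ORDER BY rows_in_group;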
My idea:
- Separate the work into TWO temporary tables: (1) for the transformation, (2) for the lag() function.
- Rewrite temp_table_1 into a single file so that ALL values are available to temp_table_2 when it performs lag() (see the partition check at the end of the post).
However, it still returns null.
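To test whether the "split across tasks" part of my theory even applies, I'm thinking of a check like this (a sketch; spark_partition_id() is a Spark built-in, and it only reflects this particular scan, not necessarily the plan of the window query above):
SELECT
  spark_part,
  count(*) AS rows_in_part
FROM (
  SELECT spark_partition_id() AS spark_part
  FROM my_iceberg.temp_table_1
) AS t
GROUP BY spark_part
ORDER BY spark_part;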