I'm using snowflake and the snowpark.functions and I'm stuck on trying to generate the five number summary elements as part of a grouped table/dataframe, but particularly the upper and lower whiskers for my data in a groupby
statement.
I imported the library with import snowflake.snowpark.functions as F
. I've been able to use the below code fine to generate a temp table and use the .cache_result()
function fine; I'm now trying to add the five number summary elements and was using this users question here as a base.
trans_pairs = (
trans_times
.groupBy(['"source_country_code"', '"destination_country_code"', '"route_code"'])
.agg(
F.count('*').alias('"num_transactions"'),
F.min('"transfer_full_time"').alias('"min_trans_time"'),
F.max('"transfer_full_time"').alias('"max_trans_time"'),
F.median('"transfer_full_time"').alias('"median_trans_time"'),
F.approx_percentile('"transfer_full_time"', .25).alias('"lower_qt"'),
F.approx_percentile('"transfer_full_time"', .75).alias('"upper_qt"')
)
.withColumn('"iqr"', F.col('"upper_qt"') - F.col('"lower_qt"'))
#.withColumn('"up_whisk"', F.max(F.col('"upper_qt"')+(1.5*F.col('"iqr"'))))
#.withColumn('"low_whisk"', F.min(F.col('"lower_qt"')-(1.5*F.col('"iqr"'))))
.withColumn('"min_trans_days"', F.col('"min_trans_time"')/86400)
.withColumn('"max_trans_days"', F.col('"max_trans_time"')/86400)
)
Obviously obtaining median
was no issue and I was able to use .approx_percentile
to get lower and upper quartiles; subtracting these provides the interquartile range. However I'm having an issue when I try to add the two commented lines (preceded by #
in case not formatting correctly for viewers).
When I use the lines WITHOUT the F.max
nested I get numbers, negative numbers for the low_whisk
, although all data values are positive, but based on the calculation its understandable why. As per comments on the other user question, the max and min are required as the whisker from a box plot is an actual data point in the set, which may not fall right on the upper_qt + 1.5 * iqr
or lower_qt - 1.5 * iqr
.
The error thrown up is excessively long but summarised in the last few lines as:
SnowparkSQLException: (1304): 01ab257c-0202-5f2d-0000-063d2f08d3f6: 000979 (42601): SQL compilation error: ["values"."source_country_code"] is not a valid group by expression
I understand that its saying I'm trying to group something that can't be but not sure what or how to resolve this as without these lines it executes fine.
As always, any help is greatly appreciated!