I have a table with 6,070,800 records after filtering on a particular sff_file_id (see the schema screenshot below). There are 299 measure columns in total: measure1, measure2, ..., measure299.
I am reading the table as a DataFrame and then using a stack operation to unpivot the measures into attribute_name and attribute_value columns. The stack operation:
def get_stack_expr(dataframe, pattern, attribute_name="attribute_name", attribute_value="attribute_value"):
    # Pick every column whose name starts with the pattern (case-insensitive)
    cols = [x for x in dataframe.columns if x.lower().startswith(pattern.lower())]
    len_cols = len(cols)
    # Build a stack() SQL expression that unpivots the columns into (name, value) pairs
    str_stack = f"stack({len_cols}, " + ", ".join([f"'{x}', `{x}`" for x in cols]) + f") as ({attribute_name}, {attribute_value})"
    return str_stack
df = df.select(*cols_to_select, F.expr(get_stack_expr(df, 'measure')))
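For reference, here is a minimal sketch of what the helper produces on a toy DataFrame (the toy column names and values are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical miniature of the real table: one key column plus measure columns
toy = spark.createDataFrame(
    [(1, 10.0, 20.0), (2, 30.0, 40.0)],
    ["sff_file_id", "measure1", "measure2"],
)

print(get_stack_expr(toy, "measure"))
# stack(2, 'measure1', `measure1`, 'measure2', `measure2`) as (attribute_name, attribute_value)

toy.select("sff_file_id", F.expr(get_stack_expr(toy, "measure"))).show()
# +-----------+--------------+---------------+
# |sff_file_id|attribute_name|attribute_value|
# +-----------+--------------+---------------+
# |          1|      measure1|           10.0|
# |          1|      measure2|           20.0|
# |          2|      measure1|           30.0|
# |          2|      measure2|           40.0|
# +-----------+--------------+---------------+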
When I try to display the result, print its count, or save it into a Delta table, the job runs for more than 15 hours and then fails. Note that the stack explodes 6,070,800 rows × 299 measures into roughly 1.8 billion output rows.
How can I optimise this operation?