
I have a table with 6,070,800 records after filtering on a particular sff_file_id. There are 299 measure columns in total, named measure1, measure2, ..., measure299. (Schema screenshot of df omitted.)

I am reading it as a DataFrame, then using a stack operation to unpivot the measures and their values into attribute_name and attribute_value columns. The stack expression is built like this:

from pyspark.sql import functions as F

def get_stack_expr(dataframe, pattern, attribute_name="attribute_name", attribute_value="attribute_value"):
    # Pick up every column whose name starts with the pattern (case-insensitive)
    cols = [x for x in dataframe.columns if x.lower().startswith(pattern.lower())]
    len_cols = len(cols)
    # Build a SQL stack() expression that unpivots the matched columns
    # into (attribute_name, attribute_value) rows
    str_stack = f"stack({len_cols}, " + ", ".join([f"'{x}', `{x}`" for x in cols]) + f") as ({attribute_name}, {attribute_value})"
    return str_stack

df = df.select(*cols_to_select, F.expr(get_stack_expr(df, 'measure')))
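
For reference, here is a minimal sketch of what the helper produces on a toy DataFrame (the column names and values below are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with one key column and two measure columns
df_small = spark.createDataFrame([(1, 10.0, 20.0)], ["id", "measure1", "measure2"])

print(get_stack_expr(df_small, "measure"))
# stack(2, 'measure1', `measure1`, 'measure2', `measure2`) as (attribute_name, attribute_value)

# Each input row becomes one output row per matched measure column
df_small.select("id", F.expr(get_stack_expr(df_small, "measure"))).show()
# |  1|      measure1|           10.0|
# |  1|      measure2|           20.0|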

(Final schema screenshot: the cols_to_select columns plus attribute_name and attribute_value.)

When I try to display it, print its count, or save it into a Delta table, the job runs for more than 15 hours and then fails. (After stacking, the output is roughly 6,070,800 × 299 ≈ 1.8 billion rows.)

How can I optimise this operation?
