I have a table with 6,070,800 records after filtering on a particular sff_file_id (see the schema screenshot below). There are 299 measure columns in total: measure1, measure2, ..., measure299.
I am reading the table as a DataFrame and then using a stack operation to unpivot the measures into attribute_name and attribute_value columns. The stack operation:
def get_stack_expr(dataframe, pattern, attribute_name="attribute_name", attribute_value="attribute_value"):
    # Pick every column whose name starts with the pattern (case-insensitive)
    cols = [x for x in dataframe.columns if x.lower().startswith(pattern.lower())]
    len_cols = len(cols)
    # Build a stack() SQL expression that unpivots the columns into (name, value) pairs
    str_stack = f"stack({len_cols}, " + ", ".join([f"'{x}', `{x}`" for x in cols]) + f") as ({attribute_name}, {attribute_value})"
    return str_stack
df = df.select(*cols_to_select, F.expr(get_stack_expr(df, 'measure')))
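For reference, here is a minimal sketch of what the helper produces on a toy DataFrame (the toy column names and values are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical miniature of the real table: one key column plus measure columns
toy = spark.createDataFrame(
    [(1, 10.0, 20.0), (2, 30.0, 40.0)],
    ["sff_file_id", "measure1", "measure2"],
)

print(get_stack_expr(toy, "measure"))
# stack(2, 'measure1', `measure1`, 'measure2', `measure2`) as (attribute_name, attribute_value)

toy.select("sff_file_id", F.expr(get_stack_expr(toy, "measure"))).show()
# +-----------+--------------+---------------+
# |sff_file_id|attribute_name|attribute_value|
# +-----------+--------------+---------------+
# |          1|      measure1|           10.0|
# |          1|      measure2|           20.0|
# |          2|      measure1|           30.0|
# |          2|      measure2|           40.0|
# +-----------+--------------+---------------+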
When I try to display the result, print its count, or save it into a Delta table, the job runs for more than 15 hours and then fails. Note that the stack explodes 6,070,800 rows × 299 measures into roughly 1.8 billion output rows.
How can I optimise this operation?