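I have the following code, which adds one new column for every combination of two or more columns in my DataFrame, each holding the row-wise sum of those columns. Because it uses combinations rather than permutations, no set of columns appears twice: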
import itertools as it
import pandas as pd
df = pd.DataFrame({
'a': [3,4,5,6,3],
'b': [5,7,1,0,5],
'c': [3,4,2,1,3],
'd': [2,0,1,5,9]
})
orig_cols = df.columns
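# r is the number of columns in each combination: 2, 3, ..., all of them
for r in range(2, df.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        # .loc needs a list here; a tuple would be read as a single column key
        df["_".join(cols)] = df.loc[:, list(cols)].sum(axis=1)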
df
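To make the expected result concrete, here is a minimal sketch of the combination step on its own, without pandas, applied to the first row of the data above. The names `first_row` and `new_cols` are introduced here only for illustration and are not part of my actual code.

```python
import itertools as it

orig_cols = ["a", "b", "c", "d"]
# First row of the DataFrame above (illustrative copy of the data)
first_row = {"a": 3, "b": 5, "c": 3, "d": 2}

# One entry per combination of 2, 3, and 4 columns: name -> row-wise sum
new_cols = {
    "_".join(cols): sum(first_row[c] for c in cols)
    for r in range(2, len(orig_cols) + 1)
    for cols in it.combinations(orig_cols, r)
}

print(list(new_cols))   # a_b, a_c, a_d, b_c, ... , a_b_c_d
print(len(new_cols))    # 6 pairs + 4 triples + 1 quadruple = 11
```

So for four input columns the result should have 11 extra columns, named by joining the source column names with underscores.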
I need to generate the same result in PySpark using a UDF. What would the equivalent PySpark code be?