I'm trying to create a max column with this code. The sum column works:

sum:

for col in list_names:
    for month in [3, 6, 9, 12]:
        companies = companies.withColumn(
            col + 'sum_' + str(month) + '_months',
            sum(companies[col + ult_pats2[month_ix - ix]] for ix in range(month)))

max:

for col in list_names:
    for month in [3, 6, 9, 12]:
        companies = companies.withColumn(
            col + 'max_' + str(month) + '_months',
            max(companies[col + ult_pats2[month_ix - ix]] for ix in range(month)))

The error message is:

"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions"

ochs.tobi

2 Answers


This looks to me like the max function is being overwritten by some other package. Try:

import pyspark.sql.functions as f

And then use the qualified reference f.max(...).
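A minimal sketch of that usage, assuming a toy DataFrame: note that f.max is an aggregate, i.e. it returns the maximum of one column across rows.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 4), (7, 2)], ["a", "b"])

# f.max aggregates a single column over all rows
df.agg(f.max("a")).show()  # largest value of "a": 7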

michalrudko
• This option, f.max(), gets the max value of one column, but it doesn't work across multiple columns. – ecan Aug 21 '19 at 07:06

Finally, it worked with this code, using sf.greatest:

import pyspark.sql.functions as sf

for col in list_names:
    for month in [3, 6, 9, 12]:
        companies = companies.withColumn(
            'max_' + col + str(month) + '_months',
            sf.greatest(*[sf.col(col + ult_pats2[month_ix - ix]) for ix in range(month)]))
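As a self-contained illustration with a hypothetical toy DataFrame, sf.greatest picks the largest value per row across the listed columns:

import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 5, 3), (9, 2, 4)], ["m1", "m2", "m3"])

# greatest() is evaluated row-wise across the listed columns
df.withColumn("max_3_months", sf.greatest("m1", "m2", "m3")).show()
# row (1, 5, 3) -> 5; row (9, 2, 4) -> 9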
ecan