from pyspark.sql.window import Window
from pyspark.sql import functions as F
maxcol = F.udf(lambda row: F.max(row))
temp = [("ID1", '2019-01-01', '2019-02-01'), ("ID2", '2018-01-01', '2019-05-01'), ("ID3", '2019-06-01', '2019-04-01')]
t1 = spark.createDataFrame(temp, ["ID", "colA", "colB"])
maxDF = t1.withColumn("maxval", maxcol(F.struct([t1[x] for x in t1.columns[1:]])))

All I want is a new column with the maximum date from colA and colB. When I run this code and call maxDF.show(), I get the error below:

 'NoneType' object has no attribute '_jvm'
Kshitij Agrawal
  • Possible duplicate of [how to calculate max value in some columns per row in pyspark](https://stackoverflow.com/questions/44833836/how-to-calculate-max-value-in-some-columns-per-row-in-pyspark) – pault Sep 26 '19 at 17:22
    You don't need a `udf` for this. Use [`pyspark.sql.functions.greatest`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.greatest). In your case, you're probably looking for `maxDF = t1.withColumn("maxval", F.greatest(*t1.columns[1:]))` – pault Sep 26 '19 at 17:24
    Your code doesn't work because you're using `pyspark.sql.functions.max` when you should be using `__builtin__.max` – pault Sep 26 '19 at 17:26
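For illustration only, a minimal sketch of the udf route from pault's last comment, assuming the same `t1` as in the question (though `greatest` is the simpler fix):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Python's builtin max works on the Row passed into the udf; F.max is an
# aggregate function that needs the JVM-backed SparkContext, which is not
# available inside a udf on the executors, hence the '_jvm' error.
maxcol = F.udf(lambda row: max(row), StringType())
maxDF = t1.withColumn("maxval", maxcol(F.struct([t1[x] for x in t1.columns[1:]])))
maxDF.show()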

2 Answers


Try something similar to this code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Python Spark").getOrCreate()

temp = [("ID1", '2019-01-01', '2019-02-01'), ("ID2", '2018-01-01', '2019-05-01'),
        ("ID3", '2019-06-01', '2019-04-01')]

t1 = spark.createDataFrame(temp, ["ID", "colA", "colB"])

maxDF = t1.withColumn("maxval", F.greatest(t1["colA"], t1["colB"]))
maxDF.show()

Output:

+---+----------+----------+----------+
| ID|      colA|      colB|    maxval|
+---+----------+----------+----------+
|ID1|2019-01-01|2019-02-01|2019-02-01|
|ID2|2018-01-01|2019-05-01|2019-05-01|
|ID3|2019-06-01|2019-04-01|2019-06-01|
+---+----------+----------+----------+
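As pault noted in the comments, greatest also accepts any number of columns, so a sketch for the general case (assuming every column after ID holds a comparable date string):

# unpack every column except the first (ID) into greatest
maxDF = t1.withColumn("maxval", F.greatest(*t1.columns[1:]))
maxDF.show()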
oetzi

You could also try something like this: use to_date() to convert the date strings to Date objects first, then compare them:

from pyspark.sql.functions import *

temp = [("ID1", '2019-01-01', '2019-02-01'), ("ID2", '2018-01-01', '2019-05-01'), ("ID3", '2019-06-01', '2019-04-01')]
t1 = spark.createDataFrame(temp, ["ID", "colA", "colB"])
t2 = t1.select("ID", to_date(t1.colA).alias('colADate'), to_date(t1.colB).alias('colBDate'))
t3 = t2.withColumn('maxDateFromRow', when(t2.colADate > t2.colBDate, t2.colADate).otherwise(t2.colBDate))

t3.show()

Which returns:

+---+----------+----------+--------------+
| ID|  colADate|  colBDate|maxDateFromRow|
+---+----------+----------+--------------+
|ID1|2019-01-01|2019-02-01|    2019-02-01|
|ID2|2018-01-01|2019-05-01|    2019-05-01|
|ID3|2019-06-01|2019-04-01|    2019-06-01|
+---+----------+----------+--------------+
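A possible variation, just a sketch rather than part of the original answer: combine to_date() with greatest() so the row-wise maximum is taken on Date values in a single expression:

# greatest() returns the largest of its arguments per row, here two Date columns
t4 = t1.withColumn("maxDateFromRow", greatest(to_date(t1.colA), to_date(t1.colB)))
t4.show()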
Zac Roberts
  • [Do not do `from pyspark.sql.functions import *`](https://stackoverflow.com/a/55711135/5858851). – pault Sep 26 '19 at 21:47