
I have a PySpark DataFrame that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku, and then calculate the min and max dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date':'max'}) \
    .limit(10) \
    .show()

the behavior is the same as in Pandas: I only get the sku and max(date) columns, since a Python dict literal keeps only the last value for a duplicated key. In Pandas I would normally do the following to get the results I want:

df_testing.groupBy('sku') \
    .agg({'date': ['min','max']}) \
    .limit(10) \
    .show()

However, in PySpark this does not work, and I get a java.util.ArrayList cannot be cast to java.lang.String error. Could anyone please point me to the correct syntax?
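
For what it's worth, the dict form of agg expects a single function name string per column, so it seems to be the list of functions that Spark rejects. A minimal form that does run (keeping only the min):

df_testing.groupBy('sku') \
    .agg({'date': 'min'}) \
    .show()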

Thanks.


1 Answer


You cannot express multiple aggregations on the same column with a dict. Use the column functions from pyspark.sql.functions instead:

>>> from pyspark.sql import functions as F
>>>
>>> df_testing.groupBy('sku').agg(F.min('date'), F.max('date'))
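
To get friendlier column names and display the result, you can alias each aggregate (a sketch, assuming the df_testing DataFrame from the question; the min_date/max_date names are just examples):

>>> df_testing.groupBy('sku') \
...     .agg(F.min('date').alias('min_date'), F.max('date').alias('max_date')) \
...     .show()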
  • Thanks! This solves the problem. Initially I tried `from pyspark.sql.functions import min, max` and the approach you propose, just without the F. Maybe Python was confusing the SQL functions with the native ones. – masta-g3 Oct 27 '16 at 01:31
  • I mean, I wouldn't exactly call this an answer, since it doesn't solve the issue of having to essentially denormalize your dictionary. Oh my mistake! It does solve the original question, but the denormalization issue still stands. – information_interchange Jul 24 '19 at 20:08
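
As for the denormalization point in the comments: a pandas-style mapping of column names to lists of function names can be expanded into Column expressions and passed to agg. A sketch, assuming every name in the mapping is a function in pyspark.sql.functions:

from pyspark.sql import functions as F

# pandas-style spec, as in the question
agg_spec = {'date': ['min', 'max']}

# expand the spec into one aliased Column expression per (column, function) pair
exprs = [getattr(F, fn)(col).alias('{}({})'.format(fn, col))
         for col, fns in agg_spec.items()
         for fn in fns]

df_testing.groupBy('sku').agg(*exprs).show()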