2

I have a Scala dataframe with two columns:

  • id: String
  • updated: Timestamp

From this dataframe I just want to get the latest date; at the moment I use the following code:

df.agg(max("updated")).head()
// returns a row

I've just read about the collect() function, which I've been told is safer to use for this kind of problem (when my code runs as a job it appears it is not aggregating the max over the whole dataset, although it looks perfectly fine when it runs in a notebook), but I don't understand how it should be used.

I found an implementation like the following, but I could not figure out how it should be used...

df1.agg({"x": "max"}).collect()[0]

I tried it like the following:

df.agg(max("updated")).collect()(0)

Without (0) it returns an Array, which actually looks good. So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps. My question now is, how is collect() actually supposed to work in such a situation?

Thanks a lot in advance!

Eve
  • 604
  • 8
  • 26

2 Answers

1

I'm assuming that you are talking about a Spark dataframe (not a plain Scala collection). If you just want the latest date (only that column) you can do:

df.select(max("updated"))

You can see what's inside the dataframe with df.show(). Since dataframes are immutable, you need to assign the result of the select to another variable or chain the show() after the select(). This will return a dataframe with just one row holding the max value of the "updated" column. To answer your question:
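For example, a minimal sketch (the variable name maxDf is just illustrative, not from the original post):

import org.apache.spark.sql.functions.max

val maxDf = df.select(max("updated")) // transformation: builds a new single-row dataframe
maxDf.show()                          // action: prints that one row with the latest timestamp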

So the idea is that we should apply the aggregation on the whole dataset loaded in the driver, not just the partitioned version, otherwise it seems not to retrieve all the timestamps

When you select on a dataframe, Spark will select data from the whole dataset; there is not a partitioned version and a driver version. Spark will shard your data across your cluster, and all the operations that you define will be done on the entire dataset.

My question now is, how is collect() actually supposed to work in such a situation?

The collect operation converts a Spark dataframe into an array (which is not distributed), and that array will live on the driver node. Bear in mind that if your dataframe size exceeds the memory available in the driver you will get an OutOfMemoryError.

In this case if you do:

df.select(max("Timestamp")).collect().head

Your DF (which contains only one row with one column, your date) will be converted to a Scala array. In this case it is safe because the select(max()) will return just one row.
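If you need the value itself rather than a Row, you can pull it out of the collected row. A sketch of one way to do it, assuming "updated" is a Spark TimestampType column (which maps to java.sql.Timestamp):

import java.sql.Timestamp
import org.apache.spark.sql.functions.max

// collect() brings the single aggregated row to the driver;
// getAs extracts the value of the first (and only) column.
// Note: the result is null if the dataframe is empty.
val latest: Timestamp = df.select(max("updated")).collect().head.getAs[Timestamp](0)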

Take some time to read more about Spark dataframes/RDDs and the difference between transformations and actions.

LFilips
  • 96
  • 1
  • 8
  • There is one thing I didn't understand... why does it work seamlessly when it runs in a Databricks notebook but not as a job, when I simply used head and not collect? We will see, I'm going to test this soon as a job :) Thanks – Eve Jan 15 '20 at 09:19
  • I'm not 100% sure I understood the question :D. Databricks notebooks under the hood are running jobs on a Spark cluster, but everything is handled by the Databricks runtime, which is highly optimised. Since you asked why it is not working when run as a "job", it makes me think that you tried to do the same thing with a custom job (not Databricks) deployed on a cluster. If this is the case, there can be lots of reasons (unhealthy cluster, not enough memory, no available executors, wrong Spark version at runtime, etc.). – LFilips Jan 15 '20 at 13:50
  • Thank you for the info, I can also see now why this question is so noob, but at first I actually thought it was because I used the wrong implementation. :) – Eve Jan 16 '20 at 15:38
0

It sounds weird. First of all, you don't need to collect the dataframe to get the last element of a sorted dataframe. There are many answers to this topic:

How to get the last row from DataFrame?
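For instance, one common pattern from those answers is to sort descending and take only the first row; a sketch (not necessarily the most efficient variant for very large data):

import org.apache.spark.sql.functions.col

// order by "updated" descending and fetch only the first row;
// only that single row is sent to the driver, not the whole dataframe
val latestRow = df.orderBy(col("updated").desc).head()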

Emiliano Martinez
  • 4,073
  • 2
  • 9
  • 19
  • I thought the same, but when I run this max aggregation as a job, it doesn't load all the rows in the driver, and it actually doesn't return the latest entry... It does sound weird... – Eve Jan 14 '20 at 10:30
  • The max aggregation function is performed in the executors, not in the driver. There is no need to load the dataframe in the driver at all. – Emiliano Martinez Jan 14 '20 at 10:34