I have a scala dataframe with two columns:
- id: String
- updated: Timestamp
From this dataframe I just want to get out the latest date, for which I use the following code at the moment:
df.agg(max("updated")).head()
// returns a row
I've just read about the collect() function, which I'm told to be safer to use for such a problem - when it runs as a job, it appears it is not aggregating the max on the whole dataset, it looks perfectly fine when it is running in a notebook -, but I don't understand how it should be used.
I found an implementation like the following, but I could not figure how it should be used...
df1.agg({"x": "max"}).collect()[0]
I tried it like the following:
df.agg(max("updated")).collect()(0)
Without (0) it returns an Array, which actually looks good. So idea is, we should apply the aggregation on the whole dataset loaded in the drive, not just the partitioned version, otherwise it seems to not retrieve all the timestamps. My question now is, how is collect() actually supposed to work in such a situation?
Thanks a lot in advance!