I want to get the last observation at the end of each minute for each stock. My high-frequency dataframe looks like:
+-------+-------+------+--------+--------+-------+
| stock | date  | hour | minute | second | price |
+-------+-------+------+--------+--------+-------+
| VOD   | 01-02 | 10   | 13     | 11     | 85.35 |
| VOD   | 01-02 | 10   | 13     | 12     | 85.75 |
| VOD   | 01-02 | 10   | 14     | 09     | 84.35 |
| VOD   | 01-02 | 10   | 14     | 16     | 82.85 |
| VOD   | 01-02 | 10   | 14     | 26     | 85.65 |
| VOD   | 01-02 | 10   | 15     | 07     | 84.35 |
| ...   | ...   | ...  | ...    | ...    | ...   |
| ABC   | 01-02 | 11   | 13     | 11     | 25.35 |
| ABC   | 01-02 | 11   | 13     | 15     | 25.39 |
| ABC   | 01-02 | 11   | 13     | 19     | 25.26 |
+-------+-------+------+--------+--------+-------+
The desired output should look like this:
+-------+-------+------+--------+-------+
| stock | date  | hour | minute | price |
+-------+-------+------+--------+-------+
| VOD   | 01-02 | 10   | 13     | 85.75 |
| VOD   | 01-02 | 10   | 14     | 85.65 |
| VOD   | 01-02 | 10   | 15     | 84.35 |
| VOD   | 01-02 | 10   | 16     | 85.75 |
| ...   | ...   | ...  | ...    | ...   |
| ABC   | 01-02 | 11   | 13     | 25.26 |
+-------+-------+------+--------+-------+
I know I probably have to use `partitionBy` and `orderBy` to get this result, but I am confused about how the two differ. I am familiar with `GROUP BY` in SQL. Which of the two is closer to `GROUP BY`? Can someone help?
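
To make the logic I am after concrete, here is a minimal plain-Python sketch (not Spark code) using the sample rows above. It assumes the grouping columns are stock, date, hour and minute, and that "last" means the row with the largest second within each group. The helper name `last_per_minute` is hypothetical. My guess is that the grouping step is what `partitionBy` would express and the sort on second is what `orderBy` would express, but that mapping is exactly what I am unsure about.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the table above:
# (stock, date, hour, minute, second, price)
rows = [
    ("VOD", "01-02", 10, 13, 11, 85.35),
    ("VOD", "01-02", 10, 13, 12, 85.75),
    ("VOD", "01-02", 10, 14, 9, 84.35),
    ("VOD", "01-02", 10, 14, 16, 82.85),
    ("VOD", "01-02", 10, 14, 26, 85.65),
    ("VOD", "01-02", 10, 15, 7, 84.35),
    ("ABC", "01-02", 11, 13, 11, 25.35),
    ("ABC", "01-02", 11, 13, 15, 25.39),
    ("ABC", "01-02", 11, 13, 19, 25.26),
]


def last_per_minute(rows):
    """Hypothetical helper: keep the latest-second price per (stock, date, hour, minute)."""
    group_key = itemgetter(0, 1, 2, 3)  # the grouping ("partition") columns
    # Sort by group first, then by second (index 4), so each group's
    # final row is its latest observation.
    ordered = sorted(rows, key=lambda r: (group_key(r), r[4]))
    result = []
    for key, group in groupby(ordered, key=group_key):
        last_row = list(group)[-1]
        result.append(key + (last_row[5],))  # drop second, keep price
    return result


result = last_per_minute(rows)
for row in result:
    print(row)
```

Each output row has the form (stock, date, hour, minute, price), one per stock and minute, which is the shape of the desired table above (the rows come out sorted by stock here, so ABC appears before VOD).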