
My question heading might not be accurate, but I hope I can explain the problem. I have a data frame like the one below:

DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629624|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629626|^|4295858941|^|null|^|37|^|null|^|D|!|
SelfSourcedPublic|^|2001|^|1510044629596|^|4295858941|^|19|^|5|^|1|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629597|^|4295858941|^|20|^|5|^|2|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|21|^|5|^|2|^|I|!|
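For reference, data in this format could be loaded into a DataFrame like this (a minimal sketch; the file path is hypothetical, and it assumes `|^|` is the field delimiter and each row ends with `|!|`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("load").getOrCreate()
import spark.implicits._

// Read as plain text, drop the header line, strip the trailing |!|,
// and split on the literal |^| delimiter.
val df2result = spark.read.textFile("path/to/data.txt")   // hypothetical path
  .filter(!_.startsWith("DataPartition_1"))
  .map { line =>
    val a = line.stripSuffix("|!|").split("\\|\\^\\|")
    (a(0), a(1), a(2), a(3), a(4), a(5), a(6), a(7))
  }
  .toDF("DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId",
        "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1")
```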

So the primary key for my data frame is the combination of

"OrganizationId", "AnnualPeriodId", "InterimPeriodId"

Below is my code to get the latest record for each primary key, based on the time stamp:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.types.LongType

val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result
  .withColumn("rank", rank().over(windowSpec))
  .filter($"rank" === 1)
  .drop("rank", "TimeStamp")
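The behaviour can be reproduced in isolation (a sketch, assuming a local SparkSession; the column subset and values are taken from the sample rows above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

val df = Seq(
  ("4295858941", "35", "36", "1510044629601"),
  ("4295858941", null, "36", "1510044629625")
).toDF("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "TimeStamp")

val w = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)

// Both rows get rank 1, because the null AnnualPeriodId row
// forms its own partition rather than joining the non-null one.
df.withColumn("rank", rank().over(w)).filter($"rank" === 1).show()
```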

Now my issue is that I sometimes get null in some of the primary key columns, as in the record with time stamp 1510044629624.

My requirement is that the two records below share the same primary key columns OrganizationId and InterimPeriodId, but one of them has null in AnnualPeriodId. In this case I still need only one record, the one with the latest time stamp:

SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

I should get only:

SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|

My current code outputs both records because of the null.

I hope my question is clear.

Sudarshan kumar
  • You are dropping timestamp column but how come you have that column in the final df? and in what sense(logic) the record with 1510044629624 timestamp should be the latest one? – Ramesh Maharjan Nov 14 '17 at 05:31
  • the two records don't have the same primary key except the organisationId. AnnualPeriodId and InterimPeriodId both are different. So you must be saying according to OrganizationId only that the record with 1510044629624 timestamp is latest. But there are other records with that OrganizationId. please explain that. – Ramesh Maharjan Nov 14 '17 at 05:43
  • @RameshMaharjan `OrganizationId` and `InterimPeriodId ` is same for both records ...Only `AnnualPeriodId` is null ..Can you please refresh the question once ... – Sudarshan kumar Nov 14 '17 at 05:48
  • you just edited the question :) but you forgot to change the timestamp to 1510044629625 instead of 1510044629624 – Ramesh Maharjan Nov 14 '17 at 05:54
  • if that's the case then you need to change the primary keys you used in partitionBy to OrganizationId and InterimPeriodId only. It should work then. Please try that and let me know – Ramesh Maharjan Nov 14 '17 at 05:57
  • @RameshMaharjan yes it is working ...Thank you ..I will test all use cases now ...All my project architecture and problem is solved by you ..Thank you so much ... – Sudarshan kumar Nov 14 '17 at 06:21
  • @RameshMaharjan There is one last question which is pending now ...Only you can understand now ..If i create fresh question no one will understand that. and i get down voted .Can i update one question ..There is only one condition that needs to be added in the solution –  Nov 14 '17 at 06:24
  • just ask it logically in another question and give me the link. I shall try to answer that if i have some spare time. Meanwhile can I answer this question so that you can accept and upvote and I too get benefited :) – Ramesh Maharjan Nov 14 '17 at 07:21
  • @RameshMaharjan yes sure – Sudarshan kumar Nov 14 '17 at 07:41
  • done :) please ask another question and let me know if you don't get other answers :) – Ramesh Maharjan Nov 14 '17 at 07:48

1 Answer


From what I understand of your question, the problem is that you are using an extra column in your primary key.

The AnnualPeriodId column contains nulls, and since you use that field in partitionBy, each null ends up in a separate group and thus a separate row:

val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

So the solution is to remove it from partitionBy, so that the line above becomes:

val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc) 

I hope this solves the issue you are having.
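Applied to the two conflicting sample rows, the corrected window keeps only the latest one (a minimal sketch, assuming a local SparkSession):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder.master("local[*]").appName("fixed").getOrCreate()
import spark.implicits._

val df = Seq(
  ("4295858941", "35", "36", "1510044629601", "I"),
  ("4295858941", null, "36", "1510044629625", "D")
).toDF("OrganizationId", "AnnualPeriodId", "InterimPeriodId", "TimeStamp", "FFAction_1")

val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)

// Only the row with time stamp 1510044629625 survives.
df.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank").show()
```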

Ramesh Maharjan
  • There is one problem in this. My primary key is a combination of three columns, but I am doing partitionBy on two columns. In that case, when two records have two columns the same and the third different, they will also be considered as one. How can I handle this? – Sudarshan kumar Nov 14 '17 at 14:15
  • For example, the last three rows have the same `"OrganizationId", "InterimPeriodId"` but different `AnnualPeriodId`, so when I do partitionBy they will be merged into one record, whereas in this case I need all three because it is not null – Sudarshan kumar Nov 14 '17 at 14:25
  • So there are two cases. When I have Action `D`, I have to consider `partitionBy("OrganizationId", "InterimPeriodId")`, and when it is `I`, I have to consider `partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId")` – Sudarshan kumar Nov 14 '17 at 14:28
  • for that I would suggest you to separate the dataframe with D and I into two dataframes and apply the logics you explained in above comments and finally union them. That should solve the issue you are facing – Ramesh Maharjan Nov 14 '17 at 15:23
  • So you are saying to create one data frame which has only D and another which has only I, then compare each separately with DF1, then union both outputs to create the final output – Sudarshan kumar Nov 14 '17 at 16:07
  • Ok, but will the order of the records be maintained? For example, Action with `I` will be in one file and Action with `D` in another file. Then how will the order be maintained? For that I have to sort based on Timestamp, and whichever comes latest, that operation has to happen. – Sudarshan kumar Nov 14 '17 at 17:50
  • you can order it at the last. :) you can ask another question and I shall answer that. :) – Ramesh Maharjan Nov 14 '17 at 17:57
  • Sure ...Created another one ...https://stackoverflow.com/questions/47292886/implementing-outer-join-in-spark-data-frame-with-conditions – Sudarshan kumar Nov 14 '17 at 18:30
  • Hi Sir if you get some time please have a look at this question ..https://stackoverflow.com/questions/48670551/adding-part-of-the-parent-schema-column-to-child-in-nested-json-in-spark-data-fr – Sudarshan kumar Feb 08 '18 at 04:15
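The split-and-union approach discussed in the comments above could be sketched like this (names such as `df2result` follow the question; this is an illustration under those assumptions, not tested against the full data):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.types.LongType

// Split by action: inserts keep the full three-column key,
// deletes (which carry nulls) use the two-column key.
val dfI = df2result.filter($"FFAction_1".contains("I"))
val dfD = df2result.filter($"FFAction_1".contains("D"))

val wI = Window.partitionBy("OrganizationId", "AnnualPeriodId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)
val wD = Window.partitionBy("OrganizationId", "InterimPeriodId")
  .orderBy($"TimeStamp".cast(LongType).desc)

val latestI = dfI.withColumn("rank", rank().over(wI)).filter($"rank" === 1).drop("rank")
val latestD = dfD.withColumn("rank", rank().over(wD)).filter($"rank" === 1).drop("rank")

// Union the two results and re-order by time stamp at the end,
// as suggested in the comments.
val result = latestI.union(latestD).orderBy($"TimeStamp".cast(LongType))
```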