
I have the following table of data (a Delta table, which is mapped as a Hive table):

UtilEvents:

-----------------------------------------------------------------------------
SerialNumber    EventTime               UseCase     RemoteHost  RemoteIP
-----------------------------------------------------------------------------
131058          2022-12-02T00:31:29     Send        Host1       RemoteIP1
131058          2022-12-21T00:33:24     Receive     Host1       RemoteIP1
131058          2022-12-22T01:35:33     Send        Host1       RemoteIP1
131058          2022-12-20T01:36:53     Receive     Host1       RemoteIP1
131058          2022-12-11T00:33:28     Send        Host2       RemoteIP2
131058          2022-12-15T00:35:18     Receive     Host2       RemoteIP2
131058          2022-12-12T02:29:11     Send        Host2       RemoteIP2
131058          2022-12-01T02:30:56     Receive     Host2       RemoteIP2

I need a result set grouped by UseCase and RemoteHost, but with the maximum value of EventTime. So the result should look something like:

Result_UtilEvents:

----------------------------------------------------------------
SerialNumber    EventTime               UseCase     RemoteHost
----------------------------------------------------------------
131058          2022-12-21T00:33:24     Receive     Host1
131058          2022-12-22T01:35:33     Send        Host1
131058          2022-12-15T00:35:18     Receive     Host2
131058          2022-12-12T02:29:11     Send        Host2

Could you suggest an efficient Databricks SQL query that gives this result?

PS: Intermediate DataFrame results cannot be used in this case; it has to be pure SQL.

Ganesha

1 Answer


I think you just need to GROUP BY and take MAX(EventTime) together with the columns you are grouping on. I added SerialNumber to the GROUP BY as well, since it is not clear how that column should be treated.

import datetime
import pyspark.sql.functions as F

# Recreate the sample data from the question
x = [
    (131058, datetime.datetime(2022, 12, 2, 0, 31, 29), "Send", "Host1", "RemoteIP1"),
    (131058, datetime.datetime(2022, 12, 21, 0, 33, 24), "Receive", "Host1", "RemoteIP1"),
    (131058, datetime.datetime(2022, 12, 22, 1, 35, 33), "Send", "Host1", "RemoteIP1"),
    (131058, datetime.datetime(2022, 12, 20, 1, 36, 53), "Receive", "Host1", "RemoteIP1"),
    (131058, datetime.datetime(2022, 12, 11, 0, 33, 28), "Send", "Host2", "RemoteIP2"),
    (131058, datetime.datetime(2022, 12, 15, 0, 35, 18), "Receive", "Host2", "RemoteIP2"),
    (131058, datetime.datetime(2022, 12, 12, 2, 29, 11), "Send", "Host2", "RemoteIP2"),
    (131058, datetime.datetime(2022, 12, 1, 2, 30, 56), "Receive", "Host2", "RemoteIP2")
]
df = spark.createDataFrame(x, schema=["SerialNumber", "EventTime", "UseCase", "RemoteHost", "RemoteIp"])

# Expose the DataFrame as a temporary view so it can be queried with pure SQL
df.createOrReplaceTempView("test_table")

# Group by the key columns and take the latest EventTime per group
spark.sql(
    "select SerialNumber, Max(EventTime) as EventTime, UseCase, RemoteHost "
    "from test_table "
    "group by SerialNumber, UseCase, RemoteHost"
).show()

Output:

+------------+-------------------+-------+----------+
|SerialNumber|          EventTime|UseCase|RemoteHost|
+------------+-------------------+-------+----------+
|      131058|2022-12-22 01:35:33|   Send|     Host1|
|      131058|2022-12-21 00:33:24|Receive|     Host1|
|      131058|2022-12-12 02:29:11|   Send|     Host2|
|      131058|2022-12-15 00:35:18|Receive|     Host2|
+------------+-------------------+-------+----------+
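Since you need pure SQL against the Hive-mapped Delta table, the same aggregation can be run directly in Databricks SQL. A minimal sketch, assuming the table is registered in the metastore as UtilEvents (the ORDER BY is optional and only for readability):

    -- Latest event per (SerialNumber, UseCase, RemoteHost) group
    SELECT SerialNumber,
           MAX(EventTime) AS EventTime,
           UseCase,
           RemoteHost
    FROM UtilEvents
    GROUP BY SerialNumber, UseCase, RemoteHost
    ORDER BY RemoteHost, UseCase;

Note that a plain GROUP BY can only return the grouping columns and aggregates; if you also needed columns that are not part of the grouping key (for example RemoteIP without grouping on it), a window function such as ROW_NUMBER() over a partition would be the usual alternative.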
M_S